opinion: My Tableau usage, combined with Python

By Julien Hernandez Lallement, 2020-09-02, in category Opinion

dashboard, python, tableau

A few months ago, I started using Tableau, the famous data visualization software. Coming from a more pythonic/coding background, I was prepared to dislike it. I prefer to use Plotnine or Seaborn to make my plots, and to interact with them easily if I need to add new data or ideas. Hard-wiring visualizations, in some ways, goes against my regular work, which is very much based on research and exploration.

One thing I did not know was what I would end up using Tableau for. That is: making dashboards.

Dashboards. Get used to that word if you are coming from academia, because man, it's everywhere outside of it. And of course it is: everyone wants dashboards to see what's happening in the business. Thing is, in research, we don't have that (or at least I didn't...). You plan an experiment, collect the data for some weeks or months, analyze it, and eventually (maybe) publish it. No need for a dashboard to track KPIs (another word absent from academia), since your experiment has a beginning and an end. I mean sure, we all peek a little at the data to see whether it's going in the hypothesized direction, but normally you wait for the whole dataset to be there, and then run your analysis.

Anyway, so I had to make dashboards. And Tableau, for that purpose, really turned out to be a great tool, because it is user friendly (both for making and for using dashboards) and accessible to many users if you have Tableau Online. I did not test many other tools out there, although I did do a quick comparison with Google Data Studio; at first glance, Tableau seemed much more intuitive and easier to get started with than Data Studio, but don't take that for granted.

However, I must say that I was quite disappointed by a few features in Tableau where I expected a much higher level. While the visualization was great, the data navigation and computation capabilities felt relatively low level (at least for someone used to working directly with code on cloud computing resources).

Since I struggled a bit to find the right balance between what to do in Tableau and what to do beforehand, I thought I would share my impressions here.


Before I start, let me give some positive feedback (because I am going to focus on the things I struggled with and am unhappy with). Tableau really outputs nice data visualizations, with a lot of interactive power, and sexy shapes that will produce a "wow" effect. Like it or not, that wow effect is quite an important component when trying to secure funding or convince reluctant stakeholders, so I think it has its value. Once you get used to it, you can really produce very useful dashboards with little to no maintenance needed, as long as the data fed to them keeps the same structure.

I am quite happy with the result I came up with (which I cannot share here unfortunately), and many users were satisfied with the final product.

Now that being said, there are other aspects that I found quite annoying and unintuitive, so let's focus on that for the rest of the post.

A Tale of Two Major Issues

1. Loading datasets

So Tableau is great for visualization: it's beautiful, and with little knowledge of the software you can make nice-looking plots (unlike with Plotnine or Seaborn, which require coding skills).

However, there are a couple of things I found annoying when loading data...

  • 1) Modifying your dataset
    Say you want to add new columns (new KPIs) to your dataset because business requested it. What I often found is that Tableau will not recognize the dataset anymore if columns are added between existing columns. Playing around between Live and Extract might help solve the problem, but I am not sure how... What often helps is to use the Data Preparation option, since the code behind it tries to make sense of the dataset you are now using. It does not always work, but it's worth a try. Once that has run once, you can uncheck the box, and Tableau will keep on recognizing new columns.
    Generally, a good practice is to keep adding new columns at the end of your dataset (or so I was told by a Tableau specialist), and that should solve your problem (see the sketch after this list). But you might not want that column order, in which case good luck. This is particularly tiresome if your dashboard is incremental and new features need to be added over time.

    Altogether I do find these to be real setbacks, in particular in early phases of dashboarding where new KPIs come in relatively frequently.

  • 2) Loading time
    Another thing that I find suboptimal is the loading time once your dashboard is ready. It often takes long to load data (maybe 30s for a few thousand lines, and I have 32GB of RAM on my machine), and changing a few filters can take up to 15s, which is a lot when you are presenting the data.
    This might be due to the connector I use. A few attempts suggested that the speed was a bit (but not much) better with the Big Query connector than with Google Drive.
    Regardless, I would suggest you take that into account depending on the kind of use you will make of the dashboard (official presentation vs. separate consultation by different users). In the worst case, consider making a local copy of the data to avoid any lags.
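
As a small illustration of that "append at the end" practice: if you prepare the dataset in Python before feeding it to Tableau, plain column assignment on a pandas DataFrame appends the new KPI after the existing columns, so the structure Tableau already knows stays untouched (the KPI below is made up for illustration).

In [ ]:
import pandas as pd

# Existing dataset that the dashboard already consumes
df = pd.DataFrame({
    "sent": [61520, 55308, 61201],
    "opened": [15677, 17371, 17423],
})

# Hypothetical new KPI requested by the business: column assignment appends
# it at the end, so the columns Tableau already knows keep their position
df["unopened"] = df["sent"] - df["opened"]

print(df.columns.tolist())  # ['sent', 'opened', 'unopened']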

2. Preparing data

The data preparation abilities of Tableau are probably what I dislike the most about it. Sure, you can do some aggregations such as means and so on, and you can run something like a GroupBy (using what Tableau calls Level of Detail, LOD, expressions) by fixing some parameters and aggregating over variables. But which software cannot do that?

Say you want to compute an index over the last 12 months, over items whose names contain similar words to your items of interest and that share a few common parameters. In Python, a few lines of code would allow you to locate that data, do whatever computation you can imagine, and output an index you could use for comparison.
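
As a rough sketch (the column names and the index definition are invented for illustration, not taken from a real dashboard), locating that subset and computing an index in pandas could look like this:

In [ ]:
import pandas as pd

# Hypothetical item-level data: a date, an item name, a category and a value
df = pd.DataFrame({
    "date": pd.to_datetime(["2019-10-01", "2020-03-15", "2020-08-20", "2018-01-05"]),
    "item": ["blue shirt", "blue shirt slim", "red shirt", "blue shirt"],
    "category": ["apparel", "apparel", "apparel", "apparel"],
    "value": [10.0, 12.0, 9.0, 20.0],
})

# Keep the last 12 months, items whose name contains the words of interest,
# and items sharing a common parameter (here, the category)
cutoff = df["date"].max() - pd.DateOffset(months=12)
subset = df[
    (df["date"] > cutoff)
    & df["item"].str.contains("blue shirt")
    & (df["category"] == "apparel")
]

# Any custom index you like, e.g. the subset mean normalized by the overall mean
index = subset["value"].mean() / df["value"].mean()
print(index)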

Now don't get me wrong: you can do that in Tableau as well. You would have to use date functions to screen the dataset for the last 12 months, then screen the words, and use IF clauses to select the cases where the parameters are common.

But the syntax would be heavier, and the computations slower, than doing this with code. Of course it is faster with code, you might say ;)

But it is not only about the syntax and the computations; it is about sanity checks, to make sure that you are doing the right thing. Coding gives you a flexibility in data visualization that Tableau simply does not have.

Add to that the fact that navigation can be quite slow (see the point above), and it overall becomes quite a heavy process to go through in Tableau.

Later in this post, I give a few tips on which data to prepare before it lands in Tableau, and how.

My Tableau Pipeline

Given the points above, I decided that the best approach would be to prepare all the data beforehand, and to deliver a final dataset to Tableau.

Getting data

Tableau has quite a few connectors that can be useful to interact live with datasets. In turn, that allows your visualizations to always use fresh data, provided the dataset is updated of course.

The two I mainly used are the Google Big Query and the Google Drive connectors.

[Screenshot: BigQuery.PNG, the Google BigQuery connection setup in Tableau]

As you can see above, you can set up a connection through your Google account to a table in Big Query. The process is extremely easy and allows you to access data on the cloud. Simply select your project and the dataset you want to connect to, and you are good to go! I find that solution quite elegant if all your data is on GCP.

[Screenshot: connectors.PNG, Tableau's connector list, including Google Drive]

The picture above also shows the connection to Google Drive. The idea here is to have a Google Sheet with your data in it, and to have Tableau connect to it. I find that solution quite nice if the users are unfamiliar with cloud technologies but still want to play around with the "raw" data themselves. Most of these users will know the Excel format (similar to Google Sheets), and will be able to download a version of the dataset for their own exploration.

This solution is also quite good if your organisation is not yet fully cloud based. You can get your data from different sources, munge it together to prepare a sheet on Drive, and share it and link it to Tableau.

Personally, I decided to use Google Drive to allow other users to access a friendly data format.

Note that these solutions are not incompatible. Actually, they are probably best used in combination. Using only Google Drive makes it easy for an inexperienced user to break something in the dataset on Drive, which will leave you with a broken dashboard until you fix it. Instead, exporting your data to Drive as a Google Sheet while feeding Tableau from Google Big Query means that only experienced cloud users can mess with the data feeding the dashboard, which is much more unlikely to happen ;)
But again, your data would have to be cloud based, which is not always the case.


Your pipeline should get the data where it is: on the cloud if you are fully cloud based (for example, containerize your code in a Docker image, and run the image so that it gets the data from Big Query); from internal platforms if you are not yet on the cloud.
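
For the cloud case, a minimal sketch of such a fetch with the google-cloud-bigquery client could look like this (the project, dataset and query are placeholders):

In [ ]:
from google.cloud import bigquery

def fetch_from_bigquery(project_id, query):
    """Run a query on BigQuery and return the result as a pandas DataFrame."""
    client = bigquery.Client(project=project_id)
    return client.query(query).to_dataframe()

# Hypothetical project and table names
df = fetch_from_bigquery(
    project_id="my-gcp-project",
    query="SELECT * FROM `my-gcp-project.marketing.communications`",
)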

In my case, since my organization is still migrating to the cloud, I centralized many of the required datasets on an FTP server, and fetched my data from there directly (for as long as the data is not cloud based).

I wrote the function below to get data from FTP servers:

In [3]:
import pysftp

def import_from_sftp(myHostname, myUsername, myPassword, filename, ftp_subfolder, localPath):
    """
    Import data from an sFTP server.
    :param myHostname (str): sFTP server host that needs to be reached
    :param myUsername (str): username to log in to the sFTP server
    :param myPassword (str): password used to log in
    :param filename (str): file to be fetched
    :param ftp_subfolder (str): subfolder location on the sFTP server ('' if the file is in the root)
    :param localPath (str): local destination folder of the fetched file
    :return: None (the file is copied locally)
    """
    cnopts = pysftp.CnOpts()
    cnopts.hostkeys = None

    # If the file is in the FTP root, do not use the subfolder input
    if ftp_subfolder == '':
        remoteFilePath = '/' + str(filename)
    else:
        remoteFilePath = '/' + str(ftp_subfolder) + '/' + str(filename)

    print(remoteFilePath)

    localFilePath = localPath + str(filename)

    # The context manager closes the connection automatically
    with pysftp.Connection(host=myHostname, username=myUsername, password=myPassword, cnopts=cnopts) as sftp:
        print("Connection successfully established. Copying file... ")
        print('Downloading...')
        sftp.get(remoteFilePath, localFilePath)  # get a remote file
    print('fetched ' + str(remoteFilePath) + ' from sFTP server.')
    print('')
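
A typical call would then look like this (host, credentials and file names are placeholders, of course):

In [ ]:
# Hypothetical values, shown for illustration only
import_from_sftp(
    myHostname="ftp.example.com",
    myUsername="julien",
    myPassword="********",
    filename="communications.csv",
    ftp_subfolder="exports",
    localPath="/home/julien/data/",
)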

Processing data

I won't tell you here how I process the data: that is just Python syntax driven by business requests and dashboard needs.
However, I suggest that you place your code in an online repository hosted on your company's preferred platform (BitBucket, GitHub, etc.). Even if you work alone! And of course, document it and keep it clean. Eventually, more people will start using that code and adding new KPIs to the different pipelines and dashboards. Using such a system from the start allows the code to be tracked and modified by different users. Using GitFlow allows you to test new releases and fix bugs easily, while the business makes new demands about which KPI should be added to dashboard X, and which dashboard should be added to the general portfolio.

That code could prepare one or several datasets, feeding into one or several dashboards.

Prepare most but not all data

Let's assume that you are preparing a Digital Communication dashboard for an e-commerce company. You might need to show data such as items sent, opening and clicking rates, number of buyers, number of products ordered, types of articles, monetary KPIs, etc.

What I recommend here is to prepare almost all of your data beforehand using Python or R, or whatever floats your boat. Almost, because I think it is better to run some light aggregations directly in Tableau.

Let's take the example of opening rates. You have a column with values varying between 0 and 1 (or 0 and 100 if you already converted to percentages). Good. Now, typically, you will have different opening rates, maybe for different communications sent, or different countries. If you pre-compute that opening rate from the raw data, what will happen is an automatic grand average: Tableau will take these pre-computed opening rates and run a second mean calculation on them.

Example:
I create some random numbers of items sent and opened, from which I calculate the opening rate:

In [5]:
import pandas as pd
# Actual raw data of absolute number sent and absolute numbers opened
data = {
    'sent': [61520, 55308, 61201, 153950, 102715, 185599, 189025, 353909, 102184, 554903], 
    'opened': [15677,17371,17423,24544,26086,51829,38389,45615,20699,86901]
        }
# Put data in dataframe
df = pd.DataFrame(data, columns=data.keys())
# Compute opening rates per row
df = df.assign(opening_rate = lambda d: d.opened/d.sent)
df.opening_rate
Out[5]:
0    0.254828
1    0.314078
2    0.284685
3    0.159428
4    0.253965
5    0.279253
6    0.203090
7    0.128889
8    0.202566
9    0.156606
Name: opening_rate, dtype: float64

We have our opening rates, now let's take:

  • Grand average: (that is the mean of these means)

and compare it to

  • Mean from raw data: the one we obtain when we use the sent and opened columns directly.
In [6]:
print("Grand average = " + str(df.opening_rate.mean()))
print("Average from raw data = " + str(df.opened.sum() / df.sent.sum()))
Grand average = 0.22373862524607718
Average from raw data = 0.18927174102929495

You can see that the difference is quite substantial, although both averages come from the same raw data. One result was processed twice, while the second one comes directly from the raw data.

That can be a problem in Tableau if you pre-compute many data features, thinking that it will make your life easier.
You could still include both the raw data and the pre-computed opening rates in your dataset, sure. But believe me, some people will use these opening rates to run grand averages without knowing it. It is better to include the raw data, leave out the pre-computed opening rates, and calculate them in Tableau using aggregation. If you need to calculate them for each item/country (as in the example above), just use an LOD expression with FIXED dimensions to compute your opening rates.
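
For reference, the "sum over sum per group" logic that the FIXED LOD is meant to reproduce in Tableau is easy to sanity check in pandas; here is a small sketch with a made-up country column:

In [ ]:
import pandas as pd

# Hypothetical raw data with a grouping dimension
df = pd.DataFrame({
    "country": ["DE", "DE", "FR", "FR"],
    "sent": [61520, 55308, 153950, 102715],
    "opened": [15677, 17371, 24544, 26086],
})

# Opening rate per country computed from the raw counts (sum over sum),
# not as a mean of pre-computed rates
per_country = df.groupby("country")[["sent", "opened"]].sum()
per_country["opening_rate"] = per_country["opened"] / per_country["sent"]
print(per_country["opening_rate"])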

Connecting the data to Tableau

Your pipeline should provide ways to export the data to the place where Tableau gets its data from.

Again, the best solution is to have your data on the cloud, your pipeline on the cloud, and Tableau connecting to the cloud :)

But if you're not there yet, there are a couple of other options you could use.

For instance, you could copy the processed data back to the FTP server and have Tableau connect to it.

  • One option would be to create a copy of the dataset in a location that can be accessed via Tableau Bridge or Tableau Online (see here for the connectivity of Tableau Bridge and Tableau Online).

  • A second option would be to automatically create a hyper extract of the data on the FTP server using the Hyper API. In this case you could schedule a script that uses the Hyper API to create an extract, which can then be refreshed/published to Tableau Online. If that interests you, check the documentation online (see the sketch just below).
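
I did not go down that road myself, but based on the Hyper API documentation, a rough sketch of such a script could look like the following (the file, schema and table names are made up, and the prepared data is assumed to sit in a pandas DataFrame):

In [ ]:
import pandas as pd
from tableauhyperapi import (HyperProcess, Telemetry, Connection, CreateMode,
                             TableDefinition, TableName, SqlType, Inserter)

# Hypothetical prepared data
df = pd.DataFrame({"sent": [61520, 55308], "opened": [15677, 17371]})

# Write the DataFrame into a .hyper extract that can then be published to Tableau Online
with HyperProcess(telemetry=Telemetry.DO_NOT_SEND_USAGE_DATA_TO_TABLEAU) as hyper:
    with Connection(endpoint=hyper.endpoint,
                    database="communications.hyper",
                    create_mode=CreateMode.CREATE_AND_REPLACE) as connection:
        table = TableDefinition(
            table_name=TableName("Extract", "Extract"),
            columns=[
                TableDefinition.Column("sent", SqlType.big_int()),
                TableDefinition.Column("opened", SqlType.big_int()),
            ],
        )
        connection.catalog.create_schema(schema=table.table_name.schema_name)
        connection.catalog.create_table(table)
        with Inserter(connection, table) as inserter:
            inserter.add_rows(rows=df.itertuples(index=False, name=None))
            inserter.execute()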

If you want to export your data to an FTP server, I wrote a simple function to do so; feel free to use it:

In [11]:
import os
import pysftp

def copy_file_on_ftp(myHostname, myUsername, myPassword, filename, localpath):
    """
    Copy a local file to the /Export/ folder of an sFTP server.
    :param myHostname (str): sFTP server host
    :param myUsername (str): username to log in to the sFTP server
    :param myPassword (str): password used to log in
    :param filename (str): name of the file to be exported
    :param localpath (str): local folder containing the file
    """
    remotepath = '/Export/' + str(filename)
    os.chdir(localpath)
    print('')
    print(localpath)
    print('')
    cnopts = pysftp.CnOpts()
    cnopts.hostkeys = None

    # The context manager closes the connection automatically
    with pysftp.Connection(host=myHostname, username=myUsername, password=myPassword, cnopts=cnopts) as sftp:
        print("Connection successfully established ... ")
        print('Exporting...')
        with pysftp.cd(localpath):
            sftp.put(localpath=localpath + filename, remotepath=remotepath)
    print('Export to sFTP successful!')
    print('')

Another option would be to use the Google Drive solution I mentioned earlier in the post: export your data over there and have Tableau connect to that dataset through the Google Sheets connector.

I used that solution for one dashboard, and found it quite handy. One important aspect here is to create an authentication token that skips the whole manual identification process.

The function export_file_on_google_drive below exports the file to a Google Drive account. You would however need to create a client_secrets.json file to allow the script to reach the correct account, and it would still require manual identification. That solution was taken from here.

In [ ]:
import os
from pathlib import Path
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

def export_file_on_google_drive(source_file, destination_file):
    """
    Upload the content of a local file (destination_file) to the Google Drive
    file whose title matches source_file. Requires a client_secrets.json file
    and manual identification in the browser.
    """
    # Move to the data folder of the project
    path = Path(os.getcwd())
    os.chdir(os.path.join(str(path.parent.parent), 'dat_ini'))

    # Authentication steps (opens a browser window for manual identification)
    gauth = GoogleAuth()
    gauth.LocalWebserverAuth()
    drive = GoogleDrive(gauth)

    # List files in Google Drive
    file_list = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()

    # Find the file to update
    for file1 in file_list:
        print('title: %s, id: %s' % (file1['title'], file1['id']))
        if file1['title'] == source_file:
            file_of_interest = file1

    file_of_interest.SetContentFile(destination_file)
    file_of_interest.Upload()
    return destination_file + ' was exported to Google Drive as ' + source_file

That solution is OK I guess, but quite annoying if you need to export data to Drive regularly. Every time, you will have to authenticate, which takes 5 seconds and 2 clicks, but it's 5 seconds and 2 clicks too much I think ;)

The function below goes one step further by using a credentials file, which needs to be created once through authentication and then allows the code to access the Drive directly. This mycreds.txt file is required on top of the client_secrets.json file mentioned above. The presence of that file in the folder allows the script to directly access the Drive account, without any manual identification. This solution was taken from here.

In [12]:
import os
from pathlib import Path
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

def export_file_on_google_drive_no_id(source_file, destination_file):
    """
    Same as export_file_on_google_drive, but stores the credentials in
    mycreds.txt so that manual identification is only required the first time.
    """
    # Move to the data folder of the project
    path = Path(os.getcwd())
    os.chdir(os.path.join(str(path.parent.parent), 'dat_ini'))

    # Authentication steps
    gauth = GoogleAuth()
    gauth.LoadCredentialsFile("mycreds.txt")
    if gauth.credentials is None:
        # Authenticate if the credentials are not there
        gauth.LocalWebserverAuth()
    elif gauth.access_token_expired:
        # Refresh them if expired
        gauth.Refresh()
    else:
        # Initialize the saved credentials
        gauth.Authorize()
    # Save the current credentials to a file
    gauth.SaveCredentialsFile("mycreds.txt")
    drive = GoogleDrive(gauth)

    # List files in Google Drive
    file_list = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()

    # Find the file to update
    for file1 in file_list:
        print('title: %s, id: %s' % (file1['title'], file1['id']))
        if file1['title'] == source_file:
            file_of_interest = file1

    file_of_interest.SetContentFile(destination_file)
    file_of_interest.Upload()
    return destination_file + ' was exported to Google Drive as ' + source_file
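
A typical call, with made-up file names, would then be:

In [ ]:
# Hypothetical names: the Google Sheet on Drive is titled 'dashboard_data',
# and the freshly prepared local file is 'dashboard_data.csv'
export_file_on_google_drive_no_id(
    source_file="dashboard_data",
    destination_file="dashboard_data.csv",
)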

Final notes

That's it! I hope that helped a bit. Feel free to get in touch if you want more information about the actual pipeline. I am not allowed to share it, but I would be happy to discuss it directly if you are encountering some issues.