WRITE TO GOOGLE BIGQUERY USING TABLEAU PREP
--
Data preparation has historically been an IT function, but the data landscape has evolved. Self-service data preparation tools put the power in the hands of the people who know the data best while reducing the burden on IT to prepare it. But self-service data prep is still a brand-new skill set. It needs to be developed and rolled out so that users can understand and perform prep functions effectively, establish repeatable processes, automate them for efficiency, and ultimately build trust and confidence in the data for wider use.
You might have already seen Tableau as a visual analytics tool in action. But the final result of visualizing your data is only the tip of the iceberg. Before you bring data into a data prep tool, it’s important to understand what you’re working with, and you need to know whether you’re looking at the entire data set or only a subset. You may also need to do some data exploration before you start cleaning.
In reality, data analysts spend a huge amount of their time cleaning and preparing data. The process is tedious, but it is also critical; you probably know the saying, "Trash in, trash out." So it's essential to get the data cleaned and into the right format. A great tool released by Tableau back in 2018 to help us do this easily and efficiently is Tableau Prep.
Before starting this article, let me give you an overview of Tableau Prep. Tableau Prep is a product designed to help everyone quickly and confidently combine, shape, and clean their data for analysis. The direct, visual experience gives you a deeper understanding of your data, and smart experiences make data prep easier and more accessible. Plus, it's seamlessly integrated with the Tableau analytical workflow, making it easy to go from data prep to analysis. To learn more about Tableau Prep, see the KB article and start your data preparation journey.
With 2021.2, Tableau Prep expands its output capabilities to include Google BigQuery, in addition to the databases it already supported. This enables you to add or update data in Google BigQuery with clean, prepped data from your flow each time the flow runs.
STEP 1: CREATE A PUBLISHED DATA SOURCE
- Click the plus icon on a step and select Add Output.
- The output pane opens and shows you a snapshot of your data.
- From the Save output to drop-down list, select Published data source (called Publish as a data source in previous versions).
Complete the following fields:
- Server (Tableau Prep Builder only): Select the server to publish the data source and extract. If you aren’t signed in to a server, you will be prompted to sign in.
Note: Starting in Tableau Prep Builder version 2020.1.4, after you sign in to your server, Tableau Prep Builder remembers your server name and credentials when you close the application. The next time you open the application, you are already signed into your server.
On the Mac, you may be prompted to provide access to your Mac keychain so Tableau Prep Builder can securely use SSL certificates to connect to your Tableau Server or Tableau Online environment.
- Project: Select the project where you want to load the data source and extract.
- Name: Enter a file name.
- Description: Enter a description for the data source.
(Tableau Prep Builder version 2020.2.1 and later) In the Write Options section, view the default write option to write the new data to your files and make any changes as needed.
- Create table: This option creates a new table or replaces the existing table with the new output.
- Append to table: This option adds the new data to your existing table. If the table doesn't already exist, a new table is created, and subsequent runs add new rows to this table.
- Click Run Flow to run the flow and publish the data source.
- (Optional) Click the Custom SQL tab and enter your SQL script. You can enter a script to run Before and After the data is written to the table.
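Tableau Prep runs those Before and After scripts on the target database itself, so there is nothing extra to code. Purely as an illustration of what such a pair of scripts might do, here is a minimal sketch using the google-cloud-bigquery Python client; the project, dataset, table, and load_ts column are all hypothetical names, not anything Tableau Prep requires.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-billing-project")  # hypothetical project ID

# "Before" script: archive the current rows before the flow overwrites them
before_sql = """
CREATE TABLE IF NOT EXISTS TEMP.orders_backup AS
SELECT * FROM TEMP.orders
"""

# "After" script: stamp the load time once the new rows have landed
# (assumes a load_ts column exists in the table -- hypothetical)
after_sql = """
UPDATE TEMP.orders
SET load_ts = CURRENT_TIMESTAMP()
WHERE load_ts IS NULL
"""

client.query(before_sql).result()  # Prep would run this before writing
# ... the flow output is written here by Tableau Prep ...
client.query(after_sql).result()   # Prep would run this after writing
```

A backup-before, timestamp-after pattern like this is a common way to make a scheduled flow auditable.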
STEP 2: DATA VALIDITY
If there are any field mismatches, a status note shows you errors.
- No match: Field is ignored: Fields exist in the flow but not in the database. The field won’t be added to the database table unless you select the Create table write option and perform a full refresh. Then the flow fields are added to the database table and use the flow output schema.
- No match: Field will contain Null values: Fields exist in the database but not in the flow. The flow passes a Null value to the database table for the field. If the field does exist in the flow but is mismatched because the field name is different, you can navigate to a cleaning step and edit the field name to match the database field name.
- Error: Field data types do not match: The data type assigned to a field in the flow and in the database table you are writing your output to must match; otherwise, the flow will fail. You can navigate to a cleaning step and edit the field data type to fix this.
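If you want to check for these mismatches before running the flow, you can compare the flow's output columns against the table schema yourself. Here is a rough sketch with the google-cloud-bigquery client and pandas; flow_df stands in for your flow output, and the project, dataset, and table names are hypothetical.

```python
from google.cloud import bigquery
import pandas as pd

client = bigquery.Client(project="my-billing-project")  # hypothetical
table = client.get_table("my-billing-project.TEMP.orders")

# Stand-in for your flow output
flow_df = pd.DataFrame({"order_id": [1], "channel": ["email"]})

db_fields = {f.name for f in table.schema}
flow_fields = set(flow_df.columns)

for name in sorted(flow_fields - db_fields):
    print(f"No match, field is ignored: '{name}' exists only in the flow")

for name in sorted(db_fields - flow_fields):
    print(f"No match, will contain NULLs: '{name}' exists only in the table")

# For fields present on both sides, eyeball the types against the flow
for f in table.schema:
    if f.name in flow_fields:
        print(f"'{f.name}': table type is {f.field_type}; the flow type must match")
```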
STEP 3: WRITE BACK IMPLEMENTATION
A common feature that data analysts, data scientists, and data engineers often ask for in a business intelligence reporting tool is the ability to combine data from different databases, especially data spread across two or more database vendors such as PostgreSQL, MySQL, SQL Server, and BigQuery (imagine data available in different regions, say the US and Southeast Asia). This article explores that scenario and shows you how to achieve it using Tableau Prep.
Let me start with an example: Your company is an upstart in the e-commerce space. You have implemented an order and delivery management system that stores data in Excel (coming from SQL Server). Additionally, you collect website traffic and visit-related information and store it in separate CSVs (coming from a PostgreSQL database). Your marketing team and the founders want to understand which channels best result in a purchase. Thus, your reporting solution needs to combine data from both sources and run fast enough to provide quick drill-downs and filtering on the visualizations you intend to build.
Suppose you have already queried four database tables and joined the data using a conditional expression in a Tableau Prep flow. Now you are planning to write the data to a BigQuery table.
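For intuition only, here is roughly what that combination step looks like outside of Prep, sketched in pandas. The file and column names (orders.xlsx, web_visits.csv, customer_id, visit_ts, order_ts, channel) are all hypothetical; in the actual flow, Prep's visual join does this work for you.

```python
import pandas as pd

# Hypothetical exports: orders from the SQL Server-backed system (Excel),
# web visits from the PostgreSQL-backed tracker (CSV)
orders = pd.read_excel("orders.xlsx")
visits = pd.read_csv("web_visits.csv")

# Join on the customer, then keep only visits that happened before the
# order -- this filter plays the role of the conditional join expression.
joined = orders.merge(visits, on="customer_id", how="inner")
joined = joined[joined["visit_ts"] <= joined["order_ts"]]

# Last-touch attribution: the latest qualifying visit per order wins
attributed = (
    joined.sort_values("visit_ts")
          .groupby("order_id")
          .tail(1)[["order_id", "channel", "visit_ts"]]
)
print(attributed.head())
```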
STEP 4: CONNECTION INFORMATION
SCHEMA COMPONENTS
When you specify a table schema, you must supply each column’s name and correct data type.
COLUMN NAMES
A column name must contain only letters (a-z, A-Z), numbers (0–9), or underscores (_), and it must start with a letter or underscore. The maximum column name length is 300 characters. A column name cannot use any of the following prefixes:
- _TABLE_
- _FILE_
- _PARTITION
Duplicate column names are not allowed even if the case differs. For example, a column named Column1 is considered identical to a column named column1.
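If you are renaming fields in a cleaning step and want to check them against these rules up front, a minimal Python validator might look like this. It is a sketch of the rules listed above, not an official BigQuery check.

```python
import re

FORBIDDEN_PREFIXES = ("_TABLE_", "_FILE_", "_PARTITION")
# Letters, digits, underscores; must start with a letter or underscore;
# at most 300 characters in total
NAME_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]{0,299}$")

def valid_bq_column(name: str, seen: set) -> bool:
    if not NAME_RE.match(name):
        return False  # bad characters, bad first character, or too long
    if name.upper().startswith(FORBIDDEN_PREFIXES):
        return False  # reserved prefix
    if name.lower() in seen:
        return False  # duplicate even though the case differs
    seen.add(name.lower())
    return True

seen = set()
for col in ["order_id", "Column1", "column1", "_FILE_name", "1st_col"]:
    print(col, "->", valid_bq_column(col, seen))
# order_id -> True, Column1 -> True, column1 -> False,
# _FILE_name -> False, 1st_col -> False
```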
Note: You specify the database connection and a table name; if the table does not exist, it is created when the flow output is written.
1. From the Save output to drop-down list, select Database table.
2. In the Connection drop-down list, select the database connector where you want to write your flow output.
3. In the Billing Project drop-down list, select the Cloud Billing account for your project.
4. In the Project list, select the project where your BigQuery dataset is stored. Click the project ID; you should now see a list of all the available datasets.
5. Specify the Dataset ID. In this case, for demonstration purposes, I am using the "TEMP" dataset.
6. Add a table name (no spaces allowed). In the Table drop-down list, select the table where you want to save your flow output data. Depending on the Write Option you select (see the sketch after this list), a new table will be created, the flow data will replace any existing data in the table, or the flow data will be added to the existing table.
7. Run the ETL Prep Flow.
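Under the hood, the two Write Options behave much like the write dispositions on an ordinary BigQuery load job. As a hedged illustration (not what Tableau Prep literally executes), here is the equivalent load expressed with the google-cloud-bigquery client; the project, dataset, and table names are hypothetical.

```python
from google.cloud import bigquery
import pandas as pd

client = bigquery.Client(project="my-billing-project")  # hypothetical

flow_output = pd.DataFrame({"order_id": [101], "channel": ["email"]})

job_config = bigquery.LoadJobConfig(
    # "Create table"    -> WRITE_TRUNCATE: replace any existing rows
    # "Append to table" -> WRITE_APPEND: add rows, create the table if missing
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
job = client.load_table_from_dataframe(
    flow_output, "my-billing-project.TEMP.orders", job_config=job_config
)
job.result()  # block until the load finishes
```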
Once the workflow completes, Tableau Prep writes the cleansed/modified data to the BigQuery table. After that, you can open your BigQuery table and validate the data. One thing to keep in mind: currently, Tableau Prep can write up to 2 GB of output to the table. While performing this POC, I found it had very low latency and high throughput, which is an added bonus. This is truly a game-changer and will surely save developers time. Your final output will look like the image below.
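To validate the landed data without leaving your terminal, a couple of quick checks with the BigQuery client work well; the table name below is hypothetical, and the size check is just a manual way to keep an eye on the ~2 GB output cap.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-billing-project")  # hypothetical

# Row count of the freshly written table
rows = client.query(
    "SELECT COUNT(*) AS n FROM `my-billing-project.TEMP.orders`"
).result()
print("rows written:", next(iter(rows)).n)

# Table size, handy for eyeballing the output cap
table = client.get_table("my-billing-project.TEMP.orders")
print(f"table size: {table.num_bytes / 1e9:.2f} GB")
```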
We tried to cover as much as we could for a newbie to get started with writing flow output to Google BigQuery using Tableau Prep. Hope you like it. As always, we welcome feedback and constructive criticism. We can be reached on Twitter @rajvivan and @pallavinaik_. If you enjoyed this blog, we'd love for you to hit the share button so others might stumble upon it. Please hit the subscribe button as well if you'd like to be added to our once-weekly email list, and don't forget to follow Vizartpandey on Instagram!
Also, here are a few hand-picked articles for you to read next:
- Tableau Online Dashboard — How to do auto-refresh every 5 min
- How to extract Tableau field information using Python API
- Extended bar chart in Tableau
- Taking your design to next level in Tableau
- Create custom maps in Tableau
- Drop Lines using Parameter Actions
- Toggle Button using Parameter Actions
- Overview of Alteryx Architecture
- How to use Buffer Spatial Function In Tableau