Ghostly whitespaces that spooks data scientists and what can save their day.

Data scientists often wonders if whitespaces in their codes and data would cause some problems and whether it makes sense to replace them with “_” or “-“ instead.  While whitespaces exist out of necessity in Python codes as tab indentation in functions and for loops, its presence in data can be troublesome. Let’s confirm if Pandas column name for example can tolerate whitespace.

 

Data Scientist Whitespaces Example 1:

The first column name, ‘Data Engineering’, is separated by whitespace. When we run the following in Jupyter Notebook, we get the desired output – confirming that whitespace is tolerated in column name.

Data scientists whitespaces

Now let’s check if whitespace in list items and dataframe column values is giving us any problem. Whitespace separated ‘data cleansing’ as a new entry in ‘Data Engineering’ column would serve the purpose. Once again, we get the desired output when run in Jupyter Notebook without any error.

Data scientists whitespaces

But now can the same seemingly innocent whitespaces in our data giving as nightmare when we process them further for, say visualization purposes.

 

Data Scientist Whitespaces Example 2:

Let’s see another example where we need to produce pivot table before plotting a stacked bar chart sums of the counts of keywords.

whitespaceCode3

When we run the following codes and inspect the dataframe it gives the desired output along with correct sums of keywords. From the pivot table we do not notice any whitespace that worries us since our earlier observation suggests that nothing should go wrong. But unfortunately, when we start creating traces for our stacked bar chart, and assuming there are no problematic whitespaces, we get key error.

whitespaceCode4

 

The Solution

It says something is wrong with our keyword names ‘DE’ and ‘Python’. We could not have guessed that this is a whitespace problem.There are whitespaces in our keywords and we are calling them without the whitespaces. Let’s now see if Pandas can reveal this whitespace problem, and the following codes expose it. Obviously, we can see trailing whitespace after both keywords.

Data scientists whitespaces

Fortunately a simple Lambda function can be employed to rename our column names and simultaneously stripping the trailing whitespaces.

Data scientists whitespaces

 

 225 total views,  1 views today

Leave a Reply

CommentLuv badge

This site uses Akismet to reduce spam. Learn how your comment data is processed.