Data Cleansing and Filtering

Transform and Analyze Data with Microsoft Fabric

Luis Silva

Solution Architect - Data & AI

Data cleansing tools

 

 

Icons representing three tools: SQL, Spark and Dataflows

Transform and Analyze Data with Microsoft Fabric

Handling blank rows

  • Remove row when entire row is blank

 

Diagram showing a table with several blank rows transformed into a table where the blank rows have been removed

Transform and Analyze Data with Microsoft Fabric

Handling blank rows with Dataflows

  • Remove blank rows transformation

 

Screenshot of the Remove rows menu options in Dataflow, highlighting the option to remove blank rows, with a table example of before and after removing the rows

Transform and Analyze Data with Microsoft Fabric

Handling blank rows with Spark

  • DataFrame dropna( ) function
  • Can drop a row with any null values, or with all null values

Table example of before and after removing the rows using two variations of the dropna function

Transform and Analyze Data with Microsoft Fabric

Handling duplicate rows

  • Remove duplicate rows

 

Diagram showing a table with a duplicate row transformed into a table where one of the duplicate records has been removed

Transform and Analyze Data with Microsoft Fabric

Handling duplicate rows with Dataflows

  • Remove duplicates transformation

 

Screenshot of the Remove rows menu options in Dataflow, highlighting the option to remove duplicates, with a table example of before and after removing the rows

Transform and Analyze Data with Microsoft Fabric

Handling duplicate rows with Spark

  • DataFrame dropDuplicates( ) function

 

Diagram showing a table with a duplicate row transformed into a table where one of the duplicate records has been removed after calling the dataframe dropDuplicates function

Transform and Analyze Data with Microsoft Fabric

Replacing values

  • Replace occurrences of a certain value in a column with another value.
  • Can also be used to replace empty cells with a value.

 

Diagram showing a table with a the value of ABC on a couple of records entries being replaced with XYA values.

Transform and Analyze Data with Microsoft Fabric

Replacing values with Dataflows

  • Replace values transformation

 

Screenshot of the Replace values transformation menu option in Dataflow, with a table example of before and after changing empty values for the string '(Unknown)'

Transform and Analyze Data with Microsoft Fabric

Replacing values with Spark

  • DataFrame replace( )_ function

 

Diagram showing a table with the N/A value in a couple of records being replaced for the string 'Unknown' after calling the PySpark function replace

Transform and Analyze Data with Microsoft Fabric

Correcting data types

  • Consistency of data types across the model
  • Supports comparison between values

Image representing a data column storing numbers as text being converted to actual numbers

Transform and Analyze Data with Microsoft Fabric

Correcting data types

  • Consistency of data types across the model.
  • Supports comparison between values.
  • Supports datatype-specific operations.
    • Arithmetic operations on numeric columns.
    • Date/time operations on date/time columns.

Example of some of the transformations available on a text column, like lowercase, uppercase and trim, and transformations available on a numeric column, like sum and maximum

Transform and Analyze Data with Microsoft Fabric

Correcting data types with Dataflows

  • Context-sensitive menu (option 1):
    1. Select a column in the data view.
    2. Click on the icon on the left side of the column heading.

Screenshot highlighting the icon on the left side of a column's  heading that gives access to the Change type context menu

Transform and Analyze Data with Microsoft Fabric

Correcting data types with Dataflows

  • Context-sensitive menu (option 1):
    1. Select a column in the data view.
    2. Click on the icon on the left side of the column heading.
    3. Select a new data type from the list.

Screenshot of the Change type context menu available when clicking on the data type in the heading of a column in the data view

Transform and Analyze Data with Microsoft Fabric

Correcting data types with Dataflows

  • Context-sensitive menu (option 2):
    1. Select a column in the data view.
    2. Right-click on the column heading.

Screenshot highlighting the heading of a column that we wish to change the data type

Transform and Analyze Data with Microsoft Fabric

Correcting data types with Dataflows

  • Add a Transform:
    1. Select a column in the data view.
    2. Add a transformation by clicking on (+) in the diagram view.
    3. Click on Change type from the pop-up menu and select a new data type from the list.

Screenshot showing a query where the Change Type transform is being added

Transform and Analyze Data with Microsoft Fabric

Correcting data types with Spark

  • DataFrame cast( )_ function
df.withColumn("new_column_name", df["column_name"].cast("new_data_type"))
  • Example:
# Convert the [orderID] column to integer
df = df.withColumn("orderID", df["orderID"].cast("integer"))
Transform and Analyze Data with Microsoft Fabric

Data Filtering

  • Include or exclude rows based on some condition.
SELECT
  OrderID, State, Date, Amount
FROM
  Orders 
WHERE
  State = 'NY'

Diagram that shows a table filtering out any row not related to the state of NY

Transform and Analyze Data with Microsoft Fabric

Data filtering with Dataflows

  • Use the Filter transformation

Screenshot of the filter transform applied to a column

Transform and Analyze Data with Microsoft Fabric

Data filtering with Spark

  • DataFrame filter( ) function
  • where( ) is an alias for filter( )
df.filter( <condition> )
  • Example:
# Return only DataFrame rows whete State = 'NY'
df.filter( df.state == 'NY' )
Transform and Analyze Data with Microsoft Fabric

Let's practice!

Transform and Analyze Data with Microsoft Fabric

Preparing Video For Download...