Modifying imports: true/false data

Streamlined Data Ingestion with pandas

Amany Mahfouz

Instructor

Boolean Data

  • True/False data

Spreadsheet of survey data with several true/false columns

Streamlined Data Ingestion with pandas

Boolean Data

Spreadsheet of survey data with columns about bootcamp attendance highlighted

Streamlined Data Ingestion with pandas

Boolean Data

Spreadsheet of survey data with columns about loan usage highlighted

Streamlined Data Ingestion with pandas

Boolean Data

Spreadsheet of survey data with several True/False columns

Streamlined Data Ingestion with pandas

Boolean Data

Spreadsheet of survey data with columns using zeros and ones highlighted

Streamlined Data Ingestion with pandas

Boolean Data

Spreadsheet of survey data with columns using Trues and Falses highlighted

Streamlined Data Ingestion with pandas

Boolean Data

Spreadsheet of survey data with columns using yeses and nos highlighted

Streamlined Data Ingestion with pandas

pandas and Booleans

bootcamp_data = pd.read_excel("fcc_survey_booleans.xlsx")
print(bootcamp_data.dtypes)
ID.x                      object
AttendedBootcamp         float64
AttendedBootCampYesNo     object
AttendedBootcampTF       float64
BootcampLoan             float64
LoanYesNo                 object
LoanTF                   float64
dtype: object
Streamlined Data Ingestion with pandas

pandas and Booleans

# Count True values
print(bootcamp_data.sum())
AttendedBootcamp                      38
AttendedBootcampTF                    38
BootcampLoan                          14
LoanTF                                14
dtype: object
# Count NAs
print(bootcamp_data.isna().sum())
ID.x                           0
AttendedBootcamp               0
AttendedBootCampYesNo          0
AttendedBootcampTF             0
BootcampLoan                 964
LoanYesNo                    964
LoanTF                       964
dtype: int64
Streamlined Data Ingestion with pandas
# Load data, casting True/False columns as Boolean
bool_data = pd.read_excel("fcc_survey_booleans.xlsx", 
                          dtype={"AttendedBootcamp": bool,
                                "AttendedBootCampYesNo": bool,
                                "AttendedBootcampTF":bool,
                                "BootcampLoan": bool,
                                "LoanYesNo": bool,
                                "LoanTF": bool})

print(bool_data.dtypes)
ID.x                      object
AttendedBootcamp            bool
AttendedBootCampYesNo       bool
AttendedBootcampTF          bool
BootcampLoan                bool
LoanYesNo                   bool
LoanTF                      bool
dtype: object
Streamlined Data Ingestion with pandas
# Count True values
print(bool_data.sum())
AttendedBootcamp                         38
AttendedBootCampYesNo                  1000
AttendedBootcampTF                       38
BootcampLoan                            978
LoanYesNo                              1000
LoanTF                                  978
dtype: object
# Count NA values
print(bool_data.isna().sum())
ID.x                       0
AttendedBootcamp           0
AttendedBootCampYesNo      0
AttendedBootcampTF         0
BootcampLoan               0
LoanYesNo                  0
LoanTF                     0
dtype: int64
Streamlined Data Ingestion with pandas

pandas and Booleans

  • pandas loads True/False columns as float data by default
  • Specify a column should be bool with read_excel()'s dtype argument
  • Boolean columns can only have True and False values
  • NA/missing values in Boolean columns are changed to True
  • pandas automatically recognizes some values as True/False in Boolean columns
  • Unrecognized values in a Boolean column are also changed to True
Streamlined Data Ingestion with pandas

Setting Custom True/False Values

  • Use read_excel()'strue_values argument to set custom True values
  • Use false_values to set custom False values
  • Each takes a list of values to treat as True/False, respectively
  • Custom True/False values are only applied to columns set as Boolean
Streamlined Data Ingestion with pandas

Setting Custom True/False Values

# Load data with Boolean dtypes and custom T/F values
bool_data = pd.read_excel("fcc_survey_booleans.xlsx", 
                          dtype={"AttendedBootcamp": bool,
                                "AttendedBootCampYesNo": bool,
                                "AttendedBootcampTF":bool,
                                "BootcampLoan": bool,
                                "LoanYesNo": bool,
                                "LoanTF": bool},
                          true_values=["Yes"],
                          false_values=["No"])
Streamlined Data Ingestion with pandas

Setting Custom True/False Values

print(bool_data.sum())
AttendedBootcamp                                  38
AttendedBootCampYesNo                             38
AttendedBootcampTF                                38
BootcampLoan                                     978
LoanYesNo                                        978
LoanTF                                           978
dtype: object
Streamlined Data Ingestion with pandas

Boolean Considerations

  • Are there missing values, or could there be in the future?
  • How will this column be used in analysis?
  • What would happen if a value were incorrectly coded as True?
  • Could the data be modeled another way (e.g., as floats or integers)?
Streamlined Data Ingestion with pandas

Let's practice!

Streamlined Data Ingestion with pandas

Preparing Video For Download...