Cleaning Data in Python
Adel Nehme
Content Developer @DataCamp
I) Value inconsistency
'married', 'Maried', 'UNMARRIED', 'not married'..
'married ', ' married '..
II) Collapsing too many categories into a few
0-20K, 20-40K...
Mapping to 2 groups such as 'rich', 'poor'
III) Making sure the data is of type category (seen in Chapter 1)
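Item III can be sketched in a few lines. This is a minimal illustration, assuming a small made-up `demographics` DataFrame; only the column name comes from the slides.

```python
import pandas as pd

# Hypothetical sample data; only the column name follows the slides
demographics = pd.DataFrame(
    {"marriage_status": ["married", "unmarried", "married"]}
)

# Convert the column from object (string) dtype to the category dtype
demographics["marriage_status"] = demographics["marriage_status"].astype("category")

# The distinct categories pandas inferred from the data
categories = list(demographics["marriage_status"].cat.categories)
print(demographics["marriage_status"].dtype)
print(categories)
```

Converting to `category` only after the values have been cleaned avoids ending up with one category per misspelling.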
Capitalization: 'married', 'Married', 'UNMARRIED', 'unmarried'..
# Get marriage_status column
marriage_status = demographics['marriage_status']
marriage_status.value_counts()
unmarried 352
married 268
MARRIED 204
UNMARRIED 176
dtype: int64
# Get value counts on the DataFrame
demographics.groupby('marriage_status').count()
household_income gender
marriage_status
MARRIED 204 204
UNMARRIED 176 176
married 268 268
unmarried 352 352
# Convert to uppercase
demographics['marriage_status'] = demographics['marriage_status'].str.upper()
demographics['marriage_status'].value_counts()
UNMARRIED 528
MARRIED 472
# Convert to lowercase
demographics['marriage_status'] = demographics['marriage_status'].str.lower()
demographics['marriage_status'].value_counts()
unmarried 528
married 472
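The effect of case normalization can be checked on a toy example. The sketch below assumes a four-row, made-up `demographics` DataFrame and counts the distinct labels before and after lowercasing.

```python
import pandas as pd

# Hypothetical sample data with inconsistent capitalization
demographics = pd.DataFrame(
    {"marriage_status": ["married", "Married", "UNMARRIED", "unmarried"]}
)

# Number of distinct labels before cleaning
before = demographics["marriage_status"].nunique()

# Lowercase every value so variants of the same label collapse together
demographics["marriage_status"] = demographics["marriage_status"].str.lower()

counts = demographics["marriage_status"].value_counts()
print(before, "distinct labels before,", len(counts), "after")
```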
Trailing/leading spaces: 'married ', 'married', 'unmarried', ' unmarried'..
# Get marriage_status column
marriage_status = demographics['marriage_status']
marriage_status.value_counts()
unmarried 352
unmarried 268
married 204
married 176
dtype: int64
# Strip leading and trailing whitespace
demographics['marriage_status'] = demographics['marriage_status'].str.strip()
demographics['marriage_status'].value_counts()
unmarried 528
married 472
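Whitespace differences are hard to spot in printed output, so a small reproducible check can help. The following sketch uses made-up values; `.str.strip()` removes only leading and trailing whitespace, not spaces inside a value.

```python
import pandas as pd

# Hypothetical sample data with stray leading/trailing spaces
demographics = pd.DataFrame(
    {"marriage_status": ["married ", " married", "unmarried", " unmarried "]}
)

# Each padded variant counts as its own label before cleaning
raw_labels = demographics["marriage_status"].nunique()

# Assign the stripped Series back to the column, not to the whole DataFrame
demographics["marriage_status"] = demographics["marriage_status"].str.strip()

clean_labels = sorted(demographics["marriage_status"].unique())
print(raw_labels, clean_labels)
```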
Creating categories from data: income_group from the income column.
# Using qcut()
import pandas as pd
group_names = ['0-200K', '200K-500K', '500K+']
demographics['income_group'] = pd.qcut(demographics['household_income'], q = 3,
labels = group_names)
# Print income_group column
demographics[['income_group', 'household_income']]
income_group household_income
0 200K-500K 189243
1 500K+ 778533
..
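Because `qcut()` derives its bin edges from the data's quantiles, labels that name fixed ranges may not match the actual edges, which may be why 189243 lands in '200K-500K' above. The sketch below uses made-up incomes and neutral, hypothetical label names, and passes `retbins=True` to inspect the edges pandas computed.

```python
import pandas as pd

# Hypothetical incomes; 189243 and 778533 echo the values shown in the slides
incomes = pd.Series([50_000, 120_000, 189_243, 310_000, 450_000, 778_533])

# Neutral labels, since the quantile edges are not known in advance
group_names = ["low", "middle", "high"]

# q=3 splits the data into three equally sized groups (tertiles);
# retbins=True also returns the computed bin edges
groups, edges = pd.qcut(incomes, q=3, labels=group_names, retbins=True)

print(list(groups))
print(edges)
```

Each group holds the same number of rows, but the cut points move whenever the data changes.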
Creating categories from data: income_group from the income column.
# Using cut() - create category ranges and names
import numpy as np
import pandas as pd
ranges = [0, 200000, 500000, np.inf]
group_names = ['0-200K', '200K-500K', '500K+']
# Create income_group column
demographics['income_group'] = pd.cut(demographics['household_income'], bins=ranges,
labels=group_names)
demographics[['income_group', 'household_income']]
income_group household_income
0 0-200K 189243
1 500K+ 778533
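One detail of `cut()` worth checking is how it treats values that sit exactly on a bin edge. The sketch below, on made-up incomes, relies on the default `right=True`, where each bin includes its right edge but not its left one. An income of exactly 0 therefore falls outside the first bin unless `include_lowest=True` is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes, including one exactly on a bin edge
incomes = pd.Series([189_243, 778_533, 200_000, 350_000])

ranges = [0, 200_000, 500_000, np.inf]
group_names = ["0-200K", "200K-500K", "500K+"]

# Default right=True: bins are (0, 200000], (200000, 500000], (500000, inf]
groups = pd.cut(incomes, bins=ranges, labels=group_names)

# A value equal to the lowest edge is not covered by default ...
zero_default = pd.cut(pd.Series([0]), bins=ranges, labels=group_names)
# ... unless the first interval is made left-inclusive
zero_included = pd.cut(
    pd.Series([0]), bins=ranges, labels=group_names, include_lowest=True
)

print(list(groups))
print(zero_default.isna().all(), list(zero_included))
```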
Mapping categories to fewer ones: reducing the categories in a categorical column.
operating_system column is: 'Microsoft', 'MacOS', 'IOS', 'Android', 'Linux'
operating_system column should become: 'DesktopOS', 'MobileOS'
# Create mapping dictionary and replace
mapping = {'Microsoft':'DesktopOS', 'MacOS':'DesktopOS', 'Linux':'DesktopOS',
'IOS':'MobileOS', 'Android':'MobileOS'}
devices['operating_system'] = devices['operating_system'].replace(mapping)
devices['operating_system'].unique()
array(['DesktopOS', 'MobileOS'], dtype=object)
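`.replace()` leaves any value that is missing from the dictionary unchanged, so it can be worth checking for unmapped categories first. The sketch below uses a made-up `devices` DataFrame; the 'ChromeOS' value is hypothetical and included only to show the behavior.

```python
import pandas as pd

# Hypothetical sample data; 'ChromeOS' is deliberately absent from the mapping
devices = pd.DataFrame(
    {"operating_system": ["Microsoft", "IOS", "Linux", "Android", "MacOS", "ChromeOS"]}
)

mapping = {
    "Microsoft": "DesktopOS", "MacOS": "DesktopOS", "Linux": "DesktopOS",
    "IOS": "MobileOS", "Android": "MobileOS",
}

# Values present in the data but not covered by the mapping
unmapped = set(devices["operating_system"]) - set(mapping)

# replace() swaps mapped values and passes everything else through as-is
collapsed = devices["operating_system"].replace(mapping)

print(unmapped)
print(list(collapsed.unique()))
```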