본문 바로가기
데이터분석

[데이터 분석] sales 데이터 전처리하기

by CodingKwon 2021. 6. 30.

데이터 전처리

  • 이상치 삭제하기
  • 년,월,일로 date 분리하기
  • 수입, 연월, 월일 데이터 만들기

데이터 불러오기

import pandas as pd
sales = pd.read_csv('data/sales.csv')
sales.head()

date shop_id item_id item_price item_cnt_day
0 02.01.2013 59 22154 999.00 1.0
1 03.01.2013 25 2552 899.00 1.0
2 05.01.2013 25 2552 899.00 -1.0
3 06.01.2013 25 2554 1709.05 1.0
4 15.01.2013 25 2555 1099.00 1.0
len(sales)
2935849

이상치(item_price, item_cnt_day) 삭제하기

sales[sales['item_price'] <= 0].index
Int64Index([484683], dtype='int64')
sales.drop(sales[sales['item_price'] <= 0].index, inplace=True)
len(sales)
2935848
sales[sales['item_cnt_day'] <= 0].index
Int64Index([      2,     148,     175,     807,    1041,    1193,    1674,
               1825,    2411,    3216,
            ...
            2932589, 2932636, 2932721, 2933448, 2933712, 2934243, 2934462,
            2935263, 2935643, 2935779],
           dtype='int64', length=7356)
sales.drop(sales[sales['item_cnt_day'] <= 0].index, inplace=True)

len(sales)
2928492
sales.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2928492 entries, 0 to 2935848
Data columns (total 5 columns):
 #   Column        Dtype  
---  ------        -----  
 0   date          object 
 1   shop_id       int64  
 2   item_id       int64  
 3   item_price    float64
 4   item_cnt_day  float64
dtypes: float64(2), int64(2), object(1)
memory usage: 134.1+ MB

수입(income) 만들기

sales['income'] = sales['item_price'] * sales['item_cnt_day']

sales.head()

date shop_id item_id item_price item_cnt_day income
0 02.01.2013 59 22154 999.00 1.0 999.00
1 03.01.2013 25 2552 899.00 1.0 899.00
3 06.01.2013 25 2554 1709.05 1.0 1709.05
4 15.01.2013 25 2555 1099.00 1.0 1099.00
5 10.01.2013 25 2564 349.00 1.0 349.00

년,월,일 분리하기

sales[['day','month','year']]=sales['date'].str.split('.', expand=True)
sales.head()

date shop_id item_id item_price item_cnt_day income day month year
0 02.01.2013 59 22154 999.00 1.0 999.00 02 01 2013
1 03.01.2013 25 2552 899.00 1.0 899.00 03 01 2013
3 06.01.2013 25 2554 1709.05 1.0 1709.05 06 01 2013
4 15.01.2013 25 2555 1099.00 1.0 1099.00 15 01 2013
5 10.01.2013 25 2564 349.00 1.0 349.00 10 01 2013

연월(y_m) 만들기

sales['y_m'] = sales['year'] + '.' + sales['month']

월일(m_d) 만들기

sales['m_d'] = sales['month'] + '.' + sales['day']

sales.head()

date shop_id item_id item_price item_cnt_day income day month year y_m m_d
0 02.01.2013 59 22154 999.00 1.0 999.00 02 01 2013 2013.01 01.02
1 03.01.2013 25 2552 899.00 1.0 899.00 03 01 2013 2013.01 01.03
3 06.01.2013 25 2554 1709.05 1.0 1709.05 06 01 2013 2013.01 01.06
4 15.01.2013 25 2555 1099.00 1.0 1099.00 15 01 2013 2013.01 01.15
5 10.01.2013 25 2564 349.00 1.0 349.00 10 01 2013 2013.01 01.10

데이터 저장

sales.drop('date', axis=1, inplace=True)
sales.head()

shop_id item_id item_price item_cnt_day income day month year y_m m_d
0 59 22154 999.00 1.0 999.00 02 01 2013 2013.01 01.02
1 25 2552 899.00 1.0 899.00 03 01 2013 2013.01 01.03
3 25 2554 1709.05 1.0 1709.05 06 01 2013 2013.01 01.06
4 25 2555 1099.00 1.0 1099.00 15 01 2013 2013.01 01.15
5 25 2564 349.00 1.0 349.00 10 01 2013 2013.01 01.10
sales.to_csv('data/sales_pre.csv', index=False)

댓글