데이터 전처리
- 이상치 삭제하기
- 년,월,일로 date 분리하기
- 수입, 연월, 월일 데이터 만들기
데이터 불러오기
import pandas as pd
sales = pd.read_csv('data/sales.csv')
sales.head()
|
date |
shop_id |
item_id |
item_price |
item_cnt_day |
0 |
02.01.2013 |
59 |
22154 |
999.00 |
1.0 |
1 |
03.01.2013 |
25 |
2552 |
899.00 |
1.0 |
2 |
05.01.2013 |
25 |
2552 |
899.00 |
-1.0 |
3 |
06.01.2013 |
25 |
2554 |
1709.05 |
1.0 |
4 |
15.01.2013 |
25 |
2555 |
1099.00 |
1.0 |
len(sales)
2935849
이상치(item_price, item_cnt_day) 삭제하기
sales[sales['item_price'] <= 0].index
Int64Index([484683], dtype='int64')
sales.drop(sales[sales['item_price'] <= 0].index, inplace=True)
len(sales)
2935848
sales[sales['item_cnt_day'] <= 0].index
Int64Index([ 2, 148, 175, 807, 1041, 1193, 1674,
1825, 2411, 3216,
...
2932589, 2932636, 2932721, 2933448, 2933712, 2934243, 2934462,
2935263, 2935643, 2935779],
dtype='int64', length=7356)
sales.drop(sales[sales['item_cnt_day'] <= 0].index, inplace=True)
len(sales)
2928492
sales.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2928492 entries, 0 to 2935848
Data columns (total 5 columns):
# Column Dtype
--- ------ -----
0 date object
1 shop_id int64
2 item_id int64
3 item_price float64
4 item_cnt_day float64
dtypes: float64(2), int64(2), object(1)
memory usage: 134.1+ MB
수입(income) 만들기
sales['income'] = sales['item_price'] * sales['item_cnt_day']
sales.head()
|
date |
shop_id |
item_id |
item_price |
item_cnt_day |
income |
0 |
02.01.2013 |
59 |
22154 |
999.00 |
1.0 |
999.00 |
1 |
03.01.2013 |
25 |
2552 |
899.00 |
1.0 |
899.00 |
3 |
06.01.2013 |
25 |
2554 |
1709.05 |
1.0 |
1709.05 |
4 |
15.01.2013 |
25 |
2555 |
1099.00 |
1.0 |
1099.00 |
5 |
10.01.2013 |
25 |
2564 |
349.00 |
1.0 |
349.00 |
년,월,일 분리하기
sales[['day','month','year']]=sales['date'].str.split('.', expand=True)
sales.head()
|
date |
shop_id |
item_id |
item_price |
item_cnt_day |
income |
day |
month |
year |
0 |
02.01.2013 |
59 |
22154 |
999.00 |
1.0 |
999.00 |
02 |
01 |
2013 |
1 |
03.01.2013 |
25 |
2552 |
899.00 |
1.0 |
899.00 |
03 |
01 |
2013 |
3 |
06.01.2013 |
25 |
2554 |
1709.05 |
1.0 |
1709.05 |
06 |
01 |
2013 |
4 |
15.01.2013 |
25 |
2555 |
1099.00 |
1.0 |
1099.00 |
15 |
01 |
2013 |
5 |
10.01.2013 |
25 |
2564 |
349.00 |
1.0 |
349.00 |
10 |
01 |
2013 |
연월(y_m) 만들기
sales['y_m'] = sales['year'] + '.' + sales['month']
월일(m_d) 만들기
sales['m_d'] = sales['month'] + '.' + sales['day']
sales.head()
|
date |
shop_id |
item_id |
item_price |
item_cnt_day |
income |
day |
month |
year |
y_m |
m_d |
0 |
02.01.2013 |
59 |
22154 |
999.00 |
1.0 |
999.00 |
02 |
01 |
2013 |
2013.01 |
01.02 |
1 |
03.01.2013 |
25 |
2552 |
899.00 |
1.0 |
899.00 |
03 |
01 |
2013 |
2013.01 |
01.03 |
3 |
06.01.2013 |
25 |
2554 |
1709.05 |
1.0 |
1709.05 |
06 |
01 |
2013 |
2013.01 |
01.06 |
4 |
15.01.2013 |
25 |
2555 |
1099.00 |
1.0 |
1099.00 |
15 |
01 |
2013 |
2013.01 |
01.15 |
5 |
10.01.2013 |
25 |
2564 |
349.00 |
1.0 |
349.00 |
10 |
01 |
2013 |
2013.01 |
01.10 |
데이터 저장
sales.drop('date', axis=1, inplace=True)
sales.head()
|
shop_id |
item_id |
item_price |
item_cnt_day |
income |
day |
month |
year |
y_m |
m_d |
0 |
59 |
22154 |
999.00 |
1.0 |
999.00 |
02 |
01 |
2013 |
2013.01 |
01.02 |
1 |
25 |
2552 |
899.00 |
1.0 |
899.00 |
03 |
01 |
2013 |
2013.01 |
01.03 |
3 |
25 |
2554 |
1709.05 |
1.0 |
1709.05 |
06 |
01 |
2013 |
2013.01 |
01.06 |
4 |
25 |
2555 |
1099.00 |
1.0 |
1099.00 |
15 |
01 |
2013 |
2013.01 |
01.15 |
5 |
25 |
2564 |
349.00 |
1.0 |
349.00 |
10 |
01 |
2013 |
2013.01 |
01.10 |
sales.to_csv('data/sales_pre.csv', index=False)
댓글