The panda is cute, but pandas...

Updated: March 10, 2020

Pandas Study

Studying the pandas library, a foundation for data analysis

groupby is available, and aggregations such as sum() and mean() can be chained onto it

  df1 = df.groupby('col1')
  df1 = df1.sum()
  c = chipo.groupby('choice_description').sum().sort_values(by='quantity', ascending=False)
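
As a rough illustration of the pattern above, here is a minimal sketch on a small made-up table. The `orders` DataFrame and its columns are hypothetical stand-ins for the `chipo` data, which is not shown in these notes.

```python
import pandas as pd

# Hypothetical order data, invented for illustration only.
orders = pd.DataFrame({
    "item": ["taco", "bowl", "taco", "bowl", "salad"],
    "quantity": [1, 2, 3, 1, 1],
    "price": [3.0, 8.5, 9.0, 8.5, 7.0],
})

# Sum the numeric columns per item, then rank items by total quantity.
totals = (
    orders.groupby("item")[["quantity", "price"]]
    .sum()
    .sort_values(by="quantity", ascending=False)
)
print(totals)

# mean() chains the same way, here on a single selected column.
avg_price = orders.groupby("item")["price"].mean()
print(avg_price)
```

Selecting the numeric columns before calling sum() keeps string columns out of the aggregation.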

Apply a function with apply, for example a lambda

  df['col1'] = df['col1'].apply(lambda x: float(x))

Count how often each value occurs in a column, and how many distinct values it has

  df['col1'].value_counts()
  df['col1'].nunique()
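
The difference between the two calls is easy to mix up, so here is a small sketch on a made-up Series (the data is hypothetical).

```python
import pandas as pd

colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])

# value_counts: number of rows per distinct value, most frequent first.
counts = colors.value_counts()
print(counts)

# nunique: how many distinct values exist at all.
n_distinct = colors.nunique()
print(n_distinct)
```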

describe() for every feature

  df.describe(include='all') # default = only numeric col

Filtering with and / or

  df[(df['col']>20) & (df['col1']=='d')] # and: combine the masks with &
  df[(df['col']>20) | (df['col']<10)] # or: combine the masks with |
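
A minimal sketch of boolean-mask filtering on a made-up frame. Each comparison needs its own parentheses, because `&` and `|` bind more tightly than `>` or `==` in Python.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "col": [5, 15, 25, 35],
    "col1": ["d", "d", "e", "d"],
})

# AND: both conditions must hold on the same row.
both = df[(df["col"] > 20) & (df["col1"] == "d")]

# OR: either condition is enough.
either = df[(df["col"] > 20) | (df["col"] < 10)]

print(both)
print(either)
```

Combining the masks in one indexing step avoids filtering an already-filtered frame with a mask built from the full one.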

Get several statistics at once after grouping

  df.groupby('col')['col2'].agg(['max', 'min', 'median'])

Use agg with a dictionary after grouping

  gender_ocup = users.groupby(['occupation', 'gender']).agg({'gender': 'count'})
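
To make the two agg styles concrete, here is a sketch on an invented `users` table. The columns and values are assumptions, not the dataset used above.

```python
import pandas as pd

# Hypothetical users table for illustration.
users = pd.DataFrame({
    "occupation": ["doctor", "doctor", "artist", "artist", "artist"],
    "gender": ["F", "M", "F", "F", "M"],
    "age": [40, 50, 25, 35, 30],
})

# List form: several statistics for one column.
age_stats = users.groupby("occupation")["age"].agg(["max", "min", "median"])

# Dict form: a (possibly different) statistic per column.
per_col = users.groupby("occupation").agg({"age": "mean", "gender": "count"})

print(age_stats)
print(per_col)
```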

View a groupby result like a pivot table

  df.groupby(['col1', 'col2'])['col3'].mean().unstack()
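
A small sketch of what unstack() does to a two-key groupby result, using a made-up `sales` frame. The last line shows `pivot_table`, which should give the same layout for this data.

```python
import pandas as pd

# Hypothetical sales data for illustration.
sales = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "year": [2019, 2020, 2019, 2020],
    "amount": [10.0, 20.0, 30.0, 40.0],
})

# The inner group key ('year') moves from the row index to the columns.
table = sales.groupby(["region", "year"])["amount"].mean().unstack()
print(table)

# The same layout built directly as a pivot table.
pivot = sales.pivot_table(index="region", columns="year",
                          values="amount", aggfunc="mean")
print(pivot)
```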

Iterating over a groupby

Each item is a (group key, DataFrame of that group's rows) pair.

  for key, group in df.groupby('col1'):
      print(key)  # the group key, i.e. a value of col1
      print(group)  # the rows belonging to that group
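
A minimal sketch of the iteration on made-up data, collecting the size of each group so the (key, group) pairing is visible.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"col1": ["a", "b", "a"], "col2": [1, 2, 3]})

sizes = {}
for key, group in df.groupby("col1"):
    # key is the col1 value; group is a DataFrame with every column,
    # including col1 itself.
    sizes[key] = len(group)
    print(key)
    print(group)

print(sizes)
```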

Convert data to datetime

  df['col'] = pd.to_datetime(df['col'], format='%Y')
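
As a sketch of what `format='%Y'` does, assume the column holds year strings (an assumption; the original data is not shown). Each year is parsed to January 1 of that year.

```python
import pandas as pd

# Hypothetical column of year strings.
df = pd.DataFrame({"col": ["2018", "2019", "2020"]})

df["col"] = pd.to_datetime(df["col"], format="%Y")
print(df["col"])

# Once the column is datetime, the .dt accessor becomes available.
years = df["col"].dt.year.tolist()
print(years)
```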

Setting the index

reset_index resets the index to the integers 0 through n-1.

  df.set_index('col', inplace=True)
  df = df.set_index('col', drop=True)
  df.reset_index(drop=True, inplace=True)
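
A short sketch of the round trip on made-up data, showing what `drop=True` changes in reset_index.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"col": ["x", "y", "z"], "val": [1, 2, 3]})

# Use 'col' as the index so rows can be looked up by label.
indexed = df.set_index("col")
by_label = indexed.loc["y", "val"]

# After filtering, the index has gaps relative to 0..n-1.
filtered = indexed[indexed["val"] > 1]

# drop=True discards the old index and renumbers from 0.
renumbered = filtered.reset_index(drop=True)

# Without drop, the old index comes back as a regular column.
restored = filtered.reset_index()

print(renumbered)
print(restored)
```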

Deleting columns

To delete several columns at once: df.drop(df.columns[[1,2,3]], axis=1)

  del df['col']
  df = df.drop(df.columns[[1,2,3]], axis=1)

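Resampling a time series

  ts.resample('5min', closed='right').mean() # downsampling; the first argument is a frequency rule such as '5min', not a column name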
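
A minimal downsampling sketch, assuming `ts` is a Series with a DatetimeIndex. The one-minute data and the `'3min'` rule are made up for illustration.

```python
import pandas as pd

# Hypothetical series: one value per minute for six minutes.
idx = pd.date_range("2020-03-10 00:00", periods=6, freq="min")
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)

# Downsample to 3-minute bins and average each bin.
down = ts.resample("3min").mean()
print(down)

# closed='right' moves the bin edges so each bin includes its right
# endpoint; the total across bins is unchanged.
right_sums = ts.resample("3min", closed="right").sum()
print(right_sums)
```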

idxmin, idxmax

  df.idxmin() # index label of the minimum value in each column
  df.idxmax() # index label of the maximum value in each column
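
On a DataFrame these return one index label per column, which the following sketch on made-up data shows.

```python
import pandas as pd

# Hypothetical data with string index labels.
df = pd.DataFrame({"a": [3, 1, 2], "b": [10, 30, 20]},
                  index=["x", "y", "z"])

lowest = df.idxmin()   # label of the smallest value in each column
highest = df.idxmax()  # label of the largest value in each column

print(lowest)
print(highest)
```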

df concat

  df = pd.concat([df1, df2], axis=0) # same result as the older df1.append(df2), which was removed in pandas 2.0

df merge (very similar to a database join query)

  pd.merge(df1, df2, on='key_col', how='outer')
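
To contrast concat (stacking rows) with merge (joining on a key), here is a sketch with two small made-up frames. The column names are hypothetical.

```python
import pandas as pd

# Hypothetical frames sharing a key column.
df1 = pd.DataFrame({"key_col": [1, 2], "left_val": ["a", "b"]})
df2 = pd.DataFrame({"key_col": [2, 3], "right_val": ["c", "d"]})

# concat with axis=0 stacks rows; ignore_index renumbers them.
stacked = pd.concat([df1, df1], axis=0, ignore_index=True)

# Outer join keeps keys from both sides and fills gaps with NaN.
outer = pd.merge(df1, df2, on="key_col", how="outer")

# Inner join keeps only keys present on both sides.
inner = pd.merge(df1, df2, on="key_col", how="inner")

print(stacked)
print(outer)
print(inner)
```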

A column built from datetime.date(yy, mm, dd) values is stored with object dtype
- Convert it to datetime64 with pd.to_datetime(df['col']).
```
  df['col'] = pd.to_datetime(df['col'])
```
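
A small sketch of the dtype change, using two made-up dates.

```python
import datetime
import pandas as pd

# Hypothetical column of plain datetime.date objects.
df = pd.DataFrame({
    "col": [datetime.date(2020, 3, 10), datetime.date(2020, 3, 11)]
})
before = df["col"].dtype  # object: pandas treats the values as opaque

df["col"] = pd.to_datetime(df["col"])
after = df["col"].dtype   # a datetime64 dtype

print(before, after)
```
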
Missing values
- Count missing values: isnull (its opposite is notnull)
- Drop missing values: dropna()
```
  df.isnull().sum()
  df.dropna()
```
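
A sketch on made-up data. Note that dropna() returns a new DataFrame and leaves the original untouched unless the result is assigned back.

```python
import pandas as pd

# Hypothetical frame with one missing value per column.
df = pd.DataFrame({
    "a": [1.0, float("nan"), 3.0],
    "b": ["x", "y", None],
})

# isnull() gives a boolean frame; sum() counts the True values per column.
missing_per_col = df.isnull().sum()
print(missing_per_col)

# dropna() removes every row that has at least one missing value.
cleaned = df.dropna()
print(cleaned)
```
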
Sum everything (column sums, then the sum of those)
```
  df.sum().sum()
```
Row-wise computation
```
  df['rowmin'] = df[['col1', 'col2']].min(axis=1)  # axis=1 needs a DataFrame (several columns), not a single Series
```
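
A minimal sketch of a row-wise reduction on made-up data. `axis=1` only makes sense on a DataFrame selection, since a single column is a Series and has no second axis.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"col1": [3, 1], "col2": [2, 5]})

# Default (axis=0): one result per column.
col_min = df[["col1", "col2"]].min()

# axis=1: one result per row, computed across the selected columns.
df["rowmin"] = df[["col1", "col2"]].min(axis=1)

print(col_min)
print(df)
```
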
Get index values (here, level 1 of a MultiIndex)
```
  df.index.get_level_values(1)
```
Does the index have duplicates? (is_unique is True when it has none)
```
  df.index.is_unique
```
Sort by index
```
  df.sort_index()
```
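
The three index snippets above can be tried together on a small MultiIndex frame. The level names and values here are made up.

```python
import pandas as pd

# Hypothetical two-level index, deliberately unsorted.
idx = pd.MultiIndex.from_tuples(
    [("b", 2), ("a", 1), ("b", 1)], names=["outer", "inner"]
)
df = pd.DataFrame({"val": [10, 20, 30]}, index=idx)

# Values of the second level (position 1), in row order.
inner = df.index.get_level_values(1)

# True when no (outer, inner) pair appears twice.
unique = df.index.is_unique

# Sorts by the first level, then the second; returns a new frame.
sorted_df = df.sort_index()

print(list(inner))
print(unique)
print(sorted_df)
```
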
Express a timedelta in days
- See the timedelta object documentation for details
```
  timedelta.days
```
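
A sketch of `.days` on a plain `datetime.timedelta`, plus the column-wise counterpart through the `.dt` accessor. The dates and column names are made up.

```python
import datetime
import pandas as pd

# Subtracting two dates gives a timedelta; .days is its whole-day part.
delta = datetime.date(2020, 3, 10) - datetime.date(2020, 1, 1)
print(delta.days)

# For a whole column of timedeltas, use the .dt accessor instead.
df = pd.DataFrame({
    "start": pd.to_datetime(["2020-01-01", "2020-02-01"]),
    "end": pd.to_datetime(["2020-03-10", "2020-02-03"]),
})
df["days"] = (df["end"] - df["start"]).dt.days
print(df)
```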