Python_pandas

pandas

https://pandas.pydata.org/

The Big Data Era

  • Analysis processes exist to extract useful information from data
  • pandas is a tool optimized for collecting and organizing that data

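pandas Data Structures

  • Data collected from many different sources for analysis varies widely in form and attributes
  • Data in different formats must be unified into a single consistent structure that a computer can process
  • pandas provides two structured data types: Series (1-D) and DataFrame (2-D)
  • Their purpose is to organize many different kinds of data into one common format
  • DataFrame: a 2-D structure of rows and columns, used frequently in practical data analysis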
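
As a minimal sketch of the two structures described above (the names and values here are made up for illustration), a Series holds one labeled column of data and a DataFrame holds several columns that share the same row labels:

```python
import pandas as pd

# Hypothetical data: a 1-D Series and a 2-D DataFrame with the same row labels.
ages = pd.Series([31, 25, 47], index=["kim", "lee", "park"], name="age")
people = pd.DataFrame(
    {"age": [31, 25, 47], "city": ["Seoul", "Busan", "Daegu"]},
    index=["kim", "lee", "park"],
)

print(ages.ndim, people.ndim)   # number of dimensions of each structure
print(type(people["age"]))      # selecting one column returns a Series
```

Selecting a single column from the DataFrame gives back a Series, which is why the two types are usually learned together.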

1. Series

  • A 1-D array in which data values are arranged sequentially
  • Each index label corresponds one-to-one with a data value
  • Structurally similar to a Python dictionary

Dictionary ==> Series

pandas.Series(dictionary)

In [3]:
import pandas as pd 
In [4]:
dict_data= {'a':1, 'b':2, 'c':3}
sr=pd.Series(dict_data)
print(type(sr))
print()
print(sr)
<class 'pandas.core.series.Series'>

a    1
b    2
c    3
dtype: int64
In [5]:
obj=pd.Series([4,7,-5,3]) # with no index given, the default labels are 0, 1, 2, 3, ...
print(obj)
0    4
1    7
2   -5
3    3
dtype: int64

Series index / values

  • Series.index : the array of index labels
  • Series.values : the array of data values
In [6]:
print(obj.values)
print(obj.index)
[ 4  7 -5  3]
RangeIndex(start=0, stop=4, step=1)
In [7]:
import pandas as pd
obj2=pd.Series([4,7,-5,3], index=['d', 'b', 'a', 'c'])
print(obj2)
print(obj2.index) 
d    4
b    7
a   -5
c    3
dtype: int64
Index(['d', 'b', 'a', 'c'], dtype='object')
In [8]:
import numpy as np 
import pandas as pd
list_A=np.array(list('abcdef'))
list_B= np.arange(10,70,10)

dict_data={key:value for key,value in zip(list_A,list_B)}
print(dict_data)
{'a': 10, 'b': 20, 'c': 30, 'd': 40, 'e': 50, 'f': 60}
In [9]:
sr=pd.Series(dict_data)
sr
Out[9]:
a    10
b    20
c    30
d    40
e    50
f    60
dtype: int64
In [10]:
# Simpler than the steps above
import numpy as np 
import pandas as pd
list_A=np.array(list('abcdef'))
list_B= np.arange(10,70,10)

sr=pd.Series(list_B, index=list_A)
for i in range(sr.size):
    key=sr.index[i]
    print("sr['{}'] : {} or sr[{}] : {}".format(key,sr[key],i, sr.values[i]))
sr['a'] : 10 or sr[0] : 10
sr['b'] : 20 or sr[1] : 20
sr['c'] : 30 or sr[2] : 30
sr['d'] : 40 or sr[3] : 40
sr['e'] : 50 or sr[4] : 50
sr['f'] : 60 or sr[5] : 60
In [11]:
print(sr['a'], sr.iloc[0], sr.values[0]) # all the same value; iloc makes the positional lookup explicit
10 10 10
In [12]:
print(sr.index[0])
a
In [13]:
print(obj2)
print()
print(obj2[obj2>0])
print()
print(obj2*2)
print()
print(np.exp(obj2))
d    4
b    7
a   -5
c    3
dtype: int64

d    4
b    7
c    3
dtype: int64

d     8
b    14
a   -10
c     6
dtype: int64

d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64
In [14]:
print('b' in obj2)
print('e' in obj2)
True
False
In [15]:
sdata= {'ohio':35000, 'Texas':71000, 'Oregon':16000, 'Utah':5000}
obj3=pd.Series(sdata)
print(obj3)
ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
In [16]:
states= ['Callifornia', 'ohio','Texas','Oregon']
print(type(states))
obj4=pd.Series(sdata, index=states)
print(obj4)
<class 'list'>
Callifornia        NaN
ohio           35000.0
Texas          71000.0
Oregon         16000.0
dtype: float64
In [17]:
import pandas as pd
print(pd.isnull(obj4)) # True where the value is missing
print() 
print(pd.notnull(obj4)) # True where the value is present
Callifornia     True
ohio           False
Texas          False
Oregon         False
dtype: bool

Callifornia    False
ohio            True
Texas           True
Oregon          True
dtype: bool
In [18]:
print(obj4.isnull())
Callifornia     True
ohio           False
Texas          False
Oregon         False
dtype: bool
In [19]:
print(obj3)
print()
print(obj4)
print()
print(obj3+obj4)
ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

Callifornia        NaN
ohio           35000.0
Texas          71000.0
Oregon         16000.0
dtype: float64

Callifornia         NaN
Oregon          32000.0
Texas          142000.0
Utah                NaN
ohio            70000.0
dtype: float64
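
The sum above aligns the two Series on their index labels, so any label present on only one side comes out as NaN. A minimal sketch of that behavior, with smaller hypothetical Series, and of `Series.add` with `fill_value` as one way to treat a missing label as 0 instead:

```python
import pandas as pd

# Hypothetical Series; each one has a label the other lacks.
left = pd.Series({"ohio": 35000, "Texas": 71000, "Utah": 5000})
right = pd.Series({"ohio": 35000, "Texas": 71000, "Oregon": 16000})

plain = left + right                     # labels missing on either side give NaN
filled = left.add(right, fill_value=0)   # a missing label is treated as 0

print(plain)
print(filled)
```

Whether filling with 0 is appropriate depends on the data; for population counts a missing state is probably better left as NaN.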
In [20]:
#print(obj4.name)

obj4.name='population'
obj4.index.name='state' 
print(obj4)
state
Callifornia        NaN
ohio           35000.0
Texas          71000.0
Oregon         16000.0
Name: population, dtype: float64
In [21]:
print(obj)
0    4
1    7
2   -5
3    3
dtype: int64
In [22]:
obj.index=['Bob', 'Steve', 'Jeff', 'Ryan']
print(obj)
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

2. DataFrame

  • A 2-D array
  • Derived from the data frame in R
  • The same tabular layout is used in Excel, relational databases, and similar tools
  • Each column is itself a Series object
In [23]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'], # key: list of column values
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
In [24]:
frame
Out[24]:
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
5 Nevada 2003 3.2
In [25]:
frame.head() 
Out[25]:
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
In [26]:
frame.tail()
Out[26]:
state year pop
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
5 Nevada 2003 3.2
In [27]:
# the column order can be changed
pd.DataFrame(data, columns=['year','state','pop'])
Out[27]:
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
5 2003 Nevada 3.2

Setting row labels / column names: pandas.DataFrame(2-D array, index=array of row labels, columns=array of column names)

In [28]:
import pandas as pd
frame2= pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['one', 'two', 'three', 'four', 'five', 'six'])
frame2
Out[28]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
six 2003 Nevada 3.2 NaN

Renaming row labels: DataFrame.rename(index={old label: new label, ...})

Renaming columns: DataFrame.rename(columns={old name: new name, ...})

In [33]:
print(frame2.columns)
Index(['year', 'state', 'pop', 'debt'], dtype='object')
In [34]:
frame2.rename(columns={'year': 'YEA', 'state':'STA', 'pop':'POP', 'debt':'DEBT'}, inplace=True)

frame2.rename(index={'one': '01', 'two':'02'}, inplace=True)

frame2
Out[34]:
YEA STA POP DEBT
01 2000 Ohio 1.5 NaN
02 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
six 2003 Nevada 3.2 NaN
In [35]:
frame2['STA']
Out[35]:
01         Ohio
02         Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: STA, dtype: object
In [36]:
frame2.YEA
Out[36]:
01       2000
02       2001
three    2002
four     2001
five     2002
six      2003
Name: YEA, dtype: int64

.iloc[[rows],[columns]]

  • Uses the integer position of rows and columns; integers only

.loc[[rows],[columns]]

  • Uses the DataFrame's index labels; any label type works
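
A minimal sketch of the difference, using a small hypothetical frame: `.iloc` takes integer positions and `.loc` takes labels, and the two also treat the end of a slice differently.

```python
import pandas as pd

# Hypothetical frame with string row labels.
df = pd.DataFrame(
    {"year": [2000, 2001, 2002], "pop": [1.5, 1.7, 3.6]},
    index=["one", "two", "three"],
)

by_position = df.iloc[[0, 2], [1]]             # rows 0 and 2, column 1 (positions)
by_label = df.loc[["one", "three"], ["pop"]]   # the same cells selected by label
print(by_position.equals(by_label))

# Slices differ: iloc excludes the end position, loc includes the end label.
print(len(df.iloc[0:2]), len(df.loc["one":"three"]))
```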
In [37]:
frame2
Out[37]:
YEA STA POP DEBT
01 2000 Ohio 1.5 NaN
02 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
six 2003 Nevada 3.2 NaN
In [38]:
frame2.loc['three']
Out[38]:
YEA     2002
STA     Ohio
POP      3.6
DEBT     NaN
Name: three, dtype: object
In [39]:
frame2.iloc[2]
Out[39]:
YEA     2002
STA     Ohio
POP      3.6
DEBT     NaN
Name: three, dtype: object
In [40]:
frame2['DEB']=16.5 # assigns the scalar to every row; creates the column 'DEB' because it does not exist yet
frame2
Out[40]:
YEA STA POP DEBT DEB
01 2000 Ohio 1.5 NaN 16.5
02 2001 Ohio 1.7 NaN 16.5
three 2002 Ohio 3.6 NaN 16.5
four 2001 Nevada 2.4 NaN 16.5
five 2002 Nevada 2.9 NaN 16.5
six 2003 Nevada 3.2 NaN 16.5
In [41]:
frame2['DEB']=np.arange(1,13,2)
frame2
Out[41]:
YEA STA POP DEBT DEB
01 2000 Ohio 1.5 NaN 1
02 2001 Ohio 1.7 NaN 3
three 2002 Ohio 3.6 NaN 5
four 2001 Nevada 2.4 NaN 7
five 2002 Nevada 2.9 NaN 9
six 2003 Nevada 3.2 NaN 11
In [42]:
val=pd.Series([-1.2,-1.5,-1.7], index=['02', 'four', 'six'])
frame2['DEB']=val
frame2
Out[42]:
YEA STA POP DEBT DEB
01 2000 Ohio 1.5 NaN NaN
02 2001 Ohio 1.7 NaN -1.2
three 2002 Ohio 3.6 NaN NaN
four 2001 Nevada 2.4 NaN -1.5
five 2002 Nevada 2.9 NaN NaN
six 2003 Nevada 3.2 NaN -1.7
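
Assigning a Series to a column, as in the cell above, matches values by index label rather than by position. A minimal sketch with a hypothetical frame, including a label that does not exist in the frame:

```python
import pandas as pd

df = pd.DataFrame({"pop": [1.5, 1.7, 3.6]}, index=["one", "two", "three"])
partial = pd.Series([-1.2, -1.7], index=["two", "four"])   # 'four' is not in df

df["debt"] = partial   # aligned on index labels, not on position
print(df)
```

Rows with no matching label receive NaN, and the unmatched label `'four'` is dropped rather than added as a new row.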
In [43]:
frame2['eastern']=frame2.STA=='Ohio'
frame2
Out[43]:
YEA STA POP DEBT DEB eastern
01 2000 Ohio 1.5 NaN NaN True
02 2001 Ohio 1.7 NaN -1.2 True
three 2002 Ohio 3.6 NaN NaN True
four 2001 Nevada 2.4 NaN -1.5 False
five 2002 Nevada 2.9 NaN NaN False
six 2003 Nevada 3.2 NaN -1.7 False
In [44]:
frame2['Big_State']=(frame2.STA=='Ohio') & (frame2.POP>3.0)
frame2 
Out[44]:
YEA STA POP DEBT DEB eastern Big_State
01 2000 Ohio 1.5 NaN NaN True False
02 2001 Ohio 1.7 NaN -1.2 True False
three 2002 Ohio 3.6 NaN NaN True True
four 2001 Nevada 2.4 NaN -1.5 False False
five 2002 Nevada 2.9 NaN NaN False False
six 2003 Nevada 3.2 NaN -1.7 False False
In [45]:
del frame2['eastern']
frame2
Out[45]:
YEA STA POP DEBT DEB Big_State
01 2000 Ohio 1.5 NaN NaN False
02 2001 Ohio 1.7 NaN -1.2 False
three 2002 Ohio 3.6 NaN NaN True
four 2001 Nevada 2.4 NaN -1.5 False
five 2002 Nevada 2.9 NaN NaN False
six 2003 Nevada 3.2 NaN -1.7 False
In [46]:
del frame2['Big_State']
frame2
Out[46]:
YEA STA POP DEBT DEB
01 2000 Ohio 1.5 NaN NaN
02 2001 Ohio 1.7 NaN -1.2
three 2002 Ohio 3.6 NaN NaN
four 2001 Nevada 2.4 NaN -1.5
five 2002 Nevada 2.9 NaN NaN
six 2003 Nevada 3.2 NaN -1.7

Nested dictionaries

In [47]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
In [48]:
frame3= pd.DataFrame(pop)
frame3
Out[48]:
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2000 NaN 1.5
In [49]:
frame3.T
Out[49]:
2001 2002 2000
Nevada 2.4 2.9 NaN
Ohio 1.7 3.6 1.5
In [50]:
pd.DataFrame(pop, index=[2001,2002,2003]) 
Out[50]:
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2003 NaN NaN
In [51]:
frame3
Out[51]:
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2000 NaN 1.5
In [52]:
print(frame3.iloc[0,0])
print(frame3.iloc[0,1])
print(frame3.iloc[1,0])
print(frame3.iloc[1,1])
2.4
1.7
2.9
3.6
In [53]:
frame3.iloc[0,[0,1]]
Out[53]:
Nevada    2.4
Ohio      1.7
Name: 2001, dtype: float64
In [54]:
frame3.iloc[0,0:]
Out[54]:
Nevada    2.4
Ohio      1.7
Name: 2001, dtype: float64
In [55]:
pdata= {'Ohio' : frame3['Ohio'][:-1], 'Nevada' : frame3['Nevada'][:-2]}
pd.DataFrame(pdata)
Out[55]:
Ohio Nevada
2001 1.7 2.4
2002 3.6 NaN
In [56]:
import pandas as pd
import seaborn as sns

titanic = sns.load_dataset('titanic')
In [57]:
titanic.head()
Out[57]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
In [58]:
titanic.tail()
Out[58]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
886 0 2 male 27.0 0 0 13.00 S Second man True NaN Southampton no True
887 1 1 female 19.0 0 0 30.00 S First woman False B Southampton yes True
888 0 3 female NaN 1 2 23.45 S Third woman False NaN Southampton no False
889 1 1 male 26.0 0 0 30.00 C First man True C Cherbourg yes True
890 0 3 male 32.0 0 0 7.75 Q Third man True NaN Queenstown no True
In [59]:
df = titanic.loc[:,['age', 'fare']]
In [60]:
df.head()
Out[60]:
age fare
0 22.0 7.2500
1 38.0 71.2833
2 26.0 7.9250
3 35.0 53.1000
4 35.0 8.0500
In [61]:
df.tail()
Out[61]:
age fare
886 27.0 13.00
887 19.0 30.00
888 NaN 23.45
889 26.0 30.00
890 32.0 7.75
In [62]:
df_add10= df+ 10
In [63]:
df_add10.head()
Out[63]:
age fare
0 32.0 17.2500
1 48.0 81.2833
2 36.0 17.9250
3 45.0 63.1000
4 45.0 18.0500
In [64]:
print(type(df_add10))
<class 'pandas.core.frame.DataFrame'>
In [65]:
df_sub= df_add10-df
df_sub 
Out[65]:
age fare
0 10.0 10.0
1 10.0 10.0
2 10.0 10.0
3 10.0 10.0
4 10.0 10.0
... ... ...
886 10.0 10.0
887 10.0 10.0
888 NaN 10.0
889 10.0 10.0
890 10.0 10.0

891 rows × 2 columns

Index objects

In [66]:
obj=pd.Series(range(3), index=['a', 'b', 'c'])
index= obj.index
print(index)
index[1:]
Index(['a', 'b', 'c'], dtype='object')
Out[66]:
Index(['b', 'c'], dtype='object')
In [67]:
import numpy as np
import pandas as pd 
labels=pd.Index(np.arange(3))
print(labels)
print()
obj2=pd.Series([1.5, -2.5, 0], index=labels)
print(obj2)
Int64Index([0, 1, 2], dtype='int64')

0    1.5
1   -2.5
2    0.0
dtype: float64
In [68]:
obj2.index is labels
Out[68]:
True
In [69]:
dup_labels=pd.Index(['foo', 'foo', 'bar', 'bar']) # duplicate labels are allowed
dup_labels
Out[69]:
Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
In [70]:
obj=pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
Out[70]:
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
In [71]:
obj2=obj.reindex(['a','b','c','d','e'])
obj2
Out[71]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
In [72]:
obj3=pd.Series(['blue', 'purple', 'yellow'], index=[0,2,4])
obj3
Out[72]:
0      blue
2    purple
4    yellow
dtype: object
In [73]:
obj3.reindex(range(6), method="ffill") # forward fill: each new label takes the preceding value instead of NaN
Out[73]:
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
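
Besides `method="ffill"`, `reindex` also accepts `fill_value` to put a constant in the newly created slots. A minimal sketch comparing the two on a hypothetical Series (the placeholder string `"unknown"` is an arbitrary choice):

```python
import pandas as pd

colors = pd.Series(["blue", "purple"], index=[0, 2])

padded = colors.reindex(range(4), fill_value="unknown")   # constant for new labels
carried = colors.reindex(range(4), method="ffill")        # copy the previous label's value

print(list(padded))
print(list(carried))
```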
In [74]:
import numpy as np 
import pandas as pd 

frame=pd.DataFrame(np.arange(9).reshape((3,3)), index=['a', 'c', 'd'], columns=['Ohio', 'Texas', 'California'])
frame
Out[74]:
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
In [75]:
frame2= frame.reindex(['a','b','c','d'])
frame2
Out[75]:
Ohio Texas California
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0
In [76]:
states=['Texas', 'Utah', 'California']
frame.reindex(columns=states)
Out[76]:
Texas Utah California
a 1 NaN 2
c 4 NaN 5
d 7 NaN 8
In [77]:
obj=pd.Series(np.arange(5.), index=['a','b','c','d','e'])
obj
Out[77]:
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
In [78]:
new_obj= obj.drop('c')
new_obj
Out[78]:
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
In [79]:
new_obj2=obj.drop(['d', 'c'])
new_obj2
Out[79]:
a    0.0
b    1.0
e    4.0
dtype: float64
In [80]:
data=pd.DataFrame(np.arange(16).reshape((4,4)), index= ['Ohio','Colorado','Utah', 'New York'] ,columns=['one', 'two','three', 'four'])
data
Out[80]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [81]:
data.drop(['Colorado', 'Ohio']) # drop removes rows by default
Out[81]:
one two three four
Utah 8 9 10 11
New York 12 13 14 15
In [82]:
data.drop('two', axis=1) # drop a column
Out[82]:
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
In [83]:
data2= data.drop('two', axis=1)
data2.drop('Utah', axis=0)
Out[83]:
one three four
Ohio 0 2 3
Colorado 4 6 7
New York 12 14 15
In [84]:
data.drop(['two','four'], axis = 1)
Out[84]:
one three
Ohio 0 2
Colorado 4 6
Utah 8 10
New York 12 14
In [85]:
data.drop('Ohio', axis='rows') # axis='rows' (or axis=0) is the default and can be omitted
Out[85]:
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [86]:
data.drop('Ohio')
Out[86]:
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [87]:
data
Out[87]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [88]:
data3=data.copy() 
data3.drop("Ohio", inplace=True)
data3
Out[88]:
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
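
The cells above show that `drop` leaves the original untouched unless `inplace=True` is passed. A minimal sketch of the same point on a hypothetical frame, using reassignment as an alternative to `inplace=True`:

```python
import pandas as pd

df = pd.DataFrame({"one": [0, 4], "two": [1, 5]}, index=["Ohio", "Utah"])

trimmed = df.drop("Ohio")        # returns a new DataFrame
print(df.shape, trimmed.shape)   # the original keeps both rows

df = df.drop(columns="two")      # rebinding the name instead of inplace=True
print(list(df.columns))
```

Reassigning the result keeps each step explicit, which some readers find easier to follow than in-place mutation.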

Indexing

In [89]:
obj= pd.Series(np.arange(4.), index=['a','b','c','d'])
obj
Out[89]:
a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64
In [90]:
print(obj['b'], obj.iloc[1]); print()      # label lookup and explicit positional lookup
print(obj[2:4])
print(obj[['b','a','d']])
print(obj.iloc[[1,3]]); print()            # positions 1 and 3
print((range(4),obj.index is obj))
1.0 1.0

c    2.0
d    3.0
dtype: float64
b    1.0
a    0.0
d    3.0
dtype: float64
b    1.0
d    3.0
dtype: float64

(range(0, 4), False)
In [91]:
obj['b':'c']=5
obj
Out[91]:
a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64
In [92]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
Out[92]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [93]:
data['two']
Out[93]:
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32
In [94]:
data[['three','one']]
Out[94]:
three one
Ohio 2 0
Colorado 6 4
Utah 10 8
New York 14 12
In [95]:
data[:2]
Out[95]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
In [96]:
data[data['three']>5]
Out[96]:
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [97]:
data[data<5] =0 # cells where data<5 is True are set to 0
data
Out[97]:
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [98]:
data.loc['Colorado', ['two', 'three']]
Out[98]:
two      5
three    6
Name: Colorado, dtype: int32
In [99]:
data
Out[99]:
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [100]:
data.iloc[2,[3,0,1]]
Out[100]:
four    11
one      8
two      9
Name: Utah, dtype: int32
In [101]:
data.iloc[[1,2], [3,0,1]]
Out[101]:
four one two
Colorado 7 0 5
Utah 11 8 9
In [102]:
data.loc[:'Utah', 'two']
Out[102]:
Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int32
In [103]:
data.iloc[:,:3][data.three>5]
Out[103]:
one two three
Colorado 0 5 6
Utah 8 9 10
New York 12 13 14
In [104]:
ser = pd.Series(np.arange(3.))
ser
Out[104]:
0    0.0
1    1.0
2    2.0
dtype: float64
In [105]:
ser[:1]
ser.loc[:1]
ser.iloc[:1]
Out[105]:
0    0.0
dtype: float64
In [106]:
print(ser[:1]); print()
print(ser.loc[:1]); print() # 1 is a label here; loc slices include the end label
print(ser.iloc[:1])
0    0.0
dtype: float64

0    0.0
1    1.0
dtype: float64

0    0.0
dtype: float64
In [107]:
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
Out[107]:
b d e
Utah 0.579217 -0.279336 -0.170469
Ohio -1.724667 -1.571901 -0.108527
Texas -0.961463 -0.701714 1.134606
Oregon -0.737585 -0.111093 1.484404
In [108]:
np.abs(frame) # absolute value
Out[108]:
b d e
Utah 0.579217 0.279336 0.170469
Ohio 1.724667 1.571901 0.108527
Texas 0.961463 0.701714 1.134606
Oregon 0.737585 0.111093 1.484404

np.random.randn: generates a matrix of random numbers drawn from a Gaussian (normal) distribution with mean 0 and standard deviation 1
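
Because these values are random, rerunning the cells gives different numbers from the outputs printed here. A minimal sketch of one way to make such a frame reproducible, using NumPy's `default_rng` with a fixed seed (the seed value 0 is an arbitrary assumption, not something the original notebook used):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)      # arbitrary seed, chosen for illustration
values = rng.standard_normal((4, 3))     # standard normal draws, like np.random.randn(4, 3)
frame = pd.DataFrame(values, columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

# Re-creating the generator with the same seed reproduces the same numbers.
again = np.random.default_rng(seed=0).standard_normal((4, 3))
print(np.array_equal(values, again))
```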

In [109]:
f=lambda x:x.max()-x.min()
frame.apply(f)
Out[109]:
b    2.303884
d    1.460808
e    1.654872
dtype: float64
In [110]:
frame.apply(f, axis='columns')
Out[110]:
Utah      0.858553
Ohio      1.616140
Texas     2.096070
Oregon    2.221988
dtype: float64
In [111]:
frame
Out[111]:
b d e
Utah 0.579217 -0.279336 -0.170469
Ohio -1.724667 -1.571901 -0.108527
Texas -0.961463 -0.701714 1.134606
Oregon -0.737585 -0.111093 1.484404
In [112]:
def f(x):
    return pd.Series([x.min(), x.max()], index=['min','max'])
frame.apply(f)
Out[112]:
b d e
min -1.724667 -1.571901 -0.170469
max 0.579217 -0.111093 1.484404
In [113]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c']) 
obj
Out[113]:
d    0
a    1
b    2
c    3
dtype: int64

Sorting by index

In [114]:
obj.sort_index()
Out[114]:
a    1
b    2
c    3
d    0
dtype: int64
In [115]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c']) 
frame
Out[115]:
d a b c
three 0 1 2 3
one 4 5 6 7
In [116]:
frame.sort_index() # sort the rows by label (ascending)
Out[116]:
d a b c
one 4 5 6 7
three 0 1 2 3
In [117]:
frame.sort_index(axis=1) # sort the columns by name
Out[117]:
a b c d
three 1 2 3 0
one 5 6 7 4
In [118]:
frame.sort_index(axis='columns') # sort the columns by name
Out[118]:
a b c d
three 1 2 3 0
one 5 6 7 4
In [119]:
frame.sort_index(axis='columns',ascending=False) # descending order
Out[119]:
d c b a
three 0 3 2 1
one 4 7 6 5
In [120]:
frame.sort_index(axis='columns',ascending=True) # ascending order
Out[120]:
a b c d
three 1 2 3 0
one 5 6 7 4
In [121]:
obj = pd.Series([4, 7, -3, 2]) 
obj
Out[121]:
0    4
1    7
2   -3
3    2
dtype: int64
In [122]:
obj.sort_values() # sort by value, smallest first
Out[122]:
2   -3
3    2
0    4
1    7
dtype: int64
In [123]:
frame = pd.DataFrame({'b': [4, 7, -3, 8], 'a': [0, 1, 2, 3]})
frame
Out[123]:
b a
0 4 0
1 7 1
2 -3 2
3 8 3
In [124]:
frame.sort_values(by=['b','a']) # sort by b first, then by a to break ties
Out[124]:
b a
2 -3 2
0 4 0
1 7 1
3 8 3
In [125]:
frame.sort_values(by=['a','b']) # sort by a first, then by b to break ties
Out[125]:
b a
0 4 0
1 7 1
2 -3 2
3 8 3
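
In the frame above neither column contains repeated values, so the second sort key never comes into play. A minimal sketch with hypothetical data where column `a` has ties, so the effect of the secondary key (and of a per-key `ascending` list) is visible:

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 0, 1, 0], "b": [4, 7, -3, 8]})

by_a_then_b = frame.sort_values(by=["a", "b"])                     # ties in a ordered by b
mixed = frame.sort_values(by=["a", "b"], ascending=[True, False])  # a ascending, b descending

print(list(by_a_then_b.index))
print(list(mixed.index))
```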
In [126]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4]) 
obj
Out[126]:
0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64
In [127]:
obj.rank() # rank of each value; ties share the average rank
Out[127]:
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64
In [128]:
obj.rank(method='first') # ties are ranked in order of appearance (no shared ranks)
Out[128]:
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64
In [129]:
obj
Out[129]:
0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64
In [130]:
obj.rank(ascending=False, method='max') # e.g. positions 0 and 2 tie for first place; the default method would rank both 1.5, while method='max' ranks both 2
Out[130]:
0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
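
The tie-breaking options are easier to compare side by side. A minimal sketch that ranks the same Series with several `method` values (ascending order, the default) and collects the results in one table:

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column per tie-breaking method.
ranks = pd.DataFrame({m: obj.rank(method=m)
                      for m in ["average", "min", "max", "first", "dense"]})
print(ranks)
```

`dense` is like `min`, except that the rank always increases by exactly 1 between groups of tied values.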
In [131]:
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                      'c': [-2, 5, 8, -2.5]})
frame
Out[131]:
b a c
0 4.3 0 -2.0
1 7.0 1 5.0
2 -3.0 0 8.0
3 2.0 1 -2.5
In [132]:
frame.rank(axis='columns') # rank the values within each row, across the columns
Out[132]:
b a c
0 3.0 2.0 1.0
1 3.0 1.0 2.0
2 1.0 2.0 3.0
3 3.0 2.0 1.0
In [133]:
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj
Out[133]:
a    0
a    1
b    2
b    3
c    4
dtype: int64
In [134]:
obj.index.is_unique # False because the labels a and b are repeated
Out[134]:
False
In [135]:
obj['a']
Out[135]:
a    0
a    1
dtype: int64
In [136]:
obj['c']
Out[136]:
4
In [137]:
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df
Out[137]:
0 1 2
a 0.817822 1.620150 0.502513
a 0.954089 0.212788 -0.037256
b 0.996862 -1.087917 0.357842
b 1.299607 -0.104178 -2.045602
In [138]:
df.loc['b']
Out[138]:
0 1 2
b 0.996862 -1.087917 0.357842
b 1.299607 -0.104178 -2.045602
In [139]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
df
Out[139]:
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
In [140]:
df.sum()
Out[140]:
one    9.25
two   -5.80
dtype: float64
In [141]:
df.sum(axis='columns')
Out[141]:
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
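
Note that row `c`, which contains only NaN, sums to 0.00 in the output above because NaN values are skipped. A minimal sketch, on a smaller hypothetical frame, of the `min_count` parameter of `sum`, which returns NaN when fewer than that many non-missing values are present:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [np.nan, np.nan]],
                  index=["a", "c"], columns=["one", "two"])

default_sum = df.sum(axis="columns")               # the all-NaN row sums to 0.0
strict_sum = df.sum(axis="columns", min_count=1)   # require at least one non-NaN value

print(default_sum)
print(strict_sum)
```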
In [142]:
df.mean(axis='columns', skipna=False) # skipna controls whether NaN values are skipped
Out[142]:
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
In [143]:
df.mean(axis='columns',skipna=True) # skipna=True is the default and can be omitted
Out[143]:
a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64
In [144]:
df.idxmax()
Out[144]:
one    b
two    d
dtype: object
In [145]:
df.idxmin()
Out[145]:
one    d
two    b
dtype: object
In [146]:
df
Out[146]:
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
In [148]:
df.cumsum() # cumulative sum
Out[148]:
one two
a 1.40 NaN
b 8.50 -4.5
c NaN NaN
d 9.25 -5.8
In [149]:
df.describe() 
Out[149]:
one two
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% 1.075000 -3.700000
50% 1.400000 -2.900000
75% 4.250000 -2.100000
max 7.100000 -1.300000

Unique Values, Value Counts, and Membership

In [150]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
obj
Out[150]:
0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object
In [151]:
uniques= obj.unique()
uniques
Out[151]:
array(['c', 'a', 'd', 'b'], dtype=object)
In [152]:
obj.value_counts()
Out[152]:
a    3
c    3
b    2
d    1
dtype: int64
In [153]:
pd.value_counts(obj.values,sort=False)
Out[153]:
c    3
b    2
a    3
d    1
dtype: int64
In [154]:
pd.value_counts(obj.values,sort=True)
Out[154]:
a    3
c    3
b    2
d    1
dtype: int64
In [155]:
obj
Out[155]:
0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object
In [156]:
mask=obj.isin(['b','c'])
mask
Out[156]:
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
In [157]:
to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
to_match
Out[157]:
0    c
1    a
2    b
3    b
4    c
5    a
dtype: object
In [158]:
unique_vals = pd.Series(['c', 'b', 'a'])
unique_vals
Out[158]:
0    c
1    b
2    a
dtype: object
In [159]:
pd.Index(unique_vals).get_indexer(to_match) # maps c=0, b=1, a=2 from unique_vals and applies that mapping to to_match
Out[159]:
array([0, 2, 1, 1, 0, 2], dtype=int64)
In [166]:
data = pd.DataFrame({'Qu1': [5, 1, 4, 5, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]}) 
In [161]:
data
Out[161]:
Qu1 Qu2 Qu3
0 5 2 1
1 1 3 5
2 4 1 2
3 5 2 4
4 4 3 4
In [162]:
data['Qu1'].value_counts()
Out[162]:
5    2
4    2
1    1
Name: Qu1, dtype: int64
In [163]:
data['Qu1'].value_counts()[:1]
Out[163]:
5    2
Name: Qu1, dtype: int64
In [164]:
data['Qu1'].value_counts()[1:]
Out[164]:
4    2
1    1
Name: Qu1, dtype: int64
In [167]:
result = data.apply(pd.Series.value_counts).fillna(0) # count each value per column; fillna(0) replaces missing counts with 0
result
Out[167]:
Qu1 Qu2 Qu3
1 1.0 1.0 1.0
2 0.0 2.0 1.0
3 0.0 2.0 0.0
4 2.0 0.0 2.0
5 2.0 0.0 1.0
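  • isin : returns a boolean array indicating whether each Series element is contained in the passed sequence of values
  • match : computes, for each value, its integer index into an array of distinct values (the example above does this with pd.Index(...).get_indexer)
  • unique : returns an array of the distinct values in a Series, with duplicates removed
  • value_counts : returns a Series whose index is the distinct values and whose values are their frequencies (sorted by frequency in descending order)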
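
The methods summarized above can be combined in a few lines. A minimal sketch on the same letters used earlier in this section (the variable names are chosen here for illustration):

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

selected = obj[obj.isin(["b", "c"])]   # boolean mask used as a filter
distinct = pd.Index(obj.unique())      # distinct values in order of first appearance
codes = distinct.get_indexer(obj)      # integer position of each value in `distinct`
counts = obj.value_counts()            # frequencies, largest first

print(list(selected.index))
print(list(distinct), list(codes))
print(counts["a"], counts["d"])
```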
