[Data Visualization] Seaborn

코딩하는 해맑은 거북이

#CJE 2023. 3. 27.

This post was written after taking the 'Boostcamp AI Tech - Data Visualization (안수빈)' lecture.

This post covers the following topics.
🍀 Seaborn
🚩 Student Score Dataset
🔷 Countplot
🔷 Categorical API
  ⚪ Box Plot
  ⚪ Violin Plot
  ⚪ boxenplot, swarmplot, stripplot
🔷 Distribution API
🔹 Univariate Distribution
  ⚪ histplot
  ⚪ kdeplot
  ⚪ ecdfplot
  ⚪ rugplot
🔹 Bivariate Distribution
  ⚪ histplot
  ⚪ kdeplot
🔷 Relational & Regression API
  ⚪ Scatter Plot
  ⚪ Line Plot (🚩flights)
  ⚪ Regplot
🔷 Matrix API
  ⚪ Heatmap
🚩 Heart Disease
🍀 Seaborn Advanced
🚩 Student Score Dataset & Iris Species
🔷 Joint Plot
🔷 Pair Plot
🔷 Facet Grid
  ⚪ catplot : Categorical
  ⚪ displot : Distribution
  ⚪ relplot : Relational
  ⚪ lmplot : Regression

🍀 Seaborn

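: A statistical visualization library built on Matplotlib

- Statistical information : composition, distribution, relationships, etc.

- Because it is built on Matplotlib, plots can be customized with Matplotlib

- Known for its simple syntax and clean design

- Seaborn groups its API by the purpose and method of the visualization

Categorical API - basic statistics of the data
Distribution API - distribution plots for both categorical and continuous data
Relational API - relationships between variables
Regression API - regression analysis
Multiples API
Theme API
Matrix API - heatmaps

* These exercises use seaborn version 0.11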

🚩 Student Score Dataset

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

student = pd.read_csv('./StudentsPerformance.csv')
student.head()

🔷 Countplot

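The representative plot of seaborn's Categorical API: it counts the observations in each category and draws the counts as a bar chart.

x : column to put on the x-axis
y : column to put on the y-axis
data : the DataFrame to plot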

sns.countplot(x='race/ethnicity', data=student)

sns.countplot(y='race/ethnicity',data=student)

order : sets the order in which the categories are drawn

sns.countplot(x='race/ethnicity',data=student,
             order=sorted(student['race/ethnicity'].unique()))

hue : column used to split each category by color (adds a legend)
- hue_order : order of the hue levels in the legend

sns.countplot(x='race/ethnicity',data=student,
              hue='gender', hue_order=['male', 'female'],
              order=sorted(student['race/ethnicity'].unique()))

palette : changes the color palette

sns.countplot(x='race/ethnicity',data=student,
              hue='gender', palette='Set2')

color : a single base color; the groups given by hue are drawn as a gradient of that color

sns.countplot(x='gender',data=student,
              hue='race/ethnicity', color='red')

sns.countplot(x='gender',data=student,
              hue='race/ethnicity', 
              hue_order=sorted(student['race/ethnicity'].unique()) , color='red')
              # hue_order : show the hue levels in sorted order

saturation : sets the color saturation

sns.countplot(x='gender',data=student,
              hue='race/ethnicity', 
              hue_order=sorted(student['race/ethnicity'].unique()))

sns.countplot(x='gender',data=student,
              hue='race/ethnicity', 
              hue_order=sorted(student['race/ethnicity'].unique()),
              saturation=0.3    # lower the color saturation
             )

ax : the Axes to draw on can be passed directly

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

sns.countplot(x='race/ethnicity',data=student,
              hue='gender', 
              ax=axes[0]
             )

sns.countplot(x='gender',data=student,
              hue='race/ethnicity', 
              hue_order=sorted(student['race/ethnicity'].unique()), 
              ax=axes[1]
             )

plt.show()

🔷 Categorical API

⚪ Box Plot

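Quartiles : the values that split the sorted data into four equal parts
- min : the smallest value that is greater than or equal to Q1 - 1.5 * IQR (end of the lower whisker)
- 25% (lower quartile, Q1)
- 50% (median)
- 75% (upper quartile, Q3)
- max : the largest value that is less than or equal to Q3 + 1.5 * IQR (end of the upper whisker)
interquartile range (IQR) : the range from 25% to 75%, i.e. Q3 - Q1
outlier
- whisker : the line showing the range outside the box
- outlier : a value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR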
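
The numbers a box plot draws can be worked out by hand. The sketch below uses only the standard library and a made-up list of scores; `box_stats` is a hypothetical helper, not part of seaborn, and the quartile interpolation is one common convention, so plotted values may differ slightly.

```python
import statistics

def box_stats(values):
    """Return the quartiles, whisker ends and outliers of a list of numbers."""
    data = sorted(values)
    # method="inclusive" interpolates linearly between neighbouring data points
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": min(inside),    # smallest value inside the fences
        "whisker_high": max(inside),   # largest value inside the fences
        "outliers": outliers,
    }

scores = [8, 45, 52, 58, 61, 66, 70, 74, 79, 85, 100]  # made-up scores
stats = box_stats(scores)
print(stats)
```

Here the score 8 falls below the lower fence, so it is reported as an outlier and the lower whisker stops at 45.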

fig, ax = plt.subplots(1,1, figsize=(12, 5))
sns.boxplot(x='math score', data=student, ax=ax)
plt.show()

fig, ax = plt.subplots(1,1, figsize=(10, 5))
sns.boxplot(x='race/ethnicity', y='math score', data=student, 
            order=sorted(student['race/ethnicity'].unique()),
            ax=ax)
plt.show()

fig, ax = plt.subplots(1,1, figsize=(10, 5))

sns.boxplot(x='race/ethnicity', y='math score', data=student,
            hue='gender', # distribution by gender
            order=sorted(student['race/ethnicity'].unique()),
            ax=ax)

plt.show()

width : width of the box
linewidth : width of the lines in the box plot
fliersize : marker size of the outliers

fig, ax = plt.subplots(1,1, figsize=(10, 5))

sns.boxplot(x='race/ethnicity', y='math score', data=student,
            hue='gender', 
            order=sorted(student['race/ethnicity'].unique()),
            width=0.3,    # width of the box
            linewidth=2,  # width of the box plot lines
            fliersize=10, # marker size of the outliers
            ax=ax)

plt.show()

⚪ Violin Plot

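A box plot does not show the actual shape of the distribution.

A violin plot is one of the options better suited to showing that shape.

* White dot in the center : 50% (median) / thick bar : IQR / thin black line : the range from Q1 - 1.5 * IQR to Q3 + 1.5 * IQR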

fig, ax = plt.subplots(1,1, figsize=(12, 5))
sns.violinplot(x='math score', data=student, ax=ax)
plt.show()

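- Distortions of the violin plot

The data is not continuous (it only looks continuous because it is drawn as a Kernel Density Estimate).
The continuous representation also loses information and introduces error.
The curve extends into ranges where there is no data.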
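
To see why the curve spills past the data, here is a minimal hand-written Gaussian kernel density estimate. It is a sketch with made-up scores and a hypothetical `gaussian_kde` helper, not seaborn's implementation; the fixed bandwidth of 10 is an assumption chosen for illustration.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian KDE of `samples` at x."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Every sample contributes a small bell curve centred on itself.
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

    return density

scores = [0, 35, 50, 62, 70, 75, 88, 100]  # made-up scores bounded by 0 and 100
density = gaussian_kde(scores, bandwidth=10)

below_min = density(-5)    # density at a score that cannot exist
above_max = density(105)
print(below_min, above_max)
```

Both values are positive even though no score can be below 0 or above 100, which is the distortion described above.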

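- Ways to reduce the distortion and show more information

bw : how finely the shape of the distribution is drawn (kernel bandwidth)
- 'scott', 'silverman', float
cut : how far the density extends past the extreme data points (0 stops at the data range)
- float
inner : how the interior is drawn
- "box", "quartile", "point", "stick", None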
fig, ax = plt.subplots(1,1, figsize=(12, 5))
sns.violinplot(x='math score', data=student, ax=ax,
               bw=0.1,    # small bandwidth: shows finer detail
               cut=0,     # do not extend past the data range
               inner='quartile' # draw the 25%, 50% and 75% quartiles as dashed lines
              )
plt.show()

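scale : how the size of each violin is scaled
- "area" : every violin has the same area
- "count" : the width of each violin is scaled by its number of observations
- "width" : every violin has the same maximum width
split : with a two-level hue, draws each level as one half of the violin so they can be compared directly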

fig, ax = plt.subplots(1,1, figsize=(12, 7))
sns.violinplot(x='race/ethnicity', y='math score', data=student, ax=ax,
               order=sorted(student['race/ethnicity'].unique()),
               scale='count'
              )
plt.show()

fig, ax = plt.subplots(1,1, figsize=(12, 7))
sns.violinplot(x='race/ethnicity', y='math score', data=student, ax=ax,
               order=sorted(student['race/ethnicity'].unique()),
               hue='gender',
               split=True,  # draw each hue level as half a violin
               bw=0.2, cut=0
              )
plt.show()

⚪ boxenplot, swarmplot, stripplot

fig, axes = plt.subplots(3,1, figsize=(12, 21))
# boxenplot: looks like a mix of a box plot and a violin plot
sns.boxenplot(x='race/ethnicity', y='math score', data=student, ax=axes[0],
               order=sorted(student['race/ethnicity'].unique()))

# swarmplot: draws every observation as a point
sns.swarmplot(x='race/ethnicity', y='math score', data=student, ax=axes[1],
               order=sorted(student['race/ethnicity'].unique()))

# stripplot: points scattered with random jitter along a strip
sns.stripplot(x='race/ethnicity', y='math score', data=student, ax=axes[2],
               order=sorted(student['race/ethnicity'].unique()))
plt.show()

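🔷 Distribution API - univariate and bivariate distributions

Distribution plots for looking at both categorical and continuous data

🔹 Univariate Distribution

histplot : histogram
kdeplot : Kernel Density Estimate
ecdfplot : empirical cumulative distribution function
rugplot : marks each observation as a tick along the axis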

fig, axes = plt.subplots(2,2, figsize=(12, 10))
axes = axes.flatten()

# shows the distribution with bars
sns.histplot(x='math score', data=student, ax=axes[0])

# shows the distribution as a smooth curve
sns.kdeplot(x='math score', data=student, ax=axes[1])

# shows the cumulative proportion of observations
sns.ecdfplot(x='math score', data=student, ax=axes[2])

# marks each observation with a tick: useful for spotting gaps and dense regions
sns.rugplot(x='math score', data=student, ax=axes[3])


plt.show()

⚪ histplot

binwidth : width of each bin
bins : total number of bins (default: 'auto', chosen from the data)

fig, ax = plt.subplots(figsize=(12, 7))

sns.histplot(x='math score', data=student, ax=ax,
             binwidth=50  # width of each bin
            )

plt.show()

fig, ax = plt.subplots(figsize=(12, 7))

sns.histplot(x='math score', data=student, ax=ax,
             bins=100       # total number of bins
            )

plt.show()

fig, ax = plt.subplots(figsize=(12, 7))

sns.histplot(x='math score', data=student, ax=ax)	# bins defaults to 'auto'

plt.show()
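
The difference between bins and binwidth is easiest to see from the bin edges they produce. The helpers below are hypothetical and written in plain Python for illustration; they assume scores ranging from 0 to 100 and are a sketch of the idea, not seaborn's own code.

```python
def edges_by_count(lo, hi, bins):
    """Split [lo, hi] into `bins` equal-width bins (what bins=N asks for)."""
    width = (hi - lo) / bins
    return [lo + i * width for i in range(bins + 1)]

def edges_by_width(lo, hi, binwidth):
    """Step from lo in increments of `binwidth` until hi is covered (binwidth=W)."""
    edges = [lo]
    while edges[-1] < hi:
        edges.append(edges[-1] + binwidth)
    return edges

four_bins = edges_by_count(0, 100, bins=4)
wide_bins = edges_by_width(0, 100, binwidth=50)
uneven_bins = edges_by_width(0, 100, binwidth=30)
print(four_bins, wide_bins, uneven_bins)
```

With binwidth=50 a 0 to 100 score range collapses into only two bars, which is why that plot looks so coarse.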

element : "bars" : default | "step" | "poly"

fig, ax = plt.subplots(figsize=(12, 7))

sns.histplot(x='math score', data=student, ax=ax,
             element='step' # step, poly
            )

plt.show()

fig, ax = plt.subplots(figsize=(12, 7))

sns.histplot(x='math score', data=student, ax=ax,
             element='poly' # step, poly
            )

plt.show()

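multiple : how the hue groups are drawn together
- "layer" : default, the groups are overlaid
- "dodge" : bars of each group side by side
- "stack" : stacked histogram
- "fill" : normalized so that each bin fills 0 to 1 on the y-axis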

fig, ax = plt.subplots(figsize=(12, 7))

sns.histplot(x='math score', data=student, ax=ax,
             hue='gender', 
             multiple='stack', # layer, dodge, stack, fill
            )

plt.show()

⚪ kdeplot

fill = True : fills the area under the curve

fig, ax = plt.subplots(figsize=(12, 7))
sns.kdeplot(x='math score', data=student, ax=ax,
           fill=True)
plt.show()

bw_method : how finely the shape of the distribution is drawn (kernel bandwidth)

fig, ax = plt.subplots(figsize=(12, 7))
sns.kdeplot(x='math score', data=student, ax=ax,
           fill=True, bw_method=0.05)
plt.show()

hue : column used to split the data by color (adds a legend)
- hue_order : order of the hue levels in the legend

fig, ax = plt.subplots(figsize=(12, 7))
sns.kdeplot(x='math score', data=student, ax=ax,
            fill=True, 
            hue='race/ethnicity', 
            hue_order=sorted(student['race/ethnicity'].unique()))
plt.show()

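multiple : how the hue groups are drawn together
- "layer" : default
- "stack" : stacked densities
- "fill" : normalized to 0 to 1 on the y-axis → this can distort the information, so it is best avoided.

cumulative = True : draws the cumulative distribution
cut : how far the curve extends past the extreme data points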

fig, ax = plt.subplots(figsize=(12, 7))
sns.kdeplot(x='math score', data=student, ax=ax,
            fill=True, 
            hue='race/ethnicity', 
            hue_order=sorted(student['race/ethnicity'].unique()),
            multiple="layer", # layer, stack, fill
            cumulative=True,  # draw the cumulative distribution
            cut=0
           )
plt.show()

⚪ ecdfplot

fig, ax = plt.subplots(figsize=(12, 7))
sns.ecdfplot(x='math score', data=student, ax=ax,
             hue='gender',
             stat='count', # default is 'proportion'
            #  complementary=True   # start at the top and decrease (the default, False, starts at 0 and increases)
            )
plt.show()
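
What stat and complementary change can be checked with a small hand-written ECDF. This is a sketch with made-up scores; `ecdf` is a hypothetical helper that mimics the idea, not seaborn's function.

```python
def ecdf(values, stat="proportion", complementary=False):
    """Return (x, y) pairs of an empirical CDF, one pair per observation."""
    data = sorted(values)
    n = len(data)
    points = []
    for i, x in enumerate(data, start=1):
        # i is the rank of x from the bottom; n - i is how many observations remain above it
        count = n - i if complementary else i
        y = count if stat == "count" else count / n
        points.append((x, y))
    return points

scores = [40, 55, 55, 70, 90]  # made-up scores
counts = ecdf(scores, stat="count")
remaining = ecdf(scores, stat="count", complementary=True)
print(counts)
print(remaining)
```

The regular curve climbs from 1 to the total number of observations, while the complementary curve counts down to 0.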

⚪ rugplot

fig, ax = plt.subplots(figsize=(12, 7))
sns.rugplot(x='math score', data=student, ax=ax)
plt.show()

🔹 Bivariate Distribution

Looks at the joint distribution of two variables at the same time

histplot : histogram
kdeplot : Kernel Density Estimate

⚪ histplot

cbar : whether to show a colorbar
bins=(x, y) : number of bins along the x-axis and along the y-axis

fig, axes = plt.subplots(1,2, figsize=(12, 7))
for ax in axes:
    ax.set_aspect(1)  # equal aspect ratio on both panels

axes[0].scatter(student['math score'], student['reading score'], alpha=0.2)

# the 2D histogram is easier to read than the scatter plot
sns.histplot(x='math score', y='reading score', 
             data=student, ax=axes[1],
#              color='orange',
             cbar=False,
             bins=(10, 20), # number of bins per axis (x, y)
            )

plt.show()

⚪ kdeplot

fill = True : fills the area inside the contours
bw_method : how finely the shape of the distribution is drawn

fig, ax = plt.subplots(figsize=(7, 7))
ax.set_aspect(1)

sns.kdeplot(x='math score', y='reading score', 
             data=student, ax=ax,
            fill=True,
            # bw_method=0.1
            )

plt.show()

🔷 Relational & Regression API
⚪ Scatter Plot

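style, style_order, markers : distinguishes the categories given by style using the marker shapes defined in markers
hue, hue_order : colors the categories given by hue and lists them in the legend in the order defined in hue_order
size, size_order : sizes the markers by the variable given by size, in the order defined in size_order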

fig, ax = plt.subplots(figsize=(7, 7))
sns.scatterplot(x='math score', y='reading score', data=student,
               style='gender', markers={'male':'s', 'female':'o'},
                hue='race/ethnicity', 
                size='writing score',
               )
plt.show()

⚪ Line Plot

flights = sns.load_dataset("flights")
flights.head()

flights_wide = flights.pivot(index="year", columns="month", values="passengers")
flights_wide.head()

fig, ax = plt.subplots(1, 1,figsize=(12, 7))
sns.lineplot(x='year', y='Jan',data=flights_wide, ax=ax)

fig, ax = plt.subplots(1, 1,figsize=(12, 7))
sns.lineplot(data=flights_wide, ax=ax)
plt.show()

# repeated x values are aggregated automatically: the mean is drawn with a 95% confidence interval band
fig, ax = plt.subplots(1, 1, figsize=(12, 7))
sns.lineplot(data=flights, x="year", y="passengers", ax=ax)
plt.show()
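
Because each year has twelve monthly rows, lineplot has to reduce them to one point per year. The sketch below shows only that averaging step on made-up rows, using the standard library; the confidence band around the line is not reproduced here.

```python
from collections import defaultdict
from statistics import mean

# Made-up (year, passengers) rows: several observations share the same year.
rows = [(1949, 110), (1949, 120), (1949, 130),
        (1950, 120), (1950, 130), (1950, 140)]

by_year = defaultdict(list)
for year, passengers in rows:
    by_year[year].append(passengers)

# One point per x value: the mean of all y values observed at that x.
line_points = {year: mean(values) for year, values in sorted(by_year.items())}
print(line_points)
```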

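style, style_order, markers : distinguishes the categories given by style using the marker shapes defined in markers
hue, hue_order : colors the categories given by hue and lists them in the legend in the order defined in hue_order
dashes = False : turns off the dashed line styles, so every line is solid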

fig, ax = plt.subplots(1, 1, figsize=(12, 7))
sns.lineplot(data=flights, x="year", y="passengers", hue='month', 
             style='month', markers=True, dashes=False,
             ax=ax)
plt.show()

⚪ Regplot

A scatter plot with a regression line added

fig, ax = plt.subplots(figsize=(7, 7))
sns.regplot(x='math score', y='reading score', data=student)
plt.show()

x_estimator=np.mean : for each unique x value, shows a single point at the mean of its y values instead of every observation.

fig, ax = plt.subplots(figsize=(7, 7))
sns.regplot(x='math score', y='reading score', data=student,
            x_estimator=np.mean)
plt.show()

x_bins : groups x into this many bins, which sets how many points are shown

fig, ax = plt.subplots(figsize=(7, 7))
sns.regplot(x='math score', y='reading score', data=student,
            x_estimator=np.mean, x_bins=20)
plt.show()

order : fits a polynomial regression of the given order instead of a straight line

fig, ax = plt.subplots(figsize=(7, 7))
sns.regplot(x='math score', y='reading score', data=student,
            order=2)   # fit a quadratic curve
plt.show()
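
A polynomial fit of order 2 can be reproduced outside the plot with NumPy. This is a sketch on made-up points that lie exactly on a known parabola, so the recovered coefficients are easy to check; it does not use the student dataset.

```python
import numpy as np

# Made-up points that lie exactly on y = 2x^2 + 3x + 1.
x = np.arange(6, dtype=float)
y = 2 * x**2 + 3 * x + 1

# Least-squares fit of a degree-2 polynomial; coefficients come highest power first.
coeffs = np.polyfit(x, y, deg=2)
fitted = np.polyval(coeffs, x)  # the curve evaluated at the original x values
print(coeffs)
```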

logx = True : fits a regression of the form y ~ log(x)

fig, ax = plt.subplots(figsize=(7, 7))
sns.regplot(x='reading score', y='writing score', data=student,
            logx=True)   # fit y ~ log(x)
plt.show()

🔷 Matrix API
⚪ Heatmap

Often used to visualize correlation.

student.select_dtypes('number').corr()  # correlation between the numeric columns only

🚩 Heart Disease

heart = pd.read_csv('./heart.csv')  # dataset for checking whether a patient has heart disease
heart.head()

heart.corr()

fig, ax = plt.subplots(1,1 ,figsize=(7, 6))
sns.heatmap(heart.corr(), ax=ax)
plt.show()

vmin, vmax : set the range of the colorbar

# correlation coefficients range from -1 to 1, so set vmin=-1, vmax=1
fig, ax = plt.subplots(1,1 ,figsize=(7, 6))
sns.heatmap(heart.corr(), ax=ax,
           vmin=-1, vmax=1)
plt.show()
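
The -1 to 1 range follows from how the Pearson correlation coefficient is defined. The helper below is a hand-written sketch on made-up lists (`pearson` is a hypothetical name), meant only to show the two extremes.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

rising = pearson([1, 2, 3, 4], [2, 4, 6, 8])    # perfectly increasing together
falling = pearson([1, 2, 3, 4], [8, 6, 4, 2])   # one rises as the other falls
print(rising, falling)
```

A perfectly increasing pair gives a value at the top of the range and a perfectly decreasing pair gives a value at the bottom, so fixing vmin=-1 and vmax=1 keeps the colors comparable between heatmaps.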

center : the value at which the colormap is centered; with center=0, positive and negative values get opposite colors on either side of 0

fig, ax = plt.subplots(1,1 ,figsize=(7, 6))
sns.heatmap(heart.corr(), ax=ax,
           vmin=-1, vmax=1, center=0)
plt.show()

cmap : sets the colormap

fig, ax = plt.subplots(1,1 ,figsize=(10, 9))
sns.heatmap(heart.corr(), ax=ax,
           vmin=-1, vmax=1, center=0,
            cmap='coolwarm')
plt.show()

annot : whether to write the value in each cell
fmt : format of the annotations, "d" : integer | ".2f" : float with two decimal places

fig, ax = plt.subplots(1,1 ,figsize=(10, 9))
sns.heatmap(heart.corr(), ax=ax,
           vmin=-1, vmax=1, center=0,
            cmap='coolwarm',
            annot=True, fmt='.2f' # format : d for integers / .2f for floats with two decimal places
           )
plt.show()

linewidth : width of the lines between cells
square : whether to make every cell a square

fig, ax = plt.subplots(1,1 ,figsize=(12, 9))
sns.heatmap(heart.corr(), ax=ax,
           vmin=-1, vmax=1, center=0,
            cmap='coolwarm',
            annot=True, fmt='.2f',
            linewidth=0.1, square=True    # make every cell a square
           )
plt.show()

Use np.triu_indices_from to mask the upper triangle so that only the lower triangle is shown

fig, ax = plt.subplots(1,1 ,figsize=(10, 9))

mask = np.zeros_like(heart.corr(), dtype=bool)
mask[np.triu_indices_from(mask)] = True   # hide the upper triangle (including the diagonal)

sns.heatmap(heart.corr(), ax=ax,
           vmin=-1, vmax=1, center=0,
            cmap='coolwarm',
            annot=True, fmt='.2f', 
            linewidth=0.1, square=True, cbar=False,
            mask=mask
           )
plt.show()
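
What the mask looks like is easier to see on a small matrix. The sketch below builds it for a made-up 3x3 correlation matrix; cells where the mask is True are the ones the heatmap leaves blank.

```python
import numpy as np

# Made-up symmetric correlation matrix for three features.
corr = np.array([[ 1.0, 0.8, -0.3],
                 [ 0.8, 1.0,  0.1],
                 [-0.3, 0.1,  1.0]])

mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True   # upper triangle, diagonal included

# The cells that remain visible: strictly below the diagonal.
visible = corr[~mask]
print(mask)
print(visible)
```

Passing k=1 to np.triu_indices_from would leave the diagonal unmasked, if the row of 1.00 values should stay visible.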

🍀 Seaborn Advanced
🚩 Student Score Dataset & Iris Species

import numpy as np 
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

student = pd.read_csv('./StudentsPerformance.csv')
iris = pd.read_csv('./Iris.csv')

🔷 Joint Plot

A plot that shows the joint distribution of two features together with the distribution of each one

height : size of the figure

sns.jointplot(x='math score', y='reading score',data=student,
             height=7)

hue : column used to split the data by color (adds a legend)

sns.jointplot(x='math score', y='reading score',data=student,
              hue='gender')

kind : the distribution can be drawn in several ways → "scatter" | "kde" | "hist" | "hex" | "reg" | "resid"
- "hex" | "reg" | "resid" cannot be used together with hue.
fill : whether to fill the inside of the plot

sns.jointplot(x='math score', y='reading score', data=student,
              hue='gender', 
              kind='kde', # { “scatter” | “kde” | “hist” | “hex” | “reg” | “resid” }, 
              fill=True)

sns.jointplot(x='math score', y='reading score', data=student,
              kind='reg', # { “scatter” | “kde” | “hist” | “hex” | “reg” | “resid” }, 
             )

sns.jointplot(x='math score', y='reading score', data=student,
              kind='hex', # { “scatter” | “kde” | “hist” | “hex” | “reg” | “resid” }, 
             )

🔷 Pair Plot

iris.head()

sns.pairplot(data=iris)

hue : column used to split the data by color (adds a legend)

sns.pairplot(data=iris, hue='Species')

kind : sets the plot type of the off-diagonal subplots : 'scatter' | 'kde' | 'hist' | 'reg'
diag_kind : sets the plot type of the diagonal subplots only : 'auto' | 'hist' | 'kde' | None

sns.pairplot(data=iris, hue='Species', kind='hist')

sns.pairplot(data=iris, hue='Species', diag_kind='hist')

corner=True : shows only the lower triangle of the grid

sns.pairplot(data=iris, hue='Species', corner=True) # show only the lower triangle

🔷 Facet Grid

A pair plot looks at the relationships between features; a Facet Grid also lets you look at the relationships between the categories of one feature and the categories of another.

⚪ catplot : Categorical

Categorical scatterplots:
- stripplot() (with kind="strip"; the default)
- swarmplot() (with kind="swarm")
Categorical distribution plots:
- boxplot() (with kind="box")
- violinplot() (with kind="violin")
- boxenplot() (with kind="boxen")
Categorical estimate plots:
- pointplot() (with kind="point")
- barplot() (with kind="bar")
- countplot() (with kind="count")

student.head()

sns.catplot(x="race/ethnicity", y="math score", hue="gender", data=student) # kind default:stripplot

The number of subplots is determined by the categories assigned to the rows and columns

sns.catplot(x="race/ethnicity", y="math score", hue="gender", data=student,
            kind='box', col='lunch', row='test preparation course')

⚪ displot : Distribution

histplot() (with kind="hist"; the default)
kdeplot() (with kind="kde")
ecdfplot() (with kind="ecdf"; univariate-only)

sns.displot(x="math score", hue="gender", data=student,
           col='race/ethnicity', # kind='kde', fill=True
            col_order=sorted(student['race/ethnicity'].unique())  # columns in sorted order
           )

⚪ relplot : Relational

scatterplot() (with kind="scatter"; the default)
lineplot() (with kind="line")

sns.relplot(x="math score", y='reading score', hue="gender", data=student,
           col='lunch')

⚪ lmplot : Regression

regplot()

sns.lmplot(x="math score", y='reading score', hue="gender", data=student)
