[Data Viz] How to Use Matplotlib: Bar Plot

*This post summarizes and reorganizes a lecture by Master Subin An (안수빈) from boostcamp AI Tech (3rd cohort).

Data Visualization

2-2. Bar Plot

1) What Is a Bar Plot?

Bar Plot

- A chart/graph that represents data values with rectangular bars

- Goes by several names, such as bar chart and bar graph

- Well suited to comparing numeric values across categories

2) Drawing a Bar Plot

Basic Bar Plot

bar() : basic (vertical) bar plot / categories on the x-axis, values on the y-axis
barh() : horizontal bar plot / categories on the y-axis, values on the x-axis / well suited to many categories

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 7))

x = list('ABCDE')
y = np.array([1, 2, 3, 4, 5])

axes[0].bar(x, y)
axes[1].barh(x, y)

plt.show()

* Setting bar colors: the color can be changed for all bars at once or for each bar individually.

fig, axes = plt.subplots(1, 2, figsize=(12, 7))

x = list('ABCDE')
y = np.array([1, 2, 3, 4, 5])

clist = ['blue', 'gray', 'gray', 'gray', 'red']
color = 'green'
axes[0].bar(x, y, color=clist)
axes[1].barh(x, y, color=color)

plt.show()
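
When only one bar should stand out, the per-bar color list can be derived from the data instead of being typed by hand. The following is a minimal sketch and not part of the original lecture; the helper name `highlight_colors` and the two color choices are illustrative assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless; drop this in a notebook
import matplotlib.pyplot as plt


def highlight_colors(values, highlight="tomato", base="lightgray"):
    """Return one color per value, using `highlight` for the maximum value(s)."""
    top = max(values)
    return [highlight if v == top else base for v in values]


x = list("ABCDE")
y = [1, 2, 3, 4, 5]

fig, ax = plt.subplots(figsize=(6, 4))
colors = highlight_colors(y)       # hypothetical helper defined above
bars = ax.bar(x, y, color=colors)  # passing a list colors each bar individually
```

Passing a single string such as `color='green'` instead of the list would color every bar the same way, as in the right-hand plot above.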

3) Bar Plots for Multiple Groups

Exploring the Data

* Data: the Student Score Dataset

Previewing the data

student = pd.read_csv('./StudentsPerformance.csv')
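student.sample(5) # sample is preferred over head: head shows the first 5 rows, sample shows 5 rows drawn at random from the whole table

Checking for missing values and dtypes

student.info() # check for missing values and dtypes

Looking at summary statistics

student.describe(include='all') # summary statistics for every column

Visualizing information by group: the race/ethnicity distribution by gender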

group = student.groupby('gender')['race/ethnicity'].value_counts().sort_index()
display(group)
print(student['gender'].value_counts())

Multiple Bar Plot

- Drawing it the basic way

fig, axes = plt.subplots(1, 2, figsize=(15, 7))
axes[0].bar(group['male'].index, group['male'], color='royalblue')
axes[1].bar(group['female'].index, group['female'], color='tomato')
plt.show()

- To compare the two groups properly, the y-axis ranges need to match

- The 'sharey' parameter makes the subplots share a y-axis range

fig, axes = plt.subplots(1, 2, figsize=(15, 7), sharey=True) # share the y-axis
axes[0].bar(group['male'].index, group['male'], color='royalblue')
axes[1].bar(group['female'].index, group['female'], color='tomato')
plt.show()

- The y-axis range can also be set on each axes individually; a loop is the recommended way to do this

fig, axes = plt.subplots(1, 2, figsize=(15, 7))
axes[0].bar(group['male'].index, group['male'], color='royalblue')
axes[1].bar(group['female'].index, group['female'], color='tomato')

for ax in axes:  # set the y-axis range of every axes to 0-200
    ax.set_ylim(0, 200)
    
plt.show()

* A multiple bar plot is easy to read because the bars do not overlap, but it makes comparison between groups difficult

Stacked Bar Plot

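- Stacking makes the overall total for each of groups A, B, C, D, and E easy to see

- The bottom parameter leaves the space below a bar empty, which is what lets one bar sit on top of another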

fig, axes = plt.subplots(1, 2, figsize=(15, 7))

group_cnt = student['race/ethnicity'].value_counts().sort_index()
axes[0].bar(group_cnt.index, group_cnt, color='darkgray')
axes[1].bar(group['male'].index, group['male'], color='royalblue')
# bottom : leave the space below each female bar empty by the height of the male bar
axes[1].bar(group['female'].index, group['female'], bottom=group['male'], color='tomato') 

for ax in axes:
    ax.set_ylim(0, 350)
    
plt.show()
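
With more than two groups, each new layer has to start at the running total of the layers beneath it. The following is a minimal sketch of that idea, assuming made-up counts: the `layers` dictionary is hypothetical data, not taken from the student dataset.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless; drop this in a notebook
import matplotlib.pyplot as plt

categories = list("ABCDE")
# Hypothetical counts per category for three layers.
layers = {
    "first": np.array([3, 5, 2, 4, 1]),
    "second": np.array([2, 1, 4, 3, 2]),
    "third": np.array([1, 2, 1, 2, 3]),
}

fig, ax = plt.subplots(figsize=(8, 5))
bottom = np.zeros(len(categories))  # running height of the stack so far
for name, counts in layers.items():
    ax.bar(categories, counts, bottom=bottom, label=name)
    bottom = bottom + counts  # the next layer starts where this one ended
ax.legend()
```

After the loop, `bottom` holds the total height of each stack, which can be handy for setting the y-axis limit.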

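- In a stacked bar plot, the distribution of the bottom layer is easy to read, but the other layers are hard to read because they do not share a common baseline

- A variant, the percentage stacked bar chart, shows each group's share of the whole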

fig, ax = plt.subplots(1, 1, figsize=(12, 7))

group = group.sort_index(ascending=False) # sort in reverse order
total=group['male']+group['female'] # total for each race/ethnicity group


ax.barh(group['male'].index, group['male']/total, 
        color='royalblue')

ax.barh(group['female'].index, group['female']/total, 
        left=group['male']/total, 
        color='tomato')

# Show the percentages (iterate over the ratios directly instead of
# indexing a string-labeled Series with an integer position)
male_ratio = group['male'] / total
female_ratio = group['female'] / total
for index, percent in enumerate(male_ratio):
    ax.text(percent/2, index, f"{percent * 100:.1f}%", ha='center', fontweight='bold')
for index, percent in enumerate(female_ratio):
    ax.text(1 - percent/2, index, f"{percent * 100:.1f}%", ha='center', fontweight='bold')

ax.set_xlim(0, 1)
for s in ['top', 'bottom', 'left', 'right']: # remove all four spines (borders)
    ax.spines[s].set_visible(False) 

plt.show()

Overlapped Bar Plot

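- When there are only two groups, another option is to overlap the two plots

- Because they share the same axes, comparison is easy, and the alpha parameter adjusts transparency so the overlapping region stays visible

group = group.sort_index() # restore ascending order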

fig, axes = plt.subplots(2, 2, figsize=(12, 12))
axes = axes.flatten()

for idx, alpha in enumerate([1, 0.7, 0.5, 0.3]):
    axes[idx].bar(group['male'].index, group['male'], 
                  color='royalblue', 
                  alpha=alpha)
    axes[idx].bar(group['female'].index, group['female'],
                  color='tomato',
                  alpha=alpha)
    axes[idx].set_title(f'Alpha = {alpha}')
    
for ax in axes:
    ax.set_ylim(0, 200)
    
    
plt.show()

Grouped Bar Plot

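- Places the bars for each group side by side within each category

- The most recommended approach

- Relatively tricky to implement in Matplotlib (easy in seaborn)

- Like the bar plots introduced above, it works well when there are no more than about 5 to 7 groups

- If there are many groups, preprocess the data first, for example by merging low-frequency groups into ETC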
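
The ETC preprocessing mentioned above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper name `lump_rare`, the `min_count` threshold, and the sample labels are all hypothetical and not part of the original lecture.

```python
from collections import Counter


def lump_rare(labels, min_count):
    """Count labels, merging every label seen fewer than `min_count` times into 'ETC'."""
    counts = Counter(labels)
    lumped = Counter()
    for label, n in counts.items():
        key = label if n >= min_count else "ETC"
        lumped[key] += n
    return dict(lumped)


# Hypothetical category labels; 'D' and 'E' are rare.
labels = ["A"] * 5 + ["B"] * 4 + ["C"] * 3 + ["D"] + ["E"]
result = lump_rare(labels, min_count=3)
print(result)  # {'A': 5, 'B': 4, 'C': 3, 'ETC': 2}
```

The keys and values of `result` can then be passed to `ax.bar(list(result), list(result.values()))` so that the plot shows only the frequent groups plus a single ETC bar.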

* Building a grouped bar plot in matplotlib →

adjust the x positions, adjust the width, and set xticks and xticklabels

fig, ax = plt.subplots(1, 1, figsize=(12, 7))

idx = np.arange(len(group['male'].index))
width=0.35

ax.bar(idx-width/2, group['male'], 
       color='royalblue',
       width=width, label='Male')

ax.bar(idx+width/2, group['female'], 
       color='tomato',
       width=width, label='Female')

ax.set_xticks(idx)
ax.set_xticklabels(group['male'].index)
ax.legend()    
    
plt.show()

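- How should a grouped bar plot be drawn in matplotlib when there are three or more groups?

- The x offsets of the bars (in units of the bar width) change with the number of groups as follows

2 groups: -1/2, +1/2
3 groups: -1, 0, +1 (-2/2, 0, +2/2)
4 groups: -3/2, -1/2, +1/2, +3/2

- The pattern: the offsets run from -(N-1)/2 to +(N-1)/2, with the numerator increasing in steps of 2

- So for the group at index i (zero-indexed), the x coordinate can be computed as follows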

x + (-N + 1 + 2*i)/2 * width
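
The formula above can be checked with a few lines of plain Python. This is a small sketch; the function name `bar_offsets` is an illustrative assumption, and it returns only the offset that is added to the tick position x.

```python
def bar_offsets(n_groups, width):
    """Offsets from the tick position for n_groups side-by-side bars of the given width."""
    return [(-n_groups + 1 + 2 * i) / 2 * width for i in range(n_groups)]


# With width=1.0 the offsets reproduce the pattern listed above.
print(bar_offsets(2, 1.0))  # [-0.5, 0.5]
print(bar_offsets(3, 1.0))  # [-1.0, 0.0, 1.0]
print(bar_offsets(4, 1.0))  # [-1.5, -0.5, 0.5, 1.5]
```

Note that an even number of groups has no bar centered on the tick itself, while an odd number does.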

- Now let's draw the Parental Level of Education for each race/ethnicity group as a grouped bar plot

group = student.groupby('parental level of education')['race/ethnicity'].value_counts().sort_index()
group_list = sorted(student['race/ethnicity'].unique())
edu_lv = student['parental level of education'].unique()

fig, ax = plt.subplots(1, 1, figsize=(13, 7))

x = np.arange(len(group_list))
width=0.12

for idx, g in enumerate(edu_lv):
    ax.bar(x+(-len(edu_lv)+1+2*idx)*width/2, group[g], 
       width=width, label=g)

ax.set_xticks(x)
ax.set_xticklabels(group_list)
ax.legend()    
    
plt.show()

4) Bar Plot Tips

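Drawing an Accurate Bar Plot

- The principle of proportional ink

When a shaded area represents a numeric value, the area of that shaded region must be directly proportional to the value.
In other words, the amount of ink used to draw a value must be proportional to the value itself.

- By the principle of proportional ink, the y-axis of a bar plot must start at 0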
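
A short calculation shows how much a truncated axis can distort a comparison. This is a sketch with hypothetical numbers: the scores 72 and 66 and the baseline of 60 are made up for illustration, and `drawn_ratio` is an assumed helper name.

```python
def drawn_ratio(a, b, baseline=0.0):
    """Ratio of the drawn bar heights for values a and b when the axis starts at `baseline`."""
    return (a - baseline) / (b - baseline)


# Hypothetical mean scores for two groups.
a, b = 72.0, 66.0
true_ratio = drawn_ratio(a, b)                    # axis starts at 0
truncated_ratio = drawn_ratio(a, b, baseline=60)  # axis starts at 60

print(round(true_ratio, 2))       # 1.09
print(round(truncated_ratio, 2))  # 2.0
```

With the axis starting at 60, one bar is drawn twice as tall as the other even though the underlying values differ by only about 9%.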

score = student.groupby('gender').mean(numeric_only=True).T  # average only the numeric score columns
score

fig, axes = plt.subplots(1, 2, figsize=(15, 7))

idx = np.arange(len(score.index))
width=0.3

for ax in axes:
    ax.bar(idx-width/2, score['male'], 
           color='royalblue',
           width=width)

    ax.bar(idx+width/2, score['female'], 
           color='tomato',
           width=width)

    ax.set_xticks(idx)
    ax.set_xticklabels(score.index)

axes[0].set_ylim(60, 75)  # violates the principle of proportional ink
axes[0].set_title('Violates the principle of proportional ink')
axes[1].set_title('Follows the principle of proportional ink')
plt.show()

- If you want to emphasize the difference between two values, do not truncate the axis; increase the plot's height relative to its width instead

fig, ax = plt.subplots(1, 1, figsize=(6, 10))

idx = np.arange(len(score.index))
width=0.3

ax.bar(idx-width/2, score['male'], 
       color='royalblue',
       width=width)

ax.bar(idx+width/2, score['female'], 
       color='tomato',
       width=width)

ax.set_xticks(idx)
ax.set_xticklabels(score.index)


    
plt.show()

Using Space Well

- The following spacing techniques can make a visualization more effective

X/Y axis limit (.set_xlim(), .set_ylim())
Margins (.margins())
Gap (width)
Spines (.spines[spine].set_visible())

- For comparison, let's draw the same plot twice (one with defaults, one with added margins and adjusted spines)

group_cnt = student['race/ethnicity'].value_counts().sort_index()

fig = plt.figure(figsize=(15, 7))

ax_basic = fig.add_subplot(1, 2, 1)
ax = fig.add_subplot(1, 2, 2)

ax_basic.bar(group_cnt.index, group_cnt)
ax.bar(group_cnt.index, group_cnt,
       width=0.7,
       edgecolor='black',
       linewidth=1,
       color='royalblue'
      )

ax.margins(0.1, 0.1) # padding around the data

for s in ['top', 'right']: # hide the top and right spines
    ax.spines[s].set_visible(False)

plt.show()

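Complexity vs. Simplicity

- Too much information can bury the point you want to emphasize; too little can lead readers to misread the data.

- You need to find the right balance between complexity and simplicity

- Try adding a grid or text labels and consider which version communicates better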

group_cnt = student['race/ethnicity'].value_counts().sort_index()

fig, axes = plt.subplots(1, 2, figsize=(15, 7))

for ax in axes:
    ax.bar(group_cnt.index, group_cnt,
           width=0.7,
           edgecolor='black',
           linewidth=2,
           color='royalblue',
           zorder=10
          )

    ax.margins(0.1, 0.1)

    for s in ['top', 'right']:
        ax.spines[s].set_visible(False)

axes[1].grid(zorder=0)

for idx, value in zip(group_cnt.index, group_cnt):
    axes[1].text(idx, value+5, s=value,
                 ha='center', 
                 fontweight='bold'
                )
        
plt.show()
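
As an alternative to placing each label manually with `ax.text`, Matplotlib 3.4 and later provide `Axes.bar_label`, which annotates every bar in a container. The following is a minimal sketch, assuming Matplotlib >= 3.4; the names and counts are hypothetical stand-ins for `group_cnt`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless; drop this in a notebook
import matplotlib.pyplot as plt

# Hypothetical counts standing in for group_cnt.
names = ["group A", "group B", "group C"]
counts = [12, 30, 21]

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(names, counts, color="royalblue", zorder=10)
# bar_label writes each bar's height just above the bar; padding is in points.
labels = ax.bar_label(bars, padding=3, fontweight="bold")
ax.margins(0.1, 0.1)
```

Compared with the manual loop, this avoids hand-tuning the vertical offset (`value+5`) for each dataset.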

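Adding Error Bars

- Error bars (errorbar) can add information such as the deviation of the data

score_var = student.groupby('gender').std(numeric_only=True).T  # standard deviation of each score by gender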
score_var

fig, ax = plt.subplots(1, 1, figsize=(10, 10))

idx = np.arange(len(score.index))
width=0.3


ax.bar(idx-width/2, score['male'], 
       color='royalblue',
       width=width,
       label='Male',
       yerr=score_var['male'],
       capsize=10
      )

ax.bar(idx+width/2, score['female'], 
       color='tomato',
       width=width,
       label='Female',
       yerr=score_var['female'], # size of the error bar along the y-axis
       capsize=10
      )

ax.set_xticks(idx)
ax.set_xticklabels(score.index)
ax.set_ylim(0, 100)
ax.spines['top'].set_visible(False) # hide the top spine
ax.spines['right'].set_visible(False) # hide the right spine

ax.legend()
ax.set_title('Gender / Score', fontsize=20)
ax.set_xlabel('Subject', fontweight='bold')
ax.set_ylabel('Score', fontweight='bold')

plt.show()
