Recipe 15: Comparing the Means of Several Independent Samples with Unequal Variances (Welch's Test for Analysis of Variance, parametric)
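What is a test that compares the means of several independent samples with unequal variances? Put simply, it is a hypothesis test that compares several groups of independently sampled data. **Note** that "comparing variances" and "analysis of variance" are different things: analysis of variance compares the means of several groups. What does a statistical hypothesis test actually test? Look at H0. For example, for several independent samples, H0: μ1 = μ2 = … = μk and HA: at least one group mean differs, so the test asks whether the group means are all equal. Because H0 states equality, the test is two-tailed in the sense that HA is non-directional.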
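For reference, the statistic behind Welch's test can be written out as below. This is the standard textbook form, not something stated in this post. The symbols are assumptions introduced here: $n_i$, $\bar{x}_i$ and $s_i^2$ are the size, mean and sample variance of group $i$, and $k$ is the number of groups.

```latex
\begin{aligned}
w_i &= \frac{n_i}{s_i^2}, \qquad
W = \sum_{i=1}^{k} w_i, \qquad
\tilde{x} = \frac{1}{W}\sum_{i=1}^{k} w_i \bar{x}_i, \\
\Lambda &= \sum_{i=1}^{k} \frac{(1 - w_i / W)^2}{n_i - 1}, \\
F &= \frac{\dfrac{1}{k-1}\sum_{i=1}^{k} w_i (\bar{x}_i - \tilde{x})^2}
          {1 + \dfrac{2(k-2)}{k^2-1}\,\Lambda}, \qquad
\mathrm{df}_1 = k - 1, \qquad
\mathrm{df}_2 = \frac{k^2 - 1}{3\Lambda}.
\end{aligned}
```

Each group is weighted by $n_i / s_i^2$, so noisy groups count for less, and the denominator degrees of freedom are generally not an integer.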
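1. When to use: to compare the means of several independent samples with unequal variances. With a single independent variable (factor), this is a one-way analysis of variance.
2. Type of analysis: parametric. Methods that compute the statistic directly from the data values are called parametric; methods that first rank the data and compute the statistic from the ranks are called non-parametric.
3. Assumptions: a parametric analysis requires the data to be normally distributed. When the groups have unequal variances, use Welch's Test for Analysis of Variance.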
4. Example data: 咪路 measured the calcium ion concentration of soil (mg/100 mg soil). The data are:
Soil sample 1 | Soil sample 2 | Soil sample 3
17.8 | 28.5 | 30.5
17.3 | 28.8 | 33.8
16.1 | 29.4 | 31.5
24.2 | 28.5 | 32.1
25.3 | 28.3 | 29.9
25.7 | 28.4 | 29.6
H0: μ1 = μ2 = μ3. HA: the calcium ion concentrations of the soil samples are not all equal.
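Before plotting, a quick numeric summary shows why the equal-variance question matters for these data. The sketch below uses only the standard library; the dictionary name `groups` and the labels S1 to S3 are chosen here for illustration.

```python
from statistics import mean, variance

# Calcium ion concentrations from the table above (mg/100 mg soil)
groups = {
    "S1": [17.8, 17.3, 16.1, 24.2, 25.3, 25.7],
    "S2": [28.5, 28.8, 29.4, 28.5, 28.3, 28.4],
    "S3": [30.5, 33.8, 31.5, 32.1, 29.9, 29.6],
}

# Sample mean and sample variance (n - 1 denominator) for each group
summary = {name: (mean(x), variance(x)) for name, x in groups.items()}

for name, (m, v) in summary.items():
    print(f"{name}: n={len(groups[name])}, mean={m:.4f}, variance={v:.4f}")
```

The S1 variance comes out more than a hundred times the S2 variance, which is the kind of spread Welch's test is meant to handle.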
5. Plot the data to look at their distribution:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
# Concentrations and the soil sample each value belongs to
wt = [17.8,17.3,16.1,24.2,25.3,25.7,28.5,28.8,29.4,28.5,28.3,28.4,30.5,33.8,31.5,32.1,29.9,29.6]
cl = ["S1","S1","S1","S1","S1","S1","S2","S2","S2","S2","S2","S2","S3","S3","S3","S3","S3","S3"]
dat = {'Conc':wt,'Soil':cl}
df = pd.DataFrame(dat)
sns.set(style="whitegrid")
# Box plot per soil sample, with the individual observations overlaid in red
ax = sns.boxplot(x = "Soil", y = "Conc", data = df, width=0.2, palette="Set3")
ax = sns.swarmplot(x = "Soil", y = "Conc", data = df, color = "red")
plt.show()
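Result: [figure: box plot of Conc for each Soil group, with the individual observations overlaid in red]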
6. Check whether the data are normally distributed (H0: the data are normally distributed):
dat1 = [17.8,17.3,16.1,24.2,25.3,25.7]
dat2 = [28.5,28.8,29.4,28.5,28.3,28.4]
dat3 = [30.5,33.8,31.5,32.1,29.9,29.6]
import scipy.stats
scipy.stats.shapiro(dat1)
Result: (0.8173024654388428, 0.0836217924952507)
p = 0.0836 > 0.05, so H0 is not rejected: the data are consistent with a normal distribution.
scipy.stats.shapiro(dat2)
Result: (0.8241334557533264, 0.0957956612110138)
p = 0.0958 > 0.05, so H0 is not rejected: the data are consistent with a normal distribution.
scipy.stats.shapiro(dat3)
Result: (0.9362186789512634, 0.6289042234420776)
p = 0.6289 > 0.05, so H0 is not rejected: the data are consistent with a normal distribution.
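The three Shapiro-Wilk calls above can also be run in one loop. This is a minimal sketch of the same check; the names `groups`, `alpha` and `pvalues` are introduced here for illustration, and exact p-values may differ slightly between SciPy versions.

```python
from scipy import stats

groups = {
    "S1": [17.8, 17.3, 16.1, 24.2, 25.3, 25.7],
    "S2": [28.5, 28.8, 29.4, 28.5, 28.3, 28.4],
    "S3": [30.5, 33.8, 31.5, 32.1, 29.9, 29.6],
}
alpha = 0.05  # significance level used throughout this post

pvalues = {}
for name, values in groups.items():
    # shapiro returns (W statistic, p-value); H0: the data are normally distributed
    w_stat, p = stats.shapiro(values)
    pvalues[name] = p
    verdict = "do not reject H0" if p > alpha else "reject H0"
    print(f"{name}: W={w_stat:.4f}, p={p:.4f} -> {verdict}")
```

With only six observations per group the test has little power, so a non-significant result is weak evidence of normality rather than proof.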
7. Check whether the variances are equal (H0: σ1² = σ2² = σ3²):
Method: Levene test for equal variances (parametric test)
import scipy.stats
dat1 = [17.8,17.3,16.1,24.2,25.3,25.7]
dat2 = [28.5,28.8,29.4,28.5,28.3,28.4]
dat3 = [30.5,33.8,31.5,32.1,29.9,29.6]
scipy.stats.levene(dat1, dat2, dat3, center = 'mean')
Result: LeveneResult(statistic=53.10491367861894, pvalue=1.5636901412229339e-07)
p = 1.5637e-7 < 0.05, so H0: σ1² = σ2² = σ3² is rejected.
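# Equal variances are what one expects when the samples come from the same population; unequal variances indicate that the samples were drawn from different populations.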
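To see what the Levene test is doing, it can be reproduced by hand: with center='mean', the statistic is a one-way ANOVA F computed on each observation's absolute deviation from its own group mean. The sketch below assumes NumPy is available; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

dat1 = [17.8, 17.3, 16.1, 24.2, 25.3, 25.7]
dat2 = [28.5, 28.8, 29.4, 28.5, 28.3, 28.4]
dat3 = [30.5, 33.8, 31.5, 32.1, 29.9, 29.6]

# Absolute deviation of every observation from its own group mean
abs_dev = [np.abs(np.asarray(g) - np.mean(g)) for g in (dat1, dat2, dat3)]

# Ordinary one-way ANOVA on the absolute deviations
f_manual, p_manual = stats.f_oneway(*abs_dev)

# Cross-check against SciPy's built-in Levene test
w_scipy, p_scipy = stats.levene(dat1, dat2, dat3, center="mean")

print(f"manual: F={f_manual:.4f}, p={p_manual:.3e}")
print(f"scipy : W={w_scipy:.4f}, p={p_scipy:.3e}")
```

Note that SciPy's default is center='median' (the Brown-Forsythe variant), which is why the call in this post passes center='mean' explicitly.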
8. Compute Welch's Test for Analysis of Variance for several independent samples with unequal variances in Python:
Method: use pingouin (welch_anova)
conda install -c conda-forge pingouin # install pingouin
# After a successful install, run the following program
import pandas as pd
from pingouin import welch_anova
wt = [17.8,17.3,16.1,24.2,25.3,25.7,28.5,28.8,29.4,28.5,28.3,28.4,30.5,33.8,31.5,32.1,29.9,29.6]
cl = ["S1","S1","S1","S1","S1","S1","S2","S2","S2","S2","S2","S2","S3","S3","S3","S3","S3","S3"]
dat = {'Conc':wt,'Soil':cl}
df = pd.DataFrame(dat)
welch_anova(dv='Conc', between='Soil', data = df)
Result:
  Source  ddof1  ddof2       F     p-unc
0   Soil      2  7.127  15.195  0.002688
p = 2.688e-3 < 0.05, so H0: μ1 = μ2 = μ3 is rejected: the calcium ion concentrations of the soil samples are not all equal.
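As a cross-check that does not need pingouin, Welch's statistic can be computed directly with NumPy and SciPy. The function name `welch_anova_manual` is hypothetical and written for this illustration; it follows the standard textbook formula (weights n/s², with non-integer denominator degrees of freedom) and is not pingouin's own code.

```python
import numpy as np
from scipy import stats

def welch_anova_manual(*groups):
    """Welch's one-way ANOVA; returns (F, df1, df2, p)."""
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    means = np.array([np.mean(g) for g in groups])
    variances = np.array([np.var(g, ddof=1) for g in groups])  # sample variances

    w = n / variances                    # precision weights n_i / s_i^2
    w_sum = w.sum()
    grand = np.sum(w * means) / w_sum    # variance-weighted grand mean

    lam = np.sum((1 - w / w_sum) ** 2 / (n - 1))
    numerator = np.sum(w * (means - grand) ** 2) / (k - 1)
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam

    f_stat = numerator / denominator
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)       # generally not an integer
    p = stats.f.sf(f_stat, df1, df2)     # upper-tail probability of F
    return f_stat, df1, df2, p

dat1 = [17.8, 17.3, 16.1, 24.2, 25.3, 25.7]
dat2 = [28.5, 28.8, 29.4, 28.5, 28.3, 28.4]
dat3 = [30.5, 33.8, 31.5, 32.1, 29.9, 29.6]

f_stat, df1, df2, p = welch_anova_manual(dat1, dat2, dat3)
print(f"F = {f_stat:.3f}, ddof1 = {df1}, ddof2 = {df2:.3f}, p = {p:.6f}")
```

The printed values should agree with the pingouin table above (F = 15.195, ddof2 = 7.127, p = 0.002688) up to rounding.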
The calcium ion concentrations differ, but which soil samples differ from which? The next step is a multiple comparison (Tukey HSD):
from statsmodels.stats.multicomp import MultiComparison
mc = MultiComparison(df['Conc'], df['Soil'])
tkresult = mc.tukeyhsd()
print(tkresult)
The result is shown in the table below:
Multiple Comparison of Means - Tukey HSD, FWER=0.05
====================================================
group1 group2 meandiff  p-adj   lower   upper  reject
----------------------------------------------------
    S1     S2   7.5833  0.001  3.4882 11.6785    True  (differ)
    S1     S3  10.1667  0.001  6.0715 14.2618    True  (differ)
    S2     S3   2.5833 0.2609 -1.5118  6.6785   False  (no difference)
----------------------------------------------------
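Tukey HSD pools the variances of all groups, which sits uneasily with the unequal variances found in step 7. One possible cross-check, swapped in here and not part of the original post, is to run pairwise Welch t-tests with a Bonferroni correction. The sketch below uses `scipy.stats.ttest_ind` with `equal_var=False`; the names `groups`, `pairs` and `rows` are illustrative.

```python
from itertools import combinations
from scipy import stats

groups = {
    "S1": [17.8, 17.3, 16.1, 24.2, 25.3, 25.7],
    "S2": [28.5, 28.8, 29.4, 28.5, 28.3, 28.4],
    "S3": [30.5, 33.8, 31.5, 32.1, 29.9, 29.6],
}

pairs = list(combinations(groups, 2))  # (S1, S2), (S1, S3), (S2, S3)
rows = []
for a, b in pairs:
    # equal_var=False selects Welch's t-test (no pooled variance)
    t_stat, p = stats.ttest_ind(groups[a], groups[b], equal_var=False)
    # Bonferroni: multiply by the number of comparisons, cap at 1
    p_adj = min(1.0, p * len(pairs))
    rows.append((a, b, t_stat, p, p_adj))

for a, b, t_stat, p, p_adj in rows:
    print(f"{a} vs {b}: t={t_stat:.3f}, p={p:.4f}, Bonferroni p={p_adj:.4f}")
```

Because each pair uses its own variances, the conclusions can differ from the pooled Tukey table, particularly for pairs that involve the low-variance S2 group. A Games-Howell test would be another option for the same purpose.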