Cluster analysis was applied to the large Kinta River data set using hierarchical agglomerative cluster analysis (HACA), which has proved to be a very effective clustering technique (Juahir et al., 2010). Stations are commonly classified into three main classes: Low Pollution Sources (LPS), Medium Pollution Sources (MPS) and High Pollution Sources (HPS). The main objective of cluster analysis is to determine the number of groups into which the data should be separated (Steven et al., 2006). Stations are paired according to the strength of their similarity (Steven et al., 2006) and the natural characteristics of the water at those locations. Ward's method with Euclidean distances was used as the measure of similarity. The linkage distance on the y-axis of the dendrogram is reported as the quotient of the linkage distance divided by the maximal distance (Dlink/Dmax), which is usually multiplied by 100 to standardize it (Shrestha and Kazama, 2007; Singh et al., 2004, 2005).
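As an illustration of the procedure described above, the following minimal Python sketch performs Ward's agglomerative clustering on Euclidean distances and rescales the merge heights as (Dlink/Dmax) x 100. The six two-dimensional points and the names S1 to S6 are hypothetical and are not Kinta River measurements, and the helper functions are illustrative only. The merge height uses one common convention for Ward's linkage, sqrt(2 * n_a * n_b / (n_a + n_b)) times the distance between the two cluster centroids; the software used in the study may scale heights differently.

```python
import math
from itertools import combinations

def centroid(points):
    # Mean of each coordinate across the points of one cluster.
    return [sum(column) / len(points) for column in zip(*points)]

def ward_distance(a, b):
    # Ward linkage height between clusters a and b (lists of points).
    scale = math.sqrt(2 * len(a) * len(b) / (len(a) + len(b)))
    return scale * math.dist(centroid(a), centroid(b))

def ward_clustering(data):
    # data maps a station name to its coordinate vector.
    # Returns the merges in order as (members_a, members_b, height).
    clusters = {(name,): [vector] for name, vector in data.items()}
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: ward_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        height = ward_distance(clusters[a], clusters[b])
        clusters[a + b] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, height))
    return merges

# Hypothetical standardized values for six stations (assumption, not real data).
stations = {
    "S1": [0.0, 0.0], "S2": [1.0, 0.0],
    "S3": [10.0, 0.0], "S4": [11.0, 0.0],
    "S5": [20.0, 0.0], "S6": [21.0, 0.0],
}

merges = ward_clustering(stations)
d_max = max(height for _, _, height in merges)
rescaled = [100.0 * height / d_max for _, _, height in merges]

for (a, b, _), value in zip(merges, rescaled):
    print(a, "+", b, "at (Dlink/Dmax) x 100 =", round(value, 2))
```

Cutting the resulting tree after the first three merges leaves three clusters of two stations each, which mirrors the pairing pattern reported for the six monitoring stations.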
The dendrogram in Figure 4.1 shows that all six stations on the Kinta River were paired: 2PK22 with 2PK24 in cluster 1, 2PK19 with 2PK33 in cluster 2, and 2PK25 with 2PK34 in cluster 3. The stations in cluster 1 are located upstream on the Kinta River, where human activity is limited to recreation and Kg Orang Asli. This area is less polluted because it consists only of forest and mountainous terrain. The stations in cluster 2 lie downstream, where the Kinta River flows directly into Sg Perak. The stations in cluster 3 are located in the middle of Ipoh town, which is densely populated and has many human activities such as industry and agriculture. According to Vega et al. (1998), municipal wastewater, manure discharges and runoff from agricultural fields, roadways and streets are the major contributors to river pollution, since all of this wastewater flows into the river.
This result implies that, for further assessment of the Kinta River, the number of stations that need to be monitored across the whole network can be reduced: three sampling stations, one representing each cluster, are sufficient for the monitoring process. This reduces sampling costs and provides a better basis for improving river water quality. It also shows that clustering by HACA is very useful in offering reliable information on water quality for the whole region.
Figure 4.1: Tree diagram from cluster analysis in Q-mode for water samples of Kinta River, showing three distinct groupings (Cluster 1, Cluster 2 and Cluster 3).
Descriptive Statistics
Of the six sampling sites operated by the DOE and considered in the present report, stations 2PK22 and 2PK24 are situated upstream, in a region of relatively low river pollution. Stations 2PK25 and 2PK34 are located in the middle-stream area, where river pollution is extremely high. Stations 2PK19 and 2PK33 are located downstream, in an area of moderate river pollution.
Water quality varies depending on the type of land use. In probability theory and statistics, the variance measures how far a set of numbers is spread out, that is, how far the numbers lie from the mean (expected value). It can describe either the actual probability distribution of an observed population of numbers or the theoretical probability distribution of a not-fully-observed population. The variance of a variable is the mean of the squared deviations of that variable from its expected value. It is therefore a measure of the amount of variation in the values of that variable that takes into account all possible values and their probabilities or weightings, not just the extremes that give the range.
Based on Table 4.1.1, Table 4.1.2 and Table 4.1.3 below, variables such as As, Hg, Cd, Cr and Pb show essentially no variation in the upstream, middle-stream and downstream regions, apart from As, which shows very little variation. This may be due to the consistency of the measurements for these five variables along the Kinta River. In the upstream region, by contrast, the descriptive statistics show that the standard deviations for SS, COND, TUR, DS, TS, Cl and Na are extremely high (Table 4.1.1).
Standard deviation is a widely used measure of variability or diversity in statistics and probability theory. It describes how much variation or "dispersion" there is from the mean. A high standard deviation, as for the variables listed above, shows that the data are spread out over a large range of values. A low standard deviation indicates that the data points tend to be very close to the mean; in other words, there is very little dispersion about the mean value. Technically, the standard deviation of a data set is the square root of its variance, which is also reported in the present report.
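The relationship between the two measures can be checked with a short Python sketch. The list `readings` is a hypothetical set of values invented for illustration, while the (SD, sample variance) pairs are those quoted for station 2PK22 in Table 4.1.1. The sketch assumes the tables use the sample (n - 1) form of the variance, as the column heading suggests.

```python
import statistics

# Hypothetical readings for one variable at one station (assumption, not report data).
readings = [7.1, 8.4, 6.9, 9.0, 8.2, 7.6]

mean = statistics.mean(readings)
sample_variance = statistics.variance(readings)  # sum of squared deviations / (n - 1)
sample_sd = statistics.stdev(readings)           # square root of the sample variance
print(round(mean, 4), round(sample_sd, 4), round(sample_variance, 4))

# Consistency check on (SD, sample variance) pairs quoted in Table 4.1.1 for 2PK22.
table_pairs = {
    "DO": (1.0499, 1.1023),
    "BOD": (0.4894, 0.2395),
    "COD": (7.6038, 57.8184),
}
for name, (sd, variance) in table_pairs.items():
    # sd squared should reproduce the tabulated variance up to rounding.
    print(name, round(sd ** 2, 4), variance)

# A variable whose observations are all identical has no dispersion at all.
constant = [0.0001] * 20
constant_sd = statistics.stdev(constant)
print(constant_sd)
```

Squaring each tabulated SD reproduces the tabulated sample variance to within rounding, which is a quick way to confirm that the two columns of a row belong together.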
In the present report, the highest standard deviation for E-coli was observed in the middle-stream area (station 2PK25, Table 4.1.2), compared with the upstream and downstream areas. This indicates that the E-coli values there are very widely dispersed about the mean. A standard deviation of zero (to four decimal places) was recorded for Hg at the upstream, middle-stream and downstream stations, except for station 2PK33 (Table 4.1.1, Table 4.1.2 and Table 4.1.3). This may be because only a single value was recorded for Hg, so that the observations do not deviate from the mean. Table 4.1.1, Table 4.1.2 and Table 4.1.3 below summarize the descriptive statistics of the five-year data set for all six stations along the Kinta River.
Table 4.1.1 Summarized descriptive statistics for upstream river.
              Station 2PK22                               Station 2PK24
Variable          Mean           SD   Sample Variance         Mean           SD   Sample Variance
DO              7.9650       1.0499            1.1023       6.2250       1.4405            2.0750
BOD             1.3500       0.4894            0.2395       8.3000      15.0581          226.7474
COD            22.8500       7.6038           57.8184      42.1500      32.2854         1042.3450
SS            240.2500     328.5395       107938.2000     251.4500     409.4256       167629.3000
pH              6.9640       0.2887            0.0833       7.2170       0.2706            0.0732
NH3-NL          0.2410       0.8747            0.7650       0.2600       0.3569            0.1274
TEMP           26.1235       2.9835            8.9013      26.6595       2.0822            4.3355
COND          822.7350    3523.8830     12417753.0000     110.6550      67.2350         4520.5410
SAL             0.4720       2.0567            4.2301       0.0490       0.0316            0.0010
TUR           259.5100     298.7566        89255.5200     208.0750     257.7616        66441.0400
DS            392.8000    1686.9870      2845926.0000      50.5500      25.9463          673.2079
TS            633.0500    1674.6100      2804312.0000     302.0000     399.4760       159581.0000
NO3             0.2270       0.1278            0.0163       0.5723       0.4720            0.2228
Cl            221.1750     983.5938       967456.8000       3.9750       3.2545           10.5915
PO4             0.1663       0.4728            0.2235       0.1278       0.4801            0.2305
As              0.0020       0.0019            0.0000       0.0025       0.0016            0.0000
Hg              0.0001       0.0000            0.0000       0.0001       0.0000            0.0000
Cd              0.0006       0.0002            0.0000       0.0006       0.0002            0.0000
Cr              0.0012       0.0014            0.0000       0.0029       0.0041            0.0000
Pb              0.0053       0.0011            0.0000       0.0055       0.0015            0.0000
Zn              0.0304       0.0130            0.0002       0.0449       0.0331            0.0011
Ca              6.5080      23.1950          538.0083       7.2830       7.6191           58.0505
Fe              0.4244       0.3535            0.1250       0.4006       0.3386            0.1146
K               5.6000      18.5284          343.3019       3.9695       2.8152            7.9256
Mg             12.5716      54.0073         2916.7929       1.1220       0.5280            0.2788
Na            111.2835     479.8701       230275.2687       6.9625       4.4235           19.5674
OG              0.5250       0.1118            0.0125       0.5000       0.0000
MBAS            0.0250       0.0000            0.0000       0.0250       0.0000            0.0000
E-coli       6571.0000    7571.6076     57329241.0526   24620.0000   25271.7983    638663789.4737
Coliform    29730.0000   32035.1632   1026251684.2105  100040.0000  120877.5039  14611370947.3684
Table 4.1.2 Summarized descriptive statistics for middle stream river.
              Station 2PK25                               Station 2PK34
Variable          Mean           SD   Sample Variance         Mean           SD   Sample Variance
DO              2.8960       1.5680            2.4585       3.5235       1.3662            1.8664
BOD             6.2500       3.8508           14.8290       4.4000       2.2337            4.9895
COD            39.4000      11.4818          131.8316      35.7000      14.4445          208.6421
SS            257.1000     465.0055       216230.1000     143.9500     121.4251        14744.0500
pH              6.9330       0.3020            0.0912       7.0015       0.2322            0.0539
NH3-NL          1.0990       1.0212            1.0429       1.0853       0.9105            0.8290
TEMP           28.0830       1.6496            2.7210      29.0615       1.7126            2.9330
COND          222.2650     126.2357        15935.4500     176.2750      51.9880         2702.7570
SAL             0.1025       0.0685            0.0047       0.0790       0.0249            0.0006
TUR           235.4400     400.0680       160054.4000     150.2250     120.6443        14555.0600
DS             99.4500      72.1865         5210.8921      90.3000      33.1854         1101.2740
TS            356.5500     451.7171       204048.4000     234.2500     116.0249        13461.7800
NO3             0.7208       0.5887            0.3465       0.6118       0.2121            0.0450
Cl              9.8500      16.2619          264.4500       7.9500       5.1959           26.9974
PO4             0.1785       0.4952            0.2452       0.1330       0.4408            0.1943
As              0.0118       0.0117            0.0001       0.0117       0.0104            0.0001
Hg              0.0001       0.0000            0.0000       0.0001       0.0000            0.0000
Cd              0.0006       0.0002            0.0000       0.0006       0.0002            0.0000
Cr              0.0042       0.0058            0.0000       0.0045       0.0056            0.0000
Pb              0.0053       0.0011            0.0000       0.0055       0.0015            0.0000
Zn              0.0658       0.0554            0.0031       0.0369       0.0202            0.0004
Ca             18.3705      10.9555          120.0239      18.5225       7.0208           49.2910
Fe              0.3376       0.3913            0.1531       0.2296       0.2502            0.0626
K               4.7720       2.2540            5.0805       4.4966       1.9021            3.6181
Mg              2.5120       1.2496            1.5616       2.6517       0.9069            0.8225
Na             11.9975      14.5249          210.9740       9.6846       4.4999           20.2487
OG              0.0000       0.5000            0.0000       0.5000       0.0000            0.0000
MBAS            0.0250       0.0000            0.0000       0.0250       0.0000            0.0000
E-coli      39755.0000   68124.5259   4640951026.3158   24360.0000   26038.8657    678022526.3158
Coliform    93475.0000  105500.8599  11130431447.3684   72715.0000   71928.2704   5173676078.9474
Table 4.1.3 Summarized descriptive statistics for downstream river.
              Station 2PK19                               Station 2PK33
Variable          Mean           SD   Sample Variance         Mean           SD   Sample Variance
DO              2.9645       0.8604            0.7404       4.9510       1.4366            2.0637
BOD             2.4500       0.8870            0.7868       2.6000       1.9574            3.8316
COD            28.5000      10.4806          109.8421      29.2500      13.4785          181.6711
SS             64.7000      60.6735         3681.2740     115.4500     145.1828        21078.0500
pH              6.7740       0.2788            0.0777       6.9465       0.2858            0.0817
NH3-NL          0.1368       0.1891            0.0358       0.3288       0.5432            0.2951
TEMP           28.7815       1.1635            1.3537      28.0810       1.7052            2.9078
COND          124.7200      34.1352         1165.2100     102.1950      58.3370         3403.1990
SAL             0.0535       0.0179            0.0003       0.0440       0.0280            0.0008
TUR            82.4000      67.7138         4585.1580     123.1700     148.0190        21909.6300
DS             57.2500      15.5931          243.1447      49.7500      27.7430          769.6711
TS            121.9500      59.7094         3565.2100     165.2000     147.3362        21707.9600
NO3             0.5770       0.3270            0.1069       0.4800       0.1767            0.0312
Cl              4.7500       6.5604           43.0395       3.6000       5.6768           32.2263
PO4             0.1285       0.3697            0.1367       0.2018       0.4954            0.2455
As              0.0068       0.0056            0.0000       0.0055       0.0053            0.0000
Hg              0.0001       0.0000            0.0000       0.0001       0.0002            0.0000
Cd              0.0005       0.0001            0.0000       0.0005       0.0001            0.0000
Cr              0.0032       0.0035            0.0000       0.0025       0.0028            0.0000
Pb              0.0058       0.0018            0.0000       0.0053       0.0011            0.0000
Zn              0.0326       0.0152            0.0002       0.0314       0.0148            0.0002
Ca             10.8748       4.3071           18.5510       9.4816       5.1402           26.4213
Fe              0.3960       0.3464            0.1200       0.2753       0.2026            0.0410
K               3.3406       0.8558            0.7324       2.4735       1.1719            1.3734
Mg              2.1914       0.6200            0.3844       1.8161       0.9122            0.8321
Na              5.8165       2.2581            5.0990       4.8384       2.3720            5.6264
OG              0.5000       0.0000            0.0000       0.0000       0.5000            0.0000
MBAS            0.0250       0.0000            0.0000       0.0250       0.0000            0.0000
E-coli       3522.5000    5975.4426     35705914.4737    9115.0000    8827.6644     77927657.8947
Coliform    49825.0000   92226.4711   8505721973.6842   49740.0000   65661.3192        4311408842
Discriminant Analysis
The groupings obtained from cluster analysis were then confirmed by discriminant analysis (DA) in three modes: standard, forward stepwise and backward stepwise. Standard mode considers all parameters. In backward stepwise mode, variables are removed one at a time, beginning with the least significant, until no significant change is obtained. In forward stepwise mode, parameters are included one at a time, beginning with the most significant, until no significant change is obtained (Juahir et al., 2010). The clusters were treated as the dependent variable and the water quality parameters as the independent variables (Juahir et al., 2010). From the results in Figure 4.2, the percentages of correct classification for the standard, forward stepwise and backward stepwise modes were 91.67% (ten discriminant variables), 75.83% (five discriminant variables) and 90.00% (seven discriminant variables), respectively.
Because the backward stepwise mode achieved a high percentage of correct classification with fewer variables than the standard mode, it was selected. Of the 40 observations in the HPS sample region, 35 were correctly assigned to HPS and 5 were misassigned to MPS because of their similar characteristics. Of the 40 observations in the LPS sample region, 37 were correctly assigned to LPS and 3 were misassigned, 1 to HPS and 2 to MPS. Of the 40 observations in the MPS sample region, 36 were correctly assigned to MPS and 4 were misassigned, 3 to HPS and 1 to LPS.
Figure 4.2: Classification matrix for DA of spatial variations in Kinta River
Sample Regions    % Correct    Regions Assigned by DA
                                 HPS    LPS    MPS
Standard DA Mode (10 Variables)
HPS                  90.00%       36      0      4
LPS                  90.00%        1     36      3
MPS                  95.00%        1      1     38
Total                91.67%       38     37     45
Forward Stepwise Mode (5 Variables)
HPS                  67.50%       27      1     12
LPS                  82.50%        2     33      5
MPS                  77.50%        4      5     31
Total                75.83%       33     39     48
Backward Stepwise Mode (7 Variables)
HPS                  87.50%       35      0      5
LPS                  92.50%        1     37      2
MPS                  90.00%        3      1     36
Total                90.00%       39     38     43
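The percentages in the classification matrix can be recomputed from the counts. The sketch below does this for the backward stepwise mode, reading each row as a sample region and each column as the region assigned by DA; the function name `percent_correct` is illustrative only.

```python
# Backward stepwise counts from Figure 4.2.
# Rows: sample regions; columns: regions assigned by DA, in the order HPS, LPS, MPS.
classes = ["HPS", "LPS", "MPS"]
matrix = {
    "HPS": [35, 0, 5],
    "LPS": [1, 37, 2],
    "MPS": [3, 1, 36],
}

def percent_correct(matrix, classes):
    # Per-class accuracy is the diagonal count divided by the row total.
    per_class = {}
    correct_total = 0
    for index, name in enumerate(classes):
        row = matrix[name]
        per_class[name] = 100.0 * row[index] / sum(row)
        correct_total += row[index]
    grand_total = sum(sum(row) for row in matrix.values())
    overall = 100.0 * correct_total / grand_total
    return per_class, overall

per_class, overall = percent_correct(matrix, classes)
print(per_class)  # {'HPS': 87.5, 'LPS': 92.5, 'MPS': 90.0}
print(overall)    # 90.0
```

The off-diagonal entries of each row show where the misassigned observations went, for example the five HPS observations that DA assigned to MPS.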
The backward stepwise mode was then selected for further discussion. It identified DO, NH3-NL, As, Cr, Zn, Ca and E-coli as the most significant parameters contributing to the WQI of the Kinta River, with 90.00% correct classification. This means that these parameters have high variation in terms of their spatial distribution. All the significant parameters for standard, backward and forward DA are shown in Figure 4.3.
Most Significant Parameters.
Variable    Standard    Backward    Forward
            p-value     p-value     p-value
DO          < 0.0001    < 0.0001    < 0.0001
pH          0.002
NH3-NL      < 0.0001    < 0.0001    < 0.0001
TEMP        < 0.0001
NO3         0.007
As          < 0.0001    < 0.0001    < 0.0001
Cr          0.041       0.041
Zn          0.017       0.017       0.017
Ca          < 0.0001    < 0.0001
E-coli      0.002       0.002       0.002
Figure 4.3
Various activities are carried out along the Kinta River, including industry, irrigation, agriculture and residential development. All of these activities are major contributors to river pollution. The backward stepwise mode of DA identified seven parameters as the most significant in polluting the Kinta River: DO, NH3-NL, As, Cr, Zn, Ca and E-coli. From the table above, the p-values for the seven parameters are < 0.0001 for DO, < 0.0001 for NH3-NL, < 0.0001 for As, 0.041 for Cr, 0.017 for Zn, < 0.0001 for Ca and 0.002 for E-coli. All of the p-values are less than 0.05, the significance level corresponding to 95% confidence. (Note: prepare a correlation analysis for these parameters.)
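The comparison against the 0.05 significance level can be sketched as follows. The dictionary transcribes the p-values of Figure 4.3 under two assumptions: entries reported as "< 0.0001" are represented by their upper bound 0.0001, and where a variable lists fewer than three p-values they are assigned to the modes in the order standard, backward, forward, which matches the stated counts of ten, seven and five discriminant variables.

```python
ALPHA = 0.05  # significance level for 95% confidence

# p-values per DA mode; 0.0001 stands in for "< 0.0001" (assumption).
p_values = {
    "standard": {"DO": 0.0001, "pH": 0.002, "NH3-NL": 0.0001, "TEMP": 0.0001,
                 "NO3": 0.007, "As": 0.0001, "Cr": 0.041, "Zn": 0.017,
                 "Ca": 0.0001, "E-coli": 0.002},
    "backward": {"DO": 0.0001, "NH3-NL": 0.0001, "As": 0.0001, "Cr": 0.041,
                 "Zn": 0.017, "Ca": 0.0001, "E-coli": 0.002},
    "forward": {"DO": 0.0001, "NH3-NL": 0.0001, "As": 0.0001, "Zn": 0.017,
                "E-coli": 0.002},
}

# Keep only the variables whose p-value falls below the significance level.
significant = {
    mode: [name for name, p in values.items() if p < ALPHA]
    for mode, values in p_values.items()
}

for mode, names in significant.items():
    print(mode, len(names), names)
```

Every listed variable passes the threshold in each mode, so the three modes differ only in which variables they retain.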
Dissolved oxygen (DO) is best described as the total amount of oxygen molecules dissolved in water. The amount of dissolved oxygen depends on the organic compounds present in the water body: when organic compounds increase, dissolved oxygen decreases, because the decomposition of organic waste by bacteria consumes oxygen. In this case, the organic compounds in the water can be affected by human activities such as the discharge of untreated waste into the river and runoff from dairies, feedlots and other agricultural operations along the Kinta River.
Ammonia nitrogen (NH3-NL) comes mainly from agricultural runoff, where fertilizer is used in agricultural activities, and from industrial discharge (Fisher et al., 2000). Arsenic (As) comes from sources such as glass and wood. Chromium (Cr) mainly enters the water body from electroplating, leather tanning and textile industries, that is, from industrial sources. Zinc (Zn) is commonly used in roofing for houses. Because the Kinta River flows through the city, the many houses and buildings in that area can be sources of this pollutant in the water. Zn can also be mobilized easily into streams and the atmosphere when it comes into contact with acid rain or smog. Calcium (Ca) is a component of water hardness and occurs naturally in river water, where the land use is mostly related to agriculture and forest. E-coli comes mainly from sources related to wastewater treatment plants and municipal sewage (Frenzel and Couvillion, 2002), as well as from animal husbandry and oxidation ponds. Box plots for the seven parameters above were constructed and are shown in Figure 4.4.
Box Plots
In descriptive statistics, a box plot (also known as a box-and-whisker diagram or plot) is a convenient way of graphically depicting groups of numerical data through their five-number summaries: the smallest observation (sample minimum), lower quartile (Q1), median (Q2), upper quartile (Q3) and largest observation (sample maximum). A box plot may also indicate which observations, if any, might be considered outliers. In this exercise, box plots are used to show the descriptive statistics graphically and to identify the outliers in every variable for each station. The box plot is interpreted as shown in Figure 1 below, in which the box itself contains the middle 50% of the data. The upper edge (hinge) of the box indicates the 75th percentile of the data set, and the lower hinge indicates the 25th percentile. The range between these two hinges is known as the inter-quartile range. The straight horizontal line in the box represents the median of the data, while the plus sign "+" indicates the mean. The ends of the vertical lines, or whiskers, represent the minimum and maximum data values, unless outliers are present, in which case each whisker extends to at most 1.5 times the inter-quartile range beyond the box. Points outside the ends of the whiskers are therefore outliers or suspected outliers, which are discussed in detail later.
Figure 1 : Box Plot diagram
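The quantities that a box plot displays can be computed directly, as in the following minimal sketch. The sample values are hypothetical, the function name `box_plot_summary` is illustrative, and the quartiles use the "inclusive" interpolation method of Python's `statistics.quantiles`; plotting software may use a slightly different quartile definition.

```python
import statistics

def box_plot_summary(values):
    data = sorted(values)
    # Quartiles Q1, Q2 (median) and Q3 by linear interpolation.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Whiskers reach at most 1.5 x IQR beyond the box edges (hinges).
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "mean": statistics.mean(data),
        "iqr": iqr,
        "whiskers": (inside[0], inside[-1]),
        "outliers": [x for x in data if x < lower_fence or x > upper_fence],
    }

# Hypothetical concentrations with one unusually high value (assumption).
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
summary = box_plot_summary(sample)
for key, value in summary.items():
    print(key, value)
```

In this example the value 30 lies above the upper fence of 14.5, so it is flagged as an outlier and the upper whisker stops at 9, the largest value inside the fence.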
The box plots in Figure 4.4 show the distribution patterns of the seven parameters that contribute to the WQI of the Kinta River.
Figure 4.4(a)
The box plot shows the distribution of DO for the LPS, MPS and HPS classes. The mean for LPS is the highest compared with MPS and HPS. A low DO concentration means that the water is polluted, and a high DO concentration means that the water is clean. Too much organic matter in the water body decreases the DO concentration, because the decomposition of organic matter by bacteria requires oxygen.
For NH3-NL, the maximum value for the HPS class is clearly higher than for MPS and LPS, and the mean is also highest for HPS. This distribution is as expected, because locations labelled as high pollution sources have high concentrations of NH3-NL. This pollutant may come from the use of fertilizer in agricultural areas along the Kinta River. Untreated wastewater discharged by industry is another possible source.
Figure 4.4(b)
For zinc, the distribution across the three classes shows that the HPS class again has the highest maximum value, mean and median. There are also three outliers that are unexpectedly higher than all the other values. The LPS class has two outliers, but they remain within range. The mean for LPS is slightly higher than the mean for MPS. This may be due to a technical problem with the monitoring equipment or to changes in weather. Zn is used in house roofing; when it comes into contact with smog, it is dispersed into the atmosphere, which is one of the transport media by which pollutants enter the water.
The next box plot shows the distribution of the chromium data. There are five outliers among the three classes, with the highest outlier in the HPS class. The mean for LPS is the lowest and the mean for HPS is the highest of the three classes. The sources of chromium are usually industrial.
Figure 4.4(c)
The distributions of arsenic (As) and calcium (Ca) are shown in the box plots above. Both distributions are as expected, with the highest mean, median and maximum value in the HPS class. There are no outliers for As, but for Ca there are a few outliers in the LPS and HPS classes, and the most extreme value is in the LPS region. Arsenic and calcium both occur naturally in soils and rocks. Calcium usually determines water hardness: when the calcium concentration is high, the water becomes hard.
Figure 4.4(d)
The box plot above shows the distribution of the E-coli data. The highest mean and median are in the HPS class, the second highest in LPS and the lowest in MPS. The most extreme value is in HPS, but the distribution is still as expected. Overall, the distributions of all parameters shown in the box plots are as expected, with the highest mean of each parameter in the HPS class, except for DO, whose mean is highest in the LPS class. There are several extreme outliers for each parameter. These may be caused by changes in weather, such as very hot or rainy days, or by malfunctioning monitoring equipment or other technical problems.
ANOVA Test
An ANOVA test was then run to test for significant differences among the significant parameters determined by backward stepwise DA in Figure 4.3. The results are shown in Figure 4.5.
Source of Variation    SS             MS             F              P-value     F crit
Between Groups         33280278852    5546713142     33.97917529    8.62E-37    2.109447
Within Groups          1.35978E+11    163238604.1
Total                  1.69258E+11
Figure 4.5
H null: There is no significant difference among the parameters.
H alternative: There is a significant difference among the parameters.
From the figure above, the p-value is 8.62E-37. Because the p-value is far smaller than 0.05, the null hypothesis is rejected: there is a significant difference among the parameters determined by the backward stepwise mode. In addition, the F value (33.98) is greater than F crit (2.11), which strongly supports a significant effect among the seven parameters, DO, NH3-NL, As, Cr, Zn, Ca and E-coli, on the WQI of the Kinta River.
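The F statistic can be recomputed from the sums of squares and mean squares in Figure 4.5. The sketch below also recovers the degrees of freedom implied by SS / MS, which the figure does not list; treating them as 6 and 833 is therefore an inference from the reported values, not a figure stated in the report.

```python
# Values transcribed from Figure 4.5.
ss_between = 33280278852
ms_between = 5546713142
ss_within = 1.35978e11
ms_within = 163238604.1
f_crit = 2.109447

# F is the ratio of the between-group to the within-group mean square.
f_value = ms_between / ms_within

# Each mean square is SS / df, so the degrees of freedom follow from SS / MS.
df_between = round(ss_between / ms_between)
df_within = round(ss_within / ms_within)

# The null hypothesis is rejected when F exceeds the critical value.
reject_null = f_value > f_crit

print(round(f_value, 4), df_between, df_within, reject_null)
```

An implied between-groups value of 6 degrees of freedom is consistent with seven groups, one for each parameter retained by the backward stepwise mode.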