Large Number Of Kinta River Data Set Biology Essay

Published: November 2, 2015 Words: 3380

Cluster analysis was applied to the large Kinta River data set using hierarchical agglomerative cluster analysis (HACA), which has proved to be a very effective clustering method (Juahir et al., 2010). The stations are commonly classified into three main classes: Low Pollution Sources (LPS), Medium Pollution Sources (MPS) and High Pollution Sources (HPS). The main objective of cluster analysis is to determine the number of groups into which the data should be separated (Steven et al., 2006). Stations are paired according to the strength of their similarity (Steven et al., 2006) and the natural characteristics of the water at those locations. Ward's method with Euclidean distances was used as the measure of similarity, expressed as the quotient of the linkage distance divided by the maximal distance. The quotient is usually multiplied by 100 to standardize the linkage distance represented on the y-axis (Shrestha and Kazama, 2007; Singh et al., 2004, 2005).
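The clustering procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the software used in the study: the six points are hypothetical stand-ins for standardized station data, the function names `ward_linkage` and `rescale` are invented here, and Ward's method is implemented with the Lance-Williams update on squared Euclidean distances.

```python
from itertools import combinations


def ward_linkage(points):
    """Agglomerate points with Ward's method.

    Returns one tuple per merge: (members_a, members_b, linkage_distance).
    """
    clusters = {i: (i,) for i in range(len(points))}  # cluster id -> member indices
    # Squared Euclidean distance between every pair of single-point clusters.
    dist = {
        frozenset((i, j)): sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
        for i, j in combinations(clusters, 2)
    }
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)  # closest pair of clusters
        i, j = sorted(pair)
        d_ij = dist.pop(pair)
        ni, nj = len(clusters[i]), len(clusters[j])
        merges.append((clusters[i], clusters[j], d_ij ** 0.5))
        updated = {}
        for k in clusters:
            if k in (i, j):
                continue
            nk = len(clusters[k])
            d_ki = dist.pop(frozenset((k, i)))
            d_kj = dist.pop(frozenset((k, j)))
            # Lance-Williams update for Ward's method (squared distances).
            updated[frozenset((k, next_id))] = (
                (ni + nk) * d_ki + (nj + nk) * d_kj - nk * d_ij
            ) / (ni + nj + nk)
        dist.update(updated)
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return merges


def rescale(merges):
    """Express each linkage distance as (Dlink / Dmax) x 100."""
    d_max = max(d for _, _, d in merges)
    return [(a, b, 100 * d / d_max) for a, b, d in merges]


# Hypothetical coordinates for six stations (NOT measured Kinta River data).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
merges = ward_linkage(points)
for members_a, members_b, height in rescale(merges):
    print(members_a, members_b, round(height, 2))
```

With these made-up points the first three merges pair neighbouring points, which mirrors the way the six stations pair into three clusters; the final merge always sits at 100 on the rescaled axis.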

The dendrogram in Figure 4.1 shows that the six stations on the Kinta River were paired as follows: 2PK22 with 2PK24 in cluster 1, 2PK19 with 2PK33 in cluster 2, and 2PK25 with 2PK34 in cluster 3. The stations in cluster 1 are located upstream, where human activity is limited to recreation and an Orang Asli village (Kg Orang Asli). This location is less polluted because the area is forested and mountainous. The stations in cluster 2 lie downstream, where the Kinta River flows directly into Sg Perak. The stations in cluster 3 are located in the middle of Ipoh town, which is densely populated and has many human activities such as industry and agriculture. According to Vega et al. (1998), municipal wastewater, manure discharges and runoff from agricultural fields, roadways and streets are the major contributors to river pollution, since all of this wastewater flows into the river.

These results imply that further assessment of the Kinta River can reduce the number of stations monitored across the whole network: three sampling stations are sufficient for the monitoring process. This lowers the cost of sampling and offers a better route to improving river water quality. The clustering by HACA is therefore very useful in providing reliable information on water quality for the whole region.

Figure 4.1: Tree diagram from cluster analysis in Q-mode for water samples of Kinta River, showing three distinct groupings (Cluster 1, Cluster 2 and Cluster 3).

Descriptive Statistics

Of the six sampling sites operated by the DOE that are considered in the present report, stations 2PK22 and 2PK24 are situated upstream, in a region of low river pollution. Stations 2PK25 and 2PK34 are located in the middle-stream area, where river pollution is extremely high. Stations 2PK19 and 2PK33 are located downstream, in an area of moderate river pollution.

Water quality varies with the type of land use. In probability theory and statistics, the variance measures how far a set of numbers is spread out, describing how far the numbers lie from the mean (expected value), whether for the actual probability distribution of an observed population or the theoretical probability distribution of a not-fully-observed population. The variance of a distribution is the mean of the squared deviations of the variable from its expected value. It is thus a measure of the amount of variation in the values of that variable, taking into account all possible values and their probabilities or weightings, and not just the extremes that give the range.
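For reference, the usual sample estimators behind "Mean", "SD" and "Sample Variance" columns can be written as follows. This assumes the conventional n - 1 denominator for the sample variance, which the text does not state explicitly; the symbols n, x_i, \bar{x} and s are introduced here only for illustration.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s^{2} = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^{2}, \qquad
s = \sqrt{s^{2}}
```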

Based on Tables 4.1.1 to 4.1.3 below, the variables As, Hg, Cd, Cr and Pb show no variation (a sample variance of 0.0000 to four decimal places) in the upstream and downstream regions, and likewise in the middle-stream region except for As, which shows very little variation (0.0001). This may be due to the consistency of the data for these five variables along the Kinta River. Meanwhile, in the upstream region the descriptive statistics show that the standard deviations of SS, COND, TUR, DS, TS, Cl and Na are extremely high (Table 4.1.1).

Standard deviation is a widely used measure of variability or diversity in statistics and probability theory. It describes how much variation or "dispersion" there is from the mean. A high standard deviation, as for the variables listed above, shows that the data are spread over a large range of values, whereas a low standard deviation indicates that the data points tend to be very close to the mean; in other words, there is very little dispersion from the mean value. Technically, the standard deviation of a data set is the square root of its variance, which is also reported in the present report.
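The square-root relationship can be checked directly against the reported figures. The short sketch below takes a few (SD, sample variance) pairs as listed for station 2PK22 in Table 4.1.1 and confirms that squaring the SD reproduces the variance to within rounding; the helper name `sd_matches_variance` and the tolerance are choices made here for illustration.

```python
import math

# (variable, SD, sample variance) as reported for station 2PK22 in Table 4.1.1
reported = [
    ("DO", 1.0499, 1.1023),
    ("BOD", 0.4894, 0.2395),
    ("COD", 7.6038, 57.8184),
    ("pH", 0.2887, 0.0833),
]


def sd_matches_variance(sd, variance, rel_tol=2e-3):
    """Return True if sd squared equals the variance, allowing for rounding."""
    return math.isclose(sd ** 2, variance, rel_tol=rel_tol)


for name, sd, variance in reported:
    print(name, round(sd ** 2, 4), variance, sd_matches_variance(sd, variance))
```

The tolerance is loose because the table values are rounded to four decimal places.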

In the present report, the highest standard deviation for E-coli was observed in the middle-stream region (station 2PK25; Table 4.1.2) compared with the upstream and downstream areas. This indicates that the E-coli data there are very widely dispersed about the mean value. A standard deviation of zero for Hg appeared at the upstream, middle-stream and downstream stations, except at station 2PK33 (Table 4.1.1, Table 4.1.2 and Table 4.1.3). This may be because only a single value was recorded for Hg, so the data show no deviation from the mean. Table 4.1.1, Table 4.1.2 and Table 4.1.3 below summarize the descriptive statistics of the five-year data set for all six stations along the Kinta River.

Table 4.1.1 Summarized descriptive statistics for upstream river.

| Variable | 2PK22 Mean | 2PK22 SD | 2PK22 Sample Variance | 2PK24 Mean | 2PK24 SD | 2PK24 Sample Variance |
|---|---|---|---|---|---|---|
| DO | 7.9650 | 1.0499 | 1.1023 | 6.2250 | 1.4405 | 2.0750 |
| BOD | 1.3500 | 0.4894 | 0.2395 | 8.3000 | 15.0581 | 226.7474 |
| COD | 22.8500 | 7.6038 | 57.8184 | 42.1500 | 32.2854 | 1042.3450 |
| SS | 240.2500 | 328.5395 | 107938.2000 | 251.4500 | 409.4256 | 167629.3000 |
| pH | 6.9640 | 0.2887 | 0.0833 | 7.2170 | 0.2706 | 0.0732 |
| NH3-NL | 0.2410 | 0.8747 | 0.7650 | 0.2600 | 0.3569 | 0.1274 |
| TEMP | 26.1235 | 2.9835 | 8.9013 | 26.6595 | 2.0822 | 4.3355 |
| COND | 822.7350 | 3523.8830 | 12417753.0000 | 110.6550 | 67.2350 | 4520.5410 |
| SAL | 0.4720 | 2.0567 | 4.2301 | 0.0490 | 0.0316 | 0.0010 |
| TUR | 259.5100 | 298.7566 | 89255.5200 | 208.0750 | 257.7616 | 66441.0400 |
| DS | 392.8000 | 1686.9870 | 2845926.0000 | 50.5500 | 25.9463 | 673.2079 |
| TS | 633.0500 | 1674.6100 | 2804312.0000 | 302.0000 | 399.4760 | 159581.0000 |
| NO3 | 0.2270 | 0.1278 | 0.0163 | 0.5723 | 0.4720 | 0.2228 |
| Cl | 221.1750 | 983.5938 | 967456.8000 | 3.9750 | 3.2545 | 10.5915 |
| PO4 | 0.1663 | 0.4728 | 0.2235 | 0.1278 | 0.4801 | 0.2305 |
| As | 0.0020 | 0.0019 | 0.0000 | 0.0025 | 0.0016 | 0.0000 |
| Hg | 0.0001 | 0.0000 | 0.0000 | 0.0001 | 0.0000 | 0.0000 |
| Cd | 0.0006 | 0.0002 | 0.0000 | 0.0006 | 0.0002 | 0.0000 |
| Cr | 0.0012 | 0.0014 | 0.0000 | 0.0029 | 0.0041 | 0.0000 |
| Pb | 0.0053 | 0.0011 | 0.0000 | 0.0055 | 0.0015 | 0.0000 |
| Zn | 0.0304 | 0.0130 | 0.0002 | 0.0449 | 0.0331 | 0.0011 |
| Ca | 6.5080 | 23.1950 | 538.0083 | 7.2830 | 7.6191 | 58.0505 |
| Fe | 0.4244 | 0.3535 | 0.1250 | 0.4006 | 0.3386 | 0.1146 |
| K | 5.6000 | 18.5284 | 343.3019 | 3.9695 | 2.8152 | 7.9256 |
| Mg | 12.5716 | 54.0073 | 2916.7929 | 1.1220 | 0.5280 | 0.2788 |
| Na | 111.2835 | 479.8701 | 230275.2687 | 6.9625 | 4.4235 | 19.5674 |
| OG | 0.5250 | 0.1118 | 0.0125 | 0.5000 | 0.0000 | |
| MBAS | 0.0250 | 0.0000 | 0.0000 | 0.0250 | 0.0000 | 0.0000 |
| E-coli | 6571.0000 | 7571.6076 | 57329241.0526 | 24620.0000 | 25271.7983 | 638663789.4737 |
| Coliform | 29730.0000 | 32035.1632 | 1026251684.2105 | 100040.0000 | 120877.5039 | 14611370947.3684 |

Table 4.1.2 Summarized descriptive statistics for middle stream river.

| Variable | 2PK25 Mean | 2PK25 SD | 2PK25 Sample Variance | 2PK34 Mean | 2PK34 SD | 2PK34 Sample Variance |
|---|---|---|---|---|---|---|
| DO | 2.8960 | 1.5680 | 2.4585 | 3.5235 | 1.3662 | 1.8664 |
| BOD | 6.2500 | 3.8508 | 14.8290 | 4.4000 | 2.2337 | 4.9895 |
| COD | 39.4000 | 11.4818 | 131.8316 | 35.7000 | 14.4445 | 208.6421 |
| SS | 257.1000 | 465.0055 | 216230.1000 | 143.9500 | 121.4251 | 14744.0500 |
| pH | 6.9330 | 0.3020 | 0.0912 | 7.0015 | 0.2322 | 0.0539 |
| NH3-NL | 1.0990 | 1.0212 | 1.0429 | 1.0853 | 0.9105 | 0.8290 |
| TEMP | 28.0830 | 1.6496 | 2.7210 | 29.0615 | 1.7126 | 2.9330 |
| COND | 222.2650 | 126.2357 | 15935.4500 | 176.2750 | 51.9880 | 2702.7570 |
| SAL | 0.1025 | 0.0685 | 0.0047 | 0.0790 | 0.0249 | 0.0006 |
| TUR | 235.4400 | 400.0680 | 160054.4000 | 150.2250 | 120.6443 | 14555.0600 |
| DS | 99.4500 | 72.1865 | 5210.8921 | 90.3000 | 33.1854 | 1101.2740 |
| TS | 356.5500 | 451.7171 | 204048.4000 | 234.2500 | 116.0249 | 13461.7800 |
| NO3 | 0.7208 | 0.5887 | 0.3465 | 0.6118 | 0.2121 | 0.0450 |
| Cl | 9.8500 | 16.2619 | 264.4500 | 7.9500 | 5.1959 | 26.9974 |
| PO4 | 0.1785 | 0.4952 | 0.2452 | 0.1330 | 0.4408 | 0.1943 |
| As | 0.0118 | 0.0117 | 0.0001 | 0.0117 | 0.0104 | 0.0001 |
| Hg | 0.0001 | 0.0000 | 0.0000 | 0.0001 | 0.0000 | 0.0000 |
| Cd | 0.0006 | 0.0002 | 0.0000 | 0.0006 | 0.0002 | 0.0000 |
| Cr | 0.0042 | 0.0058 | 0.0000 | 0.0045 | 0.0056 | 0.0000 |
| Pb | 0.0053 | 0.0011 | 0.0000 | 0.0055 | 0.0015 | 0.0000 |
| Zn | 0.0658 | 0.0554 | 0.0031 | 0.0369 | 0.0202 | 0.0004 |
| Ca | 18.3705 | 10.9555 | 120.0239 | 18.5225 | 7.0208 | 49.2910 |
| Fe | 0.3376 | 0.3913 | 0.1531 | 0.2296 | 0.2502 | 0.0626 |
| K | 4.7720 | 2.2540 | 5.0805 | 4.4966 | 1.9021 | 3.6181 |
| Mg | 2.5120 | 1.2496 | 1.5616 | 2.6517 | 0.9069 | 0.8225 |
| Na | 11.9975 | 14.5249 | 210.9740 | 9.6846 | 4.4999 | 20.2487 |
| OG | 0.0000 | 0.5000 | 0.0000 | 0.5000 | 0.0000 | 0.0000 |
| MBAS | 0.0250 | 0.0000 | 0.0000 | 0.0250 | 0.0000 | 0.0000 |
| E-coli | 39755.0000 | 68124.5259 | 4640951026.3158 | 24360.0000 | 26038.8657 | 678022526.3158 |
| Coliform | 93475.0000 | 105500.8599 | 11130431447.3684 | 72715.0000 | 71928.2704 | 5173676078.9474 |

Table 4.1.3 Summarized descriptive statistics for downstream river.

| Variable | 2PK19 Mean | 2PK19 SD | 2PK19 Sample Variance | 2PK33 Mean | 2PK33 SD | 2PK33 Sample Variance |
|---|---|---|---|---|---|---|
| DO | 2.9645 | 0.8604 | 0.7404 | 4.9510 | 1.4366 | 2.0637 |
| BOD | 2.4500 | 0.8870 | 0.7868 | 2.6000 | 1.9574 | 3.8316 |
| COD | 28.5000 | 10.4806 | 109.8421 | 29.2500 | 13.4785 | 181.6711 |
| SS | 64.7000 | 60.6735 | 3681.2740 | 115.4500 | 145.1828 | 21078.0500 |
| pH | 6.7740 | 0.2788 | 0.0777 | 6.9465 | 0.2858 | 0.0817 |
| NH3-NL | 0.1368 | 0.1891 | 0.0358 | 0.3288 | 0.5432 | 0.2951 |
| TEMP | 28.7815 | 1.1635 | 1.3537 | 28.0810 | 1.7052 | 2.9078 |
| COND | 124.7200 | 34.1352 | 1165.2100 | 102.1950 | 58.3370 | 3403.1990 |
| SAL | 0.0535 | 0.0179 | 0.0003 | 0.0440 | 0.0280 | 0.0008 |
| TUR | 82.4000 | 67.7138 | 4585.1580 | 123.1700 | 148.0190 | 21909.6300 |
| DS | 57.2500 | 15.5931 | 243.1447 | 49.7500 | 27.7430 | 769.6711 |
| TS | 121.9500 | 59.7094 | 3565.2100 | 165.2000 | 147.3362 | 21707.9600 |
| NO3 | 0.5770 | 0.3270 | 0.1069 | 0.4800 | 0.1767 | 0.0312 |
| Cl | 4.7500 | 6.5604 | 43.0395 | 3.6000 | 5.6768 | 32.2263 |
| PO4 | 0.1285 | 0.3697 | 0.1367 | 0.2018 | 0.4954 | 0.2455 |
| As | 0.0068 | 0.0056 | 0.0000 | 0.0055 | 0.0053 | 0.0000 |
| Hg | 0.0001 | 0.0000 | 0.0000 | 0.0001 | 0.0002 | 0.0000 |
| Cd | 0.0005 | 0.0001 | 0.0000 | 0.0005 | 0.0001 | 0.0000 |
| Cr | 0.0032 | 0.0035 | 0.0000 | 0.0025 | 0.0028 | 0.0000 |
| Pb | 0.0058 | 0.0018 | 0.0000 | 0.0053 | 0.0011 | 0.0000 |
| Zn | 0.0326 | 0.0152 | 0.0002 | 0.0314 | 0.0148 | 0.0002 |
| Ca | 10.8748 | 4.3071 | 18.5510 | 9.4816 | 5.1402 | 26.4213 |
| Fe | 0.3960 | 0.3464 | 0.1200 | 0.2753 | 0.2026 | 0.0410 |
| K | 3.3406 | 0.8558 | 0.7324 | 2.4735 | 1.1719 | 1.3734 |
| Mg | 2.1914 | 0.6200 | 0.3844 | 1.8161 | 0.9122 | 0.8321 |
| Na | 5.8165 | 2.2581 | 5.0990 | 4.8384 | 2.3720 | 5.6264 |
| OG | 0.5000 | 0.0000 | 0.0000 | 0.0000 | 0.5000 | 0.0000 |
| MBAS | 0.0250 | 0.0000 | 0.0000 | 0.0250 | 0.0000 | 0.0000 |
| E-coli | 3522.5000 | 5975.4426 | 35705914.4737 | 9115.0000 | 8827.6644 | 77927657.8947 |
| Coliform | 49825.0000 | 92226.4711 | 8505721973.6842 | 49740.0000 | 65661.3192 | 4311408842 |

Discriminant Analysis

The groupings produced by cluster analysis were then confirmed by discriminant analysis (DA) in three modes: standard, forward stepwise and backward stepwise. Standard mode considers all parameters. In backward stepwise mode, variables are removed one at a time, starting with the least significant, until no further significant change is obtained. In forward stepwise mode, variables are included one at a time, starting with the most significant, until no further significant change is obtained (Juahir et al., 2010). The clusters were treated as the dependent variable and the water quality parameters as the independent variables (Juahir et al., 2010). From the results in Figure 4.2, the percentages of correct classification for the standard, forward stepwise and backward stepwise modes were 91.67% (ten discriminant variables), 75.83% (five discriminant variables) and 90.00% (seven discriminant variables), respectively.

Because it achieved a high percentage of correct classification with only seven variables, the backward stepwise mode was selected. Of the 40 observations from the HPS region, 35 were correctly classified as HPS and 5 were assigned to MPS because of their similar characteristics. Of the 40 observations from the LPS region, 37 were correctly assigned to LPS and 3 were misclassified: 1 as HPS and 2 as MPS. Of the 40 observations from the MPS region, 36 were correctly assigned to MPS and 4 were misclassified: 3 as HPS and 1 as LPS.
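The percentages quoted for the backward stepwise mode follow directly from the counts in the classification matrix of Figure 4.2. The sketch below recomputes them; the dictionary layout and the function name `percent_correct` are illustrative choices, while the counts are those reported for the backward stepwise mode.

```python
# Rows: sample region; columns: region assigned by DA (backward stepwise mode).
matrix = {
    "HPS": {"HPS": 35, "LPS": 0, "MPS": 5},
    "LPS": {"HPS": 1, "LPS": 37, "MPS": 2},
    "MPS": {"HPS": 3, "LPS": 1, "MPS": 36},
}


def percent_correct(matrix):
    """Return (per-class % correct, overall % correct) for a classification matrix."""
    per_class = {
        region: 100 * row[region] / sum(row.values())
        for region, row in matrix.items()
    }
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(matrix[region][region] for region in matrix)
    return per_class, 100 * correct / total


per_class, overall = percent_correct(matrix)
print(per_class)  # per-region percentages
print(overall)    # overall percentage across all 120 observations
```

The diagonal of the matrix holds the correctly classified observations, so each percentage is simply the diagonal count divided by the row total.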

Figure 4.2: Classification matrix for DA of spatial variations in Kinta River

| Sample Regions | % Correct | Assigned by DA: HPS | Assigned by DA: LPS | Assigned by DA: MPS |
|---|---|---|---|---|
| Standard DA Mode (10 Variables) | | | | |
| HPS | 90.00% | 36 | 0 | 4 |
| LPS | 90.00% | 1 | 36 | 3 |
| MPS | 95.00% | 1 | 1 | 38 |
| Total | 91.67% | 38 | 37 | 45 |
| Forward Stepwise Mode (5 Variables) | | | | |
| HPS | 67.50% | 27 | 1 | 12 |
| LPS | 82.50% | 2 | 33 | 5 |
| MPS | 77.50% | 4 | 5 | 31 |
| Total | 75.83% | 33 | 39 | 48 |
| Backward Stepwise Mode (7 Variables) | | | | |
| HPS | 87.50% | 35 | 0 | 5 |
| LPS | 92.50% | 1 | 37 | 2 |
| MPS | 90.00% | 3 | 1 | 36 |
| Total | 90.00% | 39 | 38 | 43 |

The backward stepwise mode was therefore used for further discussion. It identified DO, NH3-NL, As, Cr, Zn, Ca and E-coli as the most significant parameters contributing to the WQI of the Kinta River, with 90.00% correct classification. This means that these parameters vary strongly in their spatial distribution. The significant parameters for the standard, backward and forward DA modes are shown in Figure 4.3.

Most Significant Parameters.

| Variable | Standard (p-value) | Backward (p-value) | Forward (p-value) |
|---|---|---|---|
| DO | < 0.0001 | < 0.0001 | < 0.0001 |
| pH | 0.002 | | |
| NH3-NL | < 0.0001 | < 0.0001 | < 0.0001 |
| TEMP | < 0.0001 | | |
| NO3 | 0.007 | | |
| As | < 0.0001 | < 0.0001 | < 0.0001 |
| Cr | 0.041 | 0.041 | |
| Zn | 0.017 | 0.017 | 0.017 |
| Ca | < 0.0001 | < 0.0001 | |
| E-coli | 0.002 | 0.002 | 0.002 |

Figure 4.3

Various activities are carried out along the Kinta River, such as industry, irrigation, agriculture and residential development. These activities are the major contributors to river pollution. The backward stepwise mode of DA identified seven parameters as the most significant in the pollution of the Kinta River: DO, NH3-NL, As, Cr, Zn, Ca and E-coli. From the table above, the p-values for these seven parameters are < 0.0001 for DO, < 0.0001 for NH3-NL, < 0.0001 for As, 0.041 for Cr, 0.017 for Zn, < 0.0001 for Ca and 0.002 for E-coli. All of the p-values are less than 0.05, the significance level corresponding to 95% confidence.

Dissolved oxygen (DO) is best described as the amount of oxygen dissolved in water. The amount of dissolved oxygen depends on the organic matter in the water body: when organic matter increases, dissolved oxygen decreases, because bacteria consume oxygen as they decompose organic waste. In this case, the organic matter in the water can be affected by human activities such as the discharge of untreated waste into the river and runoff from dairies, feedlots and other agricultural operations along the Kinta River.

Ammonia nitrogen (NH3-NL) comes mainly from agricultural runoff, where fertilizer is used, and from industrial discharge (Fisher et al., 2000). Arsenic (As) comes from sources such as glass and wood. Chromium (Cr) enters water bodies mainly through the electroplating, leather tanning and textile industries. Zinc (Zn) is commonly used as roofing for houses. Because the Kinta River flows through the city, the many houses and buildings in that area can be sources of this pollutant. Zn can also be mobilized easily into streams and the atmosphere on contact with acid rain or smog. Calcium (Ca) is a component of water hardness and occurs naturally in river water, where the land use is mostly agriculture and forest. E-coli comes mainly from wastewater treatment plants and municipal sewage (Frenzel and Couvillion, 2002), as well as from animal husbandry and oxidation ponds. Box plots for the seven parameters above are shown in Figure 4.4.

Box Plots

In descriptive statistics, a box plot or boxplot (also known as a box-and-whisker diagram or plot) is a convenient way of graphically depicting groups of numerical data through their five-number summaries: the smallest observation (sample minimum), lower quartile (Q1), median (Q2), upper quartile (Q3) and largest observation (sample maximum). A box plot may also indicate which observations, if any, might be considered outliers. In this exercise, box plots are used to show the descriptive statistics graphically and to identify the outliers in every variable for each station. The box plot is interpreted as shown in Figure 1 below: the box itself contains the middle 50% of the data. The upper edge (hinge) of the box indicates the 75th percentile of the data set, and the lower hinge indicates the 25th percentile. The range between these two quartiles is known as the inter-quartile range. The horizontal line in the box represents the median of the data, while the plus sign "+" indicates the mean. The ends of the vertical lines, or whiskers, represent the minimum and maximum data values, unless outliers are present, in which case the whiskers extend to a maximum of 1.5 times the inter-quartile range beyond the box. Thus, points outside the ends of the whiskers are outliers or suspected outliers, which are discussed in detail later.

Figure 1 : Box Plot diagram
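The quantities drawn in a box plot can be computed with Python's standard library, as in the sketch below. The sample values are hypothetical, not Kinta River measurements, and the function name `box_plot_summary` is invented here. Quartile conventions differ between software packages; this example assumes the "inclusive" method of `statistics.quantiles`, so the plots in this report may have used slightly different quartiles.

```python
import statistics


def box_plot_summary(values):
    """Five-number summary, mean, whisker ends and outliers (1.5 x IQR rule)."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1                      # inter-quartile range (height of the box)
    lower_fence = q1 - 1.5 * iqr       # whiskers may not extend beyond the fences
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "mean": statistics.mean(data),
        "whiskers": (inside[0], inside[-1]),
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


# Hypothetical concentrations with one unusually high reading.
sample = [2, 4, 5, 6, 7, 8, 9, 11, 40]
summary = box_plot_summary(sample)
print(summary)
```

In this made-up sample the reading of 40 lies above the upper fence, so it is plotted as an outlier and the upper whisker stops at 11.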

The box plots in Figure 4.4 show the distribution patterns of the seven parameters that contribute to the WQI of the Kinta River.

Figure 4.4(a)

The box plot shows the distribution of DO for the LPS, MPS and HPS classes. The mean for LPS is the highest compared with MPS and HPS. A low DO concentration means the water is polluted, and a high DO concentration means the water is clean. Too much organic matter in a water body decreases the DO concentration, because the decomposition of organic matter by bacteria requires oxygen.

For NH3-NL, the maximum value for the HPS class is clearly higher than for MPS and LPS, and the mean is also highest for HPS. This distribution is expected, because locations labelled as high pollution sources have high concentrations of NH3-NL. This pollutant may come from the use of fertilizer in agricultural areas along the Kinta River. Another possible source is untreated wastewater discharged by industry.

Figure 4.4(b)

For zinc, the distribution across the three classes shows that the HPS class again has the highest maximum, mean and median. It also has 3 outliers that are unexpectedly higher than all other values. The LPS class has 2 outliers, but they are still within range. The mean for LPS is slightly higher than the mean for MPS. This may be due to a technical problem with the monitoring equipment or to changes in the weather. Zn is used as roofing for houses; on contact with smog it is dispersed into the atmosphere, which is one of the transport media by which pollutants enter the water.

The next box plot shows the distribution of chromium. There are 5 outliers across the three classes, the highest of which is in the HPS class. The mean is lowest for LPS and highest for HPS. Chromium usually comes from industrial sources.

Figure 4.4(c)

The distributions of arsenic (As) and calcium (Ca) are shown in the box plots above. Both distributions are normal, with the highest mean, median and maximum in the HPS class. There are no outliers for As, but Ca has a few outliers in the LPS and HPS classes, the most extreme of which is in the LPS region. Arsenic and calcium both occur naturally in soil and rock. Calcium largely determines water hardness: the higher the calcium concentration, the harder the water.

Figure 4.4(d)

The box plot above shows the distribution of the E-coli data. The mean and median are highest for HPS, followed by LPS, and lowest for MPS. The most extreme value is in the HPS class, but the distribution is still normal. Overall, the distributions of all parameters shown in the box plots are normal, with the highest mean for each parameter in the HPS class, except for DO, whose mean is highest in the LPS class. There are several extreme outliers for most parameters. These may be caused by changes in the weather, such as very hot or rainy days, or by malfunctioning monitoring equipment or other technical problems.

ANOVA Test

An ANOVA test was then run to examine the relationship among the significant parameters identified by backward stepwise DA in Figure 4.3. The results are shown in Figure 4.5.

| Source of Variation | SS | MS | F | P-value | F crit |
|---|---|---|---|---|---|
| Between Groups | 33280278852 | 5546713142 | 33.97917529 | 8.62E-37 | 2.109447 |
| Within Groups | 1.35978E+11 | 163238604.1 | | | |
| Total | 1.69258E+11 | | | | |

Figure 4.5

Null hypothesis (H0): there is no significant difference among the parameters.

Alternative hypothesis (H1): there is a significant difference among the parameters.

From the figure above, the p-value is 8.62E-37. Because the p-value is far below 0.05, the null hypothesis is rejected: there is a significant difference among the parameters identified by the backward stepwise mode. In addition, the F-value is greater than F crit, which strongly supports a significant relationship among the seven parameters (DO, NH3-NL, As, Cr, Zn, Ca and E-coli) in affecting the WQI of the Kinta River.
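The arithmetic behind the ANOVA table in Figure 4.5 can be retraced as follows. The SS, MS and F crit values are those reported in the figure; the degrees of freedom are not reported there, so the sketch infers them as SS divided by MS, which is an assumption made here for illustration.

```python
# Values as reported in Figure 4.5.
ss_between = 33280278852
ms_between = 5546713142
ss_within = 1.35978e11
ms_within = 163238604.1
f_crit = 2.109447

# Degrees of freedom inferred from MS = SS / df (not stated in the figure).
df_between = round(ss_between / ms_between)
df_within = round(ss_within / ms_within)

# The F statistic is the ratio of the two mean squares.
f_stat = ms_between / ms_within

print(df_between, df_within)
print(round(f_stat, 4), f_stat > f_crit)
```

The inferred between-groups degrees of freedom equal the number of groups minus one, which is consistent with seven parameters being compared, and the recomputed F ratio agrees with the reported value and exceeds F crit.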