Finding “outliers”, or exceptions, is useful in
detecting credit card fraud and telephone calling card fraud, analyzing the
performance statistics of professional athletes, exploring satellite and
medical images, and many other applications where patterns that are exceptions
may need special attention. Outliers may point to surprising or suspicious
activities. They are extreme or relatively extreme values or observations
(or a subset of values/observations) that appear to be inconsistent with the
remainder of the data set.
The volume of data is growing day
by day. For example, many companies already maintain data warehouses measured
in terabytes. Similarly, scientific data is reaching gigantic proportions. While scientists
have traditionally been able to deal with small datasets containing a very
small number of attributes, dataset size and the number of dimensions have proven
to be key obstacles to the analysis of data, especially when the data cannot
fit in memory. Implementing data mining techniques in high-performance
parallel and distributed environments is thus becoming crucial for ensuring system
scalability and interactivity as data continues to grow inexorably in size and
complexity. Similarly, mining outliers
from a huge database is, like other data mining tasks, a complex and
time-consuming problem. Parallel and distributed processing can significantly
reduce the total time required for the discovery process.
In this research, we aim to develop an efficient
distributed algorithm for detecting outliers in data residing on physically
distributed computing systems.