Mining Association Rules in Time Series Data

Discovering structure in multiple streams of data is an important problem. A stream is a sequence of values produced over time. In a time series, an event is an important occurrence, and what counts as an event depends on the goal of the time series data mining task. A dependency is an unexpectedly frequent or infrequent co-occurrence of events over time. It indicates that an event on one stream is related to events on other streams, even though those streams may appear to be independent of the first. Stock prices are a good example of such dependencies: the rise and fall of the prices of some stocks can drive the price of another stock up or down. If we analyze multiple streams of stock price time series and discover the dependencies among them, those dependencies can help us decide on a better time to buy stocks. The dependencies can be expressed as rules. In our case, these dependencies are called motion association rules. Strong dependencies capture structure in the streams because they indicate a relationship between their constituent patterns, that is, the occurrences of those patterns are not independent. The association rule discovery problem usually translates into finding all sets of patterns whose support satisfies a pre-specified minimum threshold, and then post-processing them to find the interesting rules. Such sets are called frequent. An association rule then predicts the occurrence of some other set of patterns with a certain degree of confidence.
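As a minimal sketch of the support and confidence measures mentioned above, the following Python fragment counts how often a set of events co-occurs across time windows. The event names (`A_up`, `B_up`, `C_down`) and the window data are hypothetical and chosen only for illustration; this is not the algorithm used in this research.

```python
def support(itemset, windows):
    """Fraction of time windows that contain every event in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for w in windows if itemset <= w)
    return hits / len(windows)


def confidence(antecedent, consequent, windows):
    """Estimate of P(consequent | antecedent) from the observed windows."""
    both = support(frozenset(antecedent) | frozenset(consequent), windows)
    ante = support(antecedent, windows)
    return both / ante if ante > 0 else 0.0


# Hypothetical data: each set lists the events seen in one time window.
windows = [
    frozenset({"A_up", "B_up"}),
    frozenset({"A_up", "B_up", "C_down"}),
    frozenset({"A_up"}),
    frozenset({"B_up", "C_down"}),
]

rule_support = support({"A_up", "B_up"}, windows)          # 2 of 4 windows
rule_confidence = confidence({"A_up"}, {"B_up"}, windows)  # 2 of the 3 windows with A_up
```

A rule such as "A_up implies B_up" would be kept only if both values exceed thresholds chosen by the analyst.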

 

Parallel Algorithms for Mining Association Rules in Time Series Data

Finding how current and past values in the streams of data are related to future values is quite effective. However, these high-dimensional data sets are enormous, which can result in a very large number of mined dependencies. This strongly motivates the need for efficient parallel algorithms. In this research, we introduced a parallel algorithm to discover dependencies from a large amount of motion data. For example, an association rule discovered from motion data about walking is “when the right hand is up, the left hand and left knee are down”. Motion data is multi-stream data of 3-D time series, so the amount of data is huge and expensive to process. We therefore introduce a method for extracting a sequence of symbols from the time series data by using segmentation and clustering. To reduce the search space and speed up the process, we investigate a technique for grouping the time series data. The experimental results obtained on a shared-memory multiprocessor system justify the use of parallel techniques for mining huge amounts of time series data. In our present research, we are concentrating on developing new algorithms that do not rely on the grouping technique, since it may be a constraint for other time series datasets.
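The idea of turning a time series into a sequence of symbols through segmentation and clustering can be sketched as follows. This is only an illustration under stated assumptions: fixed-width windows stand in for the segmentation step, a small one-dimensional k-means stands in for the clustering step, and the function names and sample series are hypothetical rather than taken from this research.

```python
def segment(series, width):
    """Cut the series into consecutive windows of `width` samples and
    describe each window by its net change (last value minus first)."""
    return [series[i + width - 1] - series[i]
            for i in range(0, len(series) - width + 1, width)]


def kmeans_1d(values, k, iters=20):
    """Very small 1-D k-means; requires k >= 2.
    Initial centers are spread evenly between min and max."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * j / (k - 1) for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # An empty cluster keeps its previous center.
        centers = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centers)]
    return centers


def symbolize(series, width, k):
    """Map each segment to the index of its nearest cluster center."""
    changes = segment(series, width)
    centers = kmeans_1d(changes, k)
    return [min(range(k), key=lambda j: abs(c - centers[j]))
            for c in changes]


# Hypothetical height of one body part: rising, falling, rising.
series = [0, 1, 2, 3, 3, 2, 1, 0, 0, 1, 2, 3]
symbols = symbolize(series, width=4, k=2)
```

In this example the symbol 0 corresponds to a falling segment and 1 to a rising one, so the stream becomes a short symbol sequence on which rules can then be mined.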

 

Parallel Mining of Sequential Patterns in Temporal Multi-Stream

In the above approach, the number of combinations must be specified before the discovery process starts. That is, the algorithm has to be told in advance the number of combinations from which rules can be discovered. From the data mining point of view, this is an important limitation, because data mining algorithms should not be designed for one specific system or one particular multi-stream problem; they have to be general for the problems in that domain. This motivated us to develop a new algorithm with the data mining goal in mind. In the new approach, we do not fix the number of combinations; the aim is to find rules for any number of combinations. Hence, the main problem is how to discover, in minimum time, all sequences of body parts that take part in any kind of motion performed by a human. To achieve the objective of parallel data mining, i.e. to speed up the discovery process, we introduce another parallel algorithm that mines sequences of body parts. Because the sequences of body parts that perform motions in the motion database can be long, determining the associations among those body parts is complex and very time-consuming. The lattice-based approach decomposes the original search space into smaller pieces, termed suffix-based classes, which can be processed independently in main memory, taking advantage of shared-memory multiprocessors (SMP). The decomposition is applied recursively within each parent class to produce even smaller classes at the next level. As future research, we aim to develop newer algorithms.
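The decomposition into suffix-based classes can be illustrated with a simplified sketch. It assumes that a pattern is a tuple of body-part symbols, that support is the number of database sequences containing the pattern as a (not necessarily contiguous) subsequence, and that each class is grown by prepending one symbol to its suffix. All names and data are hypothetical, and the thread pool only demonstrates that the classes are independent; in CPython it does not give a real speedup for this CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def is_subsequence(pattern, sequence):
    """True if `pattern` occurs in `sequence` in order, gaps allowed."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def count(pattern, database):
    """Number of database sequences that contain `pattern`."""
    return sum(is_subsequence(pattern, s) for s in database)


def mine_class(suffix, items, database, min_count):
    """Find all frequent patterns that end with `suffix` by prepending
    one item at a time; each frequent candidate defines a smaller
    child class that is mined recursively."""
    found = []
    for item in items:
        candidate = (item,) + suffix
        if count(candidate, database) >= min_count:
            found.append(candidate)
            found.extend(mine_class(candidate, items, database, min_count))
    return found


def mine(database, min_count, workers=2):
    """Mine every top-level suffix class independently and merge."""
    items = sorted({x for s in database for x in s})
    roots = [(i,) for i in items if count((i,), database) >= min_count]
    result = list(roots)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for part in pool.map(
                lambda r: mine_class(r, items, database, min_count), roots):
            result.extend(part)
    return sorted(result)


# Hypothetical motion sequences: order in which body parts move.
database = [
    ("right_hand", "left_hand", "left_knee"),
    ("right_hand", "left_knee"),
    ("left_hand", "left_knee"),
]
patterns = mine(database, min_count=2)
```

Because no class needs results from another class, the same structure could be handed to separate processors without synchronization during the search.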

 

Efficient Mining of Association Rules within Distributed Systems

Association rule mining in a large transactional database is an important problem in the field of knowledge discovery. When a database is partitioned among several shared-nothing machines, the problem can be addressed by a distributed data mining algorithm. Several such algorithms have been reported in the literature. They are able to discover a good number of rules from the datasets, but their main problem is that they do not scale well with the number of sites or locations. These algorithms are mainly concerned with reducing the number of database scans, which is necessary because the database is usually very large and is stored in secondary memory. Two main things therefore need to be considered: parallelizing disk I/O so that the parallel computations can proceed independently, and communicating among the parts of the database located on different machines. Parallelizing I/O is a comparatively easy task, as the above algorithms show; the main remaining difficulty is communication complexity. Our research mainly aims at developing an algorithm that is both communication-efficient and scalable. Furthermore, this research considers time series data for this purpose.
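To make the communication issue concrete, here is a minimal single-process sketch of the common count-exchange pattern: every site scans only its own partition, and only the resulting counts would have to travel between machines. It illustrates the general idea and is not the algorithm under development here; the function names and the tiny partitions are hypothetical.

```python
from collections import Counter
from itertools import combinations


def local_counts(partition, size):
    """Scan one site's transactions and count every itemset of `size`."""
    counts = Counter()
    for transaction in partition:
        for itemset in combinations(sorted(transaction), size):
            counts[itemset] += 1
    return counts


def global_frequent(partitions, size, min_support):
    """Merge per-site counts and keep itemsets whose global support
    reaches `min_support`. In a real deployment each call to
    local_counts would run on its own machine, and the Counter objects
    are the only data that would be communicated."""
    total = Counter()
    n_transactions = 0
    for part in partitions:
        total.update(local_counts(part, size))
        n_transactions += len(part)
    return {iset: c / n_transactions
            for iset, c in total.items()
            if c / n_transactions >= min_support}


# Hypothetical database split across two shared-nothing sites.
partitions = [
    [{"a", "b"}, {"a", "b", "c"}],
    [{"a", "b"}, {"b", "c"}],
]
frequent_pairs = global_frequent(partitions, size=2, min_support=0.5)
```

The amount of data exchanged grows with the number of candidate itemsets and the number of sites, which is why communication cost dominates as more sites are added.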

 

Efficient Parallel Programming Techniques in a CLUMPS for Mining of Time Series Data

The Message Passing Interface (MPI) has been widely used for the past several years as a parallel programming interface for clusters of workstations, commodity clusters, distributed computing systems, etc. However, communication overhead is a major factor, and it can limit the achievable speedup. On a cluster of SMPs (CLUMPS), programming with MPI alone may not retain its advantage. Recently, OpenMP, a newer API standard, has become popular due to its ease of use and efficiency in SMP programming. This research aims to develop new techniques that combine MPI and OpenMP for data mining. MPI can be used for communication among the SMPs in the CLUMPS, while OpenMP can be used for the computation within each SMP.