-
Béla Bollobás,
Gautam Das,
Dimitrios Gunopulos,
and Heikki Mannila.
Time-Series Similarity Problems and Well-Separated Geometric Sets.
In Proc. of ACM Symp. on Computational Geometry (SOCG),
1997.
Keywords:
Noise Handling,
Similarity Measures,
Sequential/Temporal Data.
| Abstract: |
Given a pair of nonidentical complex objects, defining (and determining) how similar they are to each other is a nontrivial problem. In data mining applications, one frequently needs to determine the similarity between two time series. We analyze a model of time-series similarity that allows outliers, and different scaling functions. We present deterministic and randomized algorithms for computing this notion of similarity. The algorithms are based on nontrivial tools and methods from computational geometry. In particular, we use properties of families of well-separated geometric sets. The randomized algorithm has provably good performance and also works extremely efficiently in practice. |
-
Sarah Boyd.
Detecting and Describing Patterns in Time-Varying Data Using Wavelets.
In IDA97,
LNCS 1280,
pages 585--596,
1997.
Keywords:
Wavelets,
Multiscale Analysis.
| Abstract: |
Reasoning effectively about time-varying data requires sophisticated pattern detection mechanisms. This paper describes techniques developed for detecting patterns in time-varying data with the ultimate aim of generating textual descriptions of the data. Preliminary experiments are described in which the visually significant features in weather data are extracted and compared against hand-written expert descriptions. |
-
Gautam Das,
Rudolf Fleischer,
Leszek Gasieniec,
Dimitrios Gunopulos,
and Juha Kärkkäinen.
Episode Matching.
In Proceedings of Conference on Combinatorial Pattern Matching,
1997.
Keywords:
Sequential/Temporal Data.
| Abstract: |
Given two words, text $T$ of length $n$ and episode $P$ of length $m$, the episode matching problem is to find all minimal length substrings of text $T$ that contain episode $P$ as a subsequence. The respective optimization problem is to find the smallest number $w$, such that text $T$ has a subword of length $w$ which contains episode $P$. \newline In this paper, we introduce a few efficient off-line as well as on-line algorithms for the entire problem, where by on-line algorithms we mean algorithms which search from left to right consecutive text symbols only once. We present two alphabet independent algorithms which work in time $O(nm)$. The off-line algorithm operates in $O(1)$ additional space while the on-line algorithm pays for its property with $O(m)$ additional space. Two other on-line algorithms have subquadratic time complexity. One of them works in time $O(nm/\log m)$ and $O(m)$ additional space. The other one gives a time/space trade-off, i.e., it works in time $O(n+s+nm\log\log s/\log(s/m))$ when additional space is limited to $O(s)$. \newline Finally we present two approximation algorithms for the optimization problem. The off-line algorithm is alphabet independent, it has superlinear time complexity $O(n/\epsilon + n \log\log (n/m))$ and it uses only constant space. The on-line algorithm works in time $O(n/\epsilon+n)$ and uses space $O(m)$. Both approximation algorithms achieve $1+\epsilon$ approximation ratio, for any $\epsilon>0$. |
-
Gautam Das,
Dimitrios Gunopulos,
and Heikki Mannila.
Finding Similar Time Series.
In Proceedings of the Conference on Principles of Knowledge Discovery and Data Mining,
1997.
Keywords:
Sampling,
Noise Handling,
Similarity Measures,
Sequential/Temporal Data.
| Abstract: |
Similarity of objects is one of the crucial concepts in several applications, including data mining. For complex objects, similarity is nontrivial to define. In this paper we present an intuitive model for measuring the similarity between two time series. The model outliers, different scaling functions, and variable sampling rates. Using methods from computational geometry, we show that this notion of similarity can be computed in polynomial time. Using statistical approximation techniques, the algorithms can be speeded up considerably. We give preliminary experimental results that show the naturalness of the notion. |
-
Rajesh N. Davé and Sumit Sen.
On Generalizing the Noise Clustering Algorithms.
In Proceedings of the 7th Fuzzy Systems Association World Congress (IFSA'97),
volume III,
pages 205--210,
1997.
Keywords:
Noise Handling,
Clustering,
Fuzzy c-Means.
| Abstract: |
In this paper, Dav\'es noise clustering (NC) algorithm is revisited. The original NC algorithm considered noise to be a separate class, and represented it by a prototype that has the same distance $\delta$, from all the data points. Although this concept has been successful in developing a class of NC algorithms to detect a variety of cluster shapes in noisy data, use of the same constant value of the noise distance $\delta$, for all the feature vectors in the data set makes it somewhat limited in its scope. By allowing $\delta$ to take different values for different feasture vectors, the algorithm can be generalized. It can also be shown that the membership generated by NC algorithm is a product of two terms, one is the original fuzzy c-means (FCM) membership, and the other is a generalized possibilistic membership. In this light, it is shown that the NC technique is a generalization of the possibilistic technique. |
-
David J. Hand.
Intelligent Data Analysis: Issues and Opportunities.
In IDA97,
number 1280 of LNCS,
pages 1--14,
1997.
| Abstract: |
[has no abstract] Reflections on term (un)intelligent data analysis, the change of data, problems and models over time, and the role of human and computers in solving the problems. |
-
T. Honkela,
S. Kaski,
K. Lagus,
and T. Kohonen.
WEBSOM -- Self-Organizing Maps of Document Collections.
In Proc. of Workshop on Self-Organizing Maps (WSOM), June 4-6,
pages 310--315,
1997.
[ URL
]
| Abstract: |
Searching for relevant text documents has traditionally been based on keywords and Boolean expressions of them. Often the search result show high recall and low precision, or vice versa. Considerable efforts have been made to develop alternative methods, but their practical applicability has been low. Powerful methods are needed for the exploration of miscellaneous document collections. The WEBSOM method organizes a document collection on a map display that provides an overview of the collection and facilitates interactive browsing. Interesting documents can be retrieved by a content addressable search of interesting map locations. The interesting locations could also be marked as filters for collecting interesting new documents. |
-
Eamonn J. Keogh and Padhraic Smyth.
A probabilistic approach to fast pattern matching in time series databases.
In KDD97,
pages 20--24,
1997.
Keywords:
Piecewise Linear Representations,
Similarity Measures,
Sequential/Temporal Data.
| Abstract: |
The problem of efficiently and accurately locating pattern of interest in massive time series data sets is an important and non-trivial problem in a wide variety of applications, including diagnosis and monitoring of complex systems, biomedical data analysis, and exploratory data analysis in scientific and business time series. In this paper a probabilistic approach is taken to this problem. Using piecewise linear segmentations as the underlying representation, local features (such as peaks, troughs, and plateaus) are defined using a prior distribution on expected defomrations from a basic template. Global shape information is represented using another prior on the relative locations of the individual features. An appropriately defined probabilistic model integrates the local and global information and directly leads to an overall distance measure between sequence patterns based on prior knowledge. A search algorithm using this distance measure is shown to efficiently and accurately find matches for a variety of patterns on a number of data sets, including engineering sensor data from space shuttle mission archives. The proposed approach provides a natural framework to support user-customizable ``query by content'' on time series data, taking prior domain information into account in a principled manner. |
-
Eamonn J. Keogh.
A Fast and Robust Method for Pattern Matching in Time Series Databases.
In Proceedings of 9th Int. Conf. on Tools with AI (TAI 97),
1997.
Keywords:
Noise Handling,
Piecewise Linear Representations,
Similarity Measures,
Sequential/Temporal Data.
| Abstract: |
The problem of finding patterns of interest in time series databases (query by content) is an importatn one, with applications in virtually every field of science. A variety of approaches have been suggested. These approaches are robust to noise, offset translation, and amplitude scaling to varying degrees. However, they are all extremely sensitive to scaling in the time axis (longitudinal scaling). We present a method for similarity search that is robust to scaling in the time axis, in addition to noise, offset translation, and amplitude scaling. The method has been tested on medical, financial, space telemetry and artificial data. Furthermore, the method is exceptionally fast, with the predicted 2 to 4 orders of magnitude speedup actually observed. The method uses a piecewise linear representation of the orginal data. We also introduce a new algorithm which both decides the optimal number of linear segments to use, and produces the actual linear representation. |
-
Paul R. Kersten.
Implementation Issues in the Fuzzy c-Medians Algorithm.
In Proceedings of the 6th International Conference on Fuzzy Systems,
Barcelona, Spain,
pages 957--962,
1997.
Keywords:
Noise Handling,
Clustering,
Fuzzy c-Means.
| Abstract: |
The fuzzy c-Median (FcMED) clustering algorithm is an alternating optimization (AO) method of solving the fuzzy c-Means (FCM) clustering algorithm using the $L_1$ norm. This algorithm is more resistant to outliers than the FCM-AO algorithm using the $L_2$ norm. The robustness of the FCMED does not come for free, since the fuzz ymedian is the cluster-centering statistic and exact evaluation of the fuzzy median usually involves ordering the sample values. The efficiency of calculating the fuzzy median is an important implementation issue. Two other evaluation methods are considered for the fuzzy median: The first is the remedian, which statisticians use to simplify the estimation of the median. A fuzzy remedian is defined and used to approximate the fuzzy median. The second method finds the root of the derivative of the functional equation defining the fuzzy median. Both approaches are described and illustrated in this paper. |
-
Frank Klawonn and Erich-Peter Klement.
Mathematical Analysis of Fuzzy Classifiers.
In X. Liu,
P. Cohen,
and M. Berthold, editors,
IDA97,
Berlin,
pages 359--370,
1997.
Keywords:
Classification.
| Abstract: |
We examine the principle capabilities and limits of fuzzy classifiers that are based on a finite set of fuzzy if-then rules like they are used for fuzzy controllers, except that the conclusion of a rule specifies a discrete class instead of a (fuzzy) real output value. Our results show that in the two-dimensional case, for classification problem whose solutions can only be solved approximately by crisp classification rules, very simple fuzzy rules provide an exact solution. However, in the multi-dimensional case, even for linear separable problems, max-min rules are not sufficient. |
-
R. Lowen and W. Peeters.
On various classes of semi-pseudometrics used in pattern recognition.
In IFSA97,
Prague,
pages 232--237,
1997.
| Abstract: |
A general framework of using semi-pseudometrics instead of pseudometrics for comparing distances between fuzzy sets [...] will be described. We will consider various kinds of semi-pseudometrics, depending on a range of parameters and external conditions, and summarize the relationships between them. |
-
R. J. Miller and Y. Yang.
Association Rules over Interval Data.
In MD97,
Tucson, Arizona, USA,
pages 452--461,
1997.
Keywords:
Association Rules.
| Abstract: |
We consider the problem of mining association rules over interval data (that is, ordered data for which the separation between data points has meaning). We show that the measures of what rules are most important (also called rule interest) that are used for mining nominal and ordinal data do not capture the semantics of interval data. In the presence of interval data, support and confidence are no longer intuitive measures of the interest of a rule. We propose a new definition of interest for association rules that takes into account the semantics of interval data. We developed an algorithm for mining association rules under the new definition and overview our experience using the algorithm on large real-life datasets. |
-
Detlef Nauck and Rudolf Kruse.
Neuro-Fuzzy Systems for Function Approximation.
In 4th International Workshop Fuzzy-Neuro Systems 97,
Soest,
1997.
Keywords:
Classification.
| Abstract: |
We propose a neuro-fuzzy architecture for function approximation based on supervised learning. The learning algorithm is able to determine the structure and the parameters of a fuzzy system. The approach is an extension to our already published NEFCON and NEFCLASS models which are used for control or classification purposes. The proposed extended model, which we call NEFPROX, is more general and can be used for any application based on function approximation. |
-
Nikhil R. Pal,
Kuhu Pal,
and James C. Bezdek.
A Mixed c-Means Clustering Model.
In FUZZIEEE97,
pages 11-21,
1997.
Keywords:
Noise Handling,
Clustering,
Fuzzy c-Means.
| Abstract: |
We justify the need for computing both membership and typicalit values when clustering unlabeled data. Then we propose a new model called fuzzy-possibilistic c-means (FCM/PCM) models, FPCM simultaneously produces both memberships and possibilities, along with the usual point prototypes or cluster centers for each cluster We show that FPCM solves the noise sensitivity defect of FCM, and also overcomes the coincident clusters problem of PCM. Then we derive first order necessary conditions for extrema of the PFCM objective function, and use them as the basis for a standard alternating optimization approach to finding local minima. Three numerical examples are given that compare FCM to FPCM. Our calculations show that FPCM compares favorably to FCM. |
-
Lars Pickert,
Frank Klawonn,
and Edgar Wingender.
Fuzzy Cluster Analysis for Identification of Gene Regulation Regions.
In IFSA97,
Academia, Prague,
pages 56--61,
1997.
Keywords:
Clustering,
Cluster Validity Measures,
Fuzzy Clustering,
Fuzzy c-Means.
| Abstract: |
The main approach of this work is the implementation of a cluster analysis program for identification of regulatory regions in genomes. These regions are important parts of the genetic pool in higher developed organisms. They are composed of several basic elements, so called transcription factor sites, which can be identified by special analysis tools more or less vaguely. The program we have developed is able to search for two-dimensional clusters in the results of such analysis tools to give hints on gene regulatory regions. For this purpose, two fuzzy clustering algorithms have been implemented: The fuzzy c-means (FCM) and the Gath and Geva fuzzy clustering algorithm (GG) with two conventional cluster validity methods and one which has been developed especially for this application. All results of the cluster analysis program can be visualized and documented automatically. |
-
Sani Susanto,
R. D. Kennedy,
and J. W. H. dan Price.
A preliminary study of a fuzzy clustering and assignment problem-based cell formation algorithm.
In Proc. of Int. Conf. on Manufacturing Automation,
Hong Kong,
pages 95--104,
1997.
[ PDF
]
Keywords:
Clustering,
Fuzzy Clustering.
|