Extraction of Tag Tree Patterns with Contractible Variables from Irregular Semistructured Data
Abstract
Information extraction from semistructured data is becoming increasingly important. To extract meaningful or interesting content from semistructured data, we need to extract the structured patterns common to the data. Much semistructured data contains irregularities such as missing or erroneous data. A tag tree pattern is an edge-labeled tree with ordered children whose structure consists of tags and structured variables. An edge label is a tag, a keyword, or a wildcard, and a variable can be replaced by an arbitrary tree. In particular, a contractible variable matches any subtree, including a singleton vertex. A tag tree pattern is therefore well suited to representing tree-structured patterns common to irregular semistructured data. We present a new method for extracting characteristic tag tree patterns from irregular semistructured data, using an algorithm that finds a least generalized tag tree pattern explaining the given data. We report experiments in which this method is applied to extract characteristic tag tree patterns from irregular semistructured data.
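To make the role of a contractible variable concrete, the following is a minimal Python sketch of matching a pattern against an ordered, edge-labeled tree. It is an illustration under simplifying assumptions, not the matching or generalization algorithm of the paper: variables are assumed to occur only at leaves of the pattern, a variable is assumed to absorb one or more consecutive sibling edges together with their subtrees, a contractible variable may additionally absorb nothing (the singleton-vertex case), and keyword labels are not modeled. The names `Node`, `WILDCARD`, `VAR`, `CVAR`, and `matches` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical label conventions, chosen only for this sketch.
WILDCARD = "?"    # matches any single tag
VAR = "$x"        # variable: must be replaced by a tree with at least one edge
CVAR = "$x^c"     # contractible variable: may also be replaced by a single vertex


@dataclass
class Node:
    # Ordered list of (edge_label, child) pairs.
    edges: List[Tuple[str, "Node"]] = field(default_factory=list)


def matches(pattern: Node, data: Node) -> bool:
    """Return True if the pattern matches the data tree rooted at `data`."""
    return _match_edges(pattern.edges, data.edges)


def _match_edges(p, d) -> bool:
    # All pattern edges consumed: succeed only if no data edges remain.
    if not p:
        return not d
    label, child = p[0]
    if label in (VAR, CVAR):
        # Assumption: the variable absorbs k consecutive sibling edges,
        # with k >= 1 for VAR and k >= 0 for CVAR.
        least = 0 if label == CVAR else 1
        return any(_match_edges(p[1:], d[k:]) for k in range(least, len(d) + 1))
    if not d:
        return False
    d_label, d_child = d[0]
    if label != WILDCARD and label != d_label:
        return False
    # Tags agree (or wildcard): match the subtrees, then the remaining siblings.
    return matches(child, d_child) and _match_edges(p[1:], d[1:])


# One record with a price field and one where the field is missing.
full = Node([("item", Node([("name", Node()), ("price", Node())]))])
partial = Node([("item", Node([("name", Node())]))])

pattern_c = Node([("item", Node([("name", Node()), (CVAR, Node())]))])
pattern_v = Node([("item", Node([("name", Node()), (VAR, Node())]))])
```

Under these assumptions, `pattern_c` matches both `full` and `partial`, while `pattern_v` matches only `full`. This is the sense in which a contractible variable tolerates missing data in irregular records.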
Author information
Authors and Affiliations
- Faculty of Information Sciences, Hiroshima City University, Hiroshima, 731-3194, Japan
  Tetsuhiro Miyahara, Tomoyuki Uchida, Kenichi Takahashi & Hiroaki Ueda
- Department of Informatics, Kyushu University, Kasuga, 816-8580, Japan
  Yusuke Suzuki & Takayoshi Shoudai
- Computing and Communications Center, Kyushu University, Fukuoka, 812-8581, Japan
  Sachio Hirokawa
Authors
- Tetsuhiro Miyahara
- Yusuke Suzuki
- Takayoshi Shoudai
- Tomoyuki Uchida
- Sachio Hirokawa
- Kenichi Takahashi
- Hiroaki Ueda
Editor information
Editors and Affiliations
- Computer Science Department, Korea Advanced Institute of Science and Technology, 373-1 Koo-Sung Dong, Yoo-Sung Ku, Daejeon, 305-701, Korea
  Kyu-Young Whang
- Department of Statistics, Seoul National University, Sillimdong Kwanakgu, Seoul, 151-742, Korea
  Jongwoo Jeon
- School of Electrical Engineering and Computer Science, Seoul National University, Kwanak P.O. Box 34, Seoul, 151-742, Korea
  Kyuseok Shim
- Department of Computer Science and Engineering, University of Minnesota, 200 Union St SE, Minneapolis, MN, 55455, USA
  Jaideep Srivastava
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Miyahara, T. et al. (2003). Extraction of Tag Tree Patterns with Contractible Variables from Irregular Semistructured Data. In: Whang, KY., Jeon, J., Shim, K., Srivastava, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2003. Lecture Notes in Computer Science, vol 2637. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36175-8_43
- DOI: https://doi.org/10.1007/3-540-36175-8_43
- Published: 30 April 2003
- Publisher Name: Springer, Berlin, Heidelberg
- Print ISBN: 978-3-540-04760-5
- Online ISBN: 978-3-540-36175-6
- eBook Packages: Springer Book Archive