Extraction of Tag Tree Patterns with Contractible Variables from Irregular Semistructured Data

Abstract

Information extraction from semistructured data is becoming increasingly important. To extract meaningful or interesting content from semistructured data, we need to extract common structured patterns from it. Much semistructured data has irregularities such as missing or erroneous data. A tag tree pattern is an edge-labeled tree with ordered children whose structure consists of tags and structured variables. An edge label is a tag, a keyword, or a wildcard, and a variable can be substituted by an arbitrary tree. In particular, a contractible variable matches any subtree, including a singleton vertex. A tag tree pattern is therefore well suited to representing common tree-structured patterns in irregular semistructured data. We present a new method for extracting characteristic tag tree patterns from irregular semistructured data by using an algorithm for finding a least generalized tag tree pattern that explains the given data. We report experiments in which this method is applied to extracting characteristic tag tree patterns from irregular semistructured data.
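
To make the role of contractible variables concrete, the following is a minimal Python sketch of a toy matcher. It is not the algorithm of the paper. It assumes a simplified model in which variables occur only at leaf positions, an ordinary variable stands for one or more consecutive child subtrees, and a contractible variable may also stand for none (the singleton-vertex case). The tuple encoding, the kind names `"tag"`, `"any"`, `"var"`, `"cvar"`, and the function name `matches` are all hypothetical choices made for illustration.

```python
# Toy model (an assumption, not the paper's formal definition):
#   data node    = ordered list of (edge_label, child_node) pairs
#   pattern node = ordered list of (kind, label, child_pattern) triples
#   kind is "tag" (exact label), "any" (wildcard label),
#   "var" (one or more child subtrees) or "cvar" (zero or more).

def matches(pattern, data):
    """Return True if the pattern node matches the data node."""

    def go(i, j):
        # i indexes pattern edges, j indexes data edges.
        if i == len(pattern):
            return j == len(data)
        kind, label, sub = pattern[i]
        if kind in ("var", "cvar"):
            # A contractible variable may consume nothing at all;
            # an ordinary variable must consume at least one edge.
            start = j if kind == "cvar" else j + 1
            return any(go(i + 1, k) for k in range(start, len(data) + 1))
        if j == len(data):
            return False
        data_label, data_child = data[j]
        if kind == "tag" and label != data_label:
            return False
        # "any" accepts every label; recurse into the child either way.
        return matches(sub, data_child) and go(i + 1, j + 1)

    return go(0, 0)


# Two records, the second one irregular (fields missing).
full_record = [("name", []), ("phone", []), ("email", [])]
sparse_record = [("name", [])]

# Pattern: a "name" edge followed by a contractible variable.
pattern = [("tag", "name", []), ("cvar", None, [])]
# Same pattern with an ordinary (non-contractible) variable.
strict = [("tag", "name", []), ("var", None, [])]
```

In this toy model, `pattern` matches both records, while `strict` rejects `sparse_record` because its variable has nothing to consume. The backtracking search is exponential in the worst case and is meant only to illustrate the matching semantics.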

Author information

Authors and Affiliations

  1. Faculty of Information Sciences, Hiroshima City University, Hiroshima, 731-3194, Japan
    Tetsuhiro Miyahara, Tomoyuki Uchida, Kenichi Takahashi & Hiroaki Ueda
  2. Department of Informatics, Kyushu University, Kasuga, 816-8580, Japan
    Yusuke Suzuki & Takayoshi Shoudai
  3. Computing and Communications Center, Kyushu University, Fukuoka, 812-8581, Japan
    Sachio Hirokawa

Authors

  1. Tetsuhiro Miyahara
  2. Yusuke Suzuki
  3. Takayoshi Shoudai
  4. Tomoyuki Uchida
  5. Sachio Hirokawa
  6. Kenichi Takahashi
  7. Hiroaki Ueda

Editor information

Editors and Affiliations

  1. Computer Science Department, Korea Advanced Institute of Science and Technology, 373-1 Koo-Sung Dong, Yoo-Sung Ku, Daejeon, 305-701, Korea
    Kyu-Young Whang
  2. Department of Statistics, Seoul National University, Sillimdong Kwanakgu, Seoul, 151-742, Korea
    Jongwoo Jeon
  3. School of Electrical Engineering and Computer Science, Seoul National University, Kwanak P.O. Box 34, Seoul, 151-742, Korea
    Kyuseok Shim
  4. Department of Computer Science and Engineering, University of Minnesota, 200 Union St SE, Minneapolis, MN, 55455, USA
    Jaideep Srivastava

Rights and permissions

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Miyahara, T. et al. (2003). Extraction of Tag Tree Patterns with Contractible Variables from Irregular Semistructured Data. In: Whang, KY., Jeon, J., Shim, K., Srivastava, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2003. Lecture Notes in Computer Science, vol 2637. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36175-8_43
