GitHub - lexborisov/myhtml: Fast C/C++ HTML 5 Parser. Using threads.

MyHTML — a pure C HTML parser

MyHTML is a fast, multithreaded HTML parser implemented as a pure C99 library with no external dependencies.

Now

Important announcement!

Please use the HTML parser from the Lexbor project. It is stable, has more features, and — yes — it's very fast.

Features

Changes

See the CHANGELOG.md file.

Further developments

Supported encodings for InputStream

X_USER_DEFINED, UTF_8, UTF_16LE, UTF_16BE, BIG5, EUC_KR, GB18030,
IBM866, ISO_8859_10, ISO_8859_13, ISO_8859_14, ISO_8859_15, ISO_8859_16, ISO_8859_2, ISO_8859_3,
ISO_8859_4, ISO_8859_5, ISO_8859_6, ISO_8859_7, ISO_8859_8, KOI8_R, KOI8_U, MACINTOSH,
WINDOWS_1250, WINDOWS_1251, WINDOWS_1252, WINDOWS_1253, WINDOWS_1254, WINDOWS_1255, WINDOWS_1256,
WINDOWS_1257, WINDOWS_1258, WINDOWS_874, X_MAC_CYRILLIC, ISO_2022_JP, GBK, SHIFT_JIS, EUC_JP, ISO_8859_8_I

Supported encodings for output

The library works in UTF-8 internally and returns all output in UTF-8.
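
To make concrete what UTF-8 output means for non-UTF-8 input, here is a minimal sketch of encoding one Unicode code point as UTF-8. It is an illustration only, not MyHTML's code: the name `sketch_utf8_encode` is hypothetical and is not part of the MyHTML API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration; NOT part of the MyHTML API.
 * Encodes one Unicode code point as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 for an invalid code point. */
size_t sketch_utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp <= 0x7F) {                       /* 1 byte: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                     /* 3 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* beyond the Unicode range */
}
```

For example, the windows-1251 byte 0xC0 is the Cyrillic letter U+0410, and the sketch encodes that code point as the two bytes 0xD0 0x90.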

Detecting character encodings

Detection currently covers UTF-8, UTF-16LE, UTF-16BE, and the Russian encodings windows-1251, koi8-r, iso-8859-5, x-mac-cyrillic, and ibm866.
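
For the Unicode encodings, one simple detection signal is a leading byte-order mark (BOM). The sketch below checks only for a BOM. It is a simplified illustration, not MyHTML's detection code, and it does not cover the Russian encodings. The `sketch_` names are hypothetical and are not part of the MyHTML API.

```c
#include <stddef.h>

/* Hypothetical names for illustration; NOT part of the MyHTML API. */
typedef enum {
    SKETCH_ENC_UNKNOWN = 0,
    SKETCH_ENC_UTF_8,
    SKETCH_ENC_UTF_16LE,
    SKETCH_ENC_UTF_16BE
} sketch_encoding_t;

/* Returns the encoding indicated by a leading byte-order mark,
 * or SKETCH_ENC_UNKNOWN when the buffer does not start with one. */
sketch_encoding_t sketch_detect_bom(const unsigned char *data, size_t length)
{
    /* UTF-8 BOM: EF BB BF */
    if (length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return SKETCH_ENC_UTF_8;

    /* UTF-16 little-endian BOM: FF FE */
    if (length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return SKETCH_ENC_UTF_16LE;

    /* UTF-16 big-endian BOM: FE FF */
    if (length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return SKETCH_ENC_UTF_16BE;

    return SKETCH_ENC_UNKNOWN;
}
```

Input without a BOM returns `SKETCH_ENC_UNKNOWN` here; a real detector would then need content-based heuristics.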

Installation

See INSTALL.md

Introduction

Benchmark

Dependencies

None

External Bindings and Wrappers

Examples

See the examples directory.

Simple example

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #include <myhtml/api.h>

    int main(int argc, const char * argv[])
    {
        char html[] = "<div><span>HTML</span></div>";

        // basic init
        myhtml_t* myhtml = myhtml_create();
        myhtml_init(myhtml, MyHTML_OPTIONS_DEFAULT, 1, 0);

        // first tree init
        myhtml_tree_t* tree = myhtml_tree_create();
        myhtml_tree_init(tree, myhtml);

        // parse html
        myhtml_parse(tree, MyENCODING_UTF_8, html, strlen(html));

        // print result
        // or see serialization function with callback: myhtml_serialization_tree_callback
        mycore_string_raw_t str = {0};
        myhtml_serialization_tree_buffer(myhtml_tree_get_document(tree), &str);
        printf("%s\n", str.data);

        // release resources
        mycore_string_raw_destroy(&str, false);
        myhtml_tree_destroy(tree);
        myhtml_destroy(myhtml);

        return 0;
    }

AUTHOR

Alexander Borisov <lex.borisov@gmail.com>

Copyright (C) 2015-2018 Alexander Borisov

This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this library; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA

See the LICENSE file.