Training content extraction models

  1. Download the training data (see above). In what follows, ROOTDIR contains the root of the dragnet_data repo, or another directory with a similar structure (HTML and Corrected sub-directories).
  2. Create the block corrected files needed to do supervised learning on the block level. First make a sub-directory $ROOTDIR/block_corrected/ for the output files, then run:

    from dragnet.data_processing import extract_all_gold_standard_data

    rootdir = '/path/to/dragnet_data/'
    extract_all_gold_standard_data(rootdir)

    This solves the longest common sub-sequence problem to determine which blocks were extracted in the gold standard. Occasionally this will fail if lxml (libxml2) cannot parse an HTML document. In this case, remove the offending document and restart the process.
  3. Use k-fold cross validation in the training set to do model selection and set any hyperparameters. Make decisions about the following:
    • Whether to use just article content or content and comments.
    • The features to use.
    • The machine learning model to use.
    For example, to train the extremely randomized trees classifier (ExtraTreesClassifier) from sklearn using the shallow text features from Kohlschuetter et al., the CETR features from Weninger et al., and the 'readability' feature set:

    from dragnet.extractor import Extractor
    from dragnet.model_training import train_model
    from sklearn.ensemble import ExtraTreesClassifier

    rootdir = '/path/to/dragnet_data/'

    features = ['kohlschuetter', 'weninger', 'readability']
    to_extract = ['content', 'comments']  # or ['content']

    model = ExtraTreesClassifier(
        n_estimators=10,
        max_features=None,
        min_samples_leaf=75
    )
    base_extractor = Extractor(
        features=features,
        to_extract=to_extract,
        model=model
    )
    extractor = train_model(base_extractor, rootdir)

    This trains the model and, if a value is passed to output_dir, writes a pickled version of it along with some block-level classification errors to a file in the specified output_dir. If no output_dir is specified, the block-level performance is printed to stdout.
  4. Once you have decided on a final model, train it on the entire training data using dragnet.model_training.train_model.
  5. As a last step, test the performance of the model on the test set (see below).
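
To illustrate the longest common sub-sequence idea in step 2, here is a minimal, standard-library sketch. It is not dragnet's implementation: the function names, the whitespace tokenization, and the 0.5 threshold are all assumptions made for illustration. It flags the page tokens that take part in a longest common sub-sequence with the gold standard text, then labels a block as content when enough of its tokens are flagged.

```python
def lcs_table(a, b):
    """Return a table where table[i][j] is the LCS length of a[i:] and b[j:]."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def matched_flags(a, b):
    """Flag each element of a that takes part in one LCS of a and b."""
    table = lcs_table(a, b)
    flags = [False] * len(a)
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            flags[i] = True
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1  # skipping a[i] keeps the LCS length
        else:
            j += 1  # skipping b[j] keeps the LCS length
    return flags


def label_blocks(blocks, gold_tokens, threshold=0.5):
    """Label a block 1 if at least `threshold` of its tokens were matched.

    `threshold` is a hypothetical cutoff chosen for this sketch.
    """
    page_tokens = [tok for block in blocks for tok in block]
    flags = matched_flags(page_tokens, gold_tokens)
    labels, start = [], 0
    for block in blocks:
        hits = sum(flags[start:start + len(block)])
        labels.append(1 if block and hits / len(block) >= threshold else 0)
        start += len(block)
    return labels


# Toy page: navigation, article text, footer (made-up example data).
blocks = [
    "home about contact".split(),
    "the quick brown fox jumps".split(),
    "copyright 2014 example".split(),
]
gold = "the quick brown fox jumps".split()
labels = label_blocks(blocks, gold)
```

In this toy example only the middle block shares tokens with the gold standard, so it is the only one labeled as content.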
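
The k-fold cross validation in step 3 can be sketched with the standard library alone. This is a generic outline, not part of dragnet: k_fold_indices, cross_validate, and the fit_and_score callback are hypothetical names. In practice the callback would fit an Extractor on the training documents and return a score such as F1 on the validation documents.

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Yield (train, validation) index lists for k roughly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def cross_validate(n_items, k, fit_and_score):
    """Average the score returned by fit_and_score over the k folds.

    fit_and_score(train, validation) is a hypothetical callback that
    trains on the `train` indices and scores on the `validation` indices.
    """
    scores = [fit_and_score(train, val)
              for train, val in k_fold_indices(n_items, k)]
    return sum(scores) / len(scores)


# Split 10 documents into 5 folds.
splits = list(k_fold_indices(10, 5))

# Placeholder scorer so the sketch runs end to end; it only reports the
# fraction of documents held out, not a real model score.
mean_score = cross_validate(10, 5, lambda tr, va: len(va) / (len(tr) + len(va)))
```

Running cross_validate once per candidate configuration (feature set, to_extract choice, model hyperparameters) and keeping the best average score is one way to make the decisions listed in step 3.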

Evaluating content extraction models

Use evaluate_model_predictions in model_training to compute the token-level accuracy, precision, recall, and F1. For example, to evaluate a trained model, run:

from dragnet.compat import train_test_split
from dragnet.data_processing import prepare_all_data
from dragnet.model_training import evaluate_model_predictions

# `extractor` is the Extractor instance built in the training section above.
rootdir = '/path/to/dragnet_data/'
data = prepare_all_data(rootdir)
# Hold out 20% of the documents for testing.
training_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

test_html, test_labels, test_weights = extractor.get_html_labels_weights(test_data)
train_html, train_labels, train_weights = extractor.get_html_labels_weights(training_data)

# Fit on the training split only, then score the held-out documents.
extractor.fit(train_html, train_labels, weights=train_weights)
predictions = extractor.predict(test_html)
scores = evaluate_model_predictions(test_labels, predictions, test_weights)

Note that this is the same evaluation that train_model runs and prints.
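
To make the token-level metrics concrete, here is a minimal sketch of how block-level predictions can be turned into token-level scores. It assumes the weights are per-block token counts, so a misclassified long block costs more than a misclassified short one. That reading of the weights is an assumption, and weighted_scores is an illustrative name, not a dragnet function.

```python
def weighted_scores(y_true, y_pred, weights):
    """Accuracy, precision, recall and F1 with each block weighted.

    Each block contributes its weight (assumed here to be its token
    count) to the confusion-matrix cell it falls into.
    """
    tp = fp = fn = tn = 0.0
    for truth, pred, w in zip(y_true, y_pred, weights):
        if pred == 1 and truth == 1:
            tp += w
        elif pred == 1:
            fp += w
        elif truth == 1:
            fn += w
        else:
            tn += w
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up example: four blocks with their token counts.
labels = [1, 1, 0, 0]
predictions = [1, 0, 1, 0]
token_counts = [30, 10, 5, 15]
scores = weighted_scores(labels, predictions, token_counts)
```

Here the correctly classified 30-token block dominates the result, which is why the weighted scores differ from the unweighted block-level ones.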