1. Design Decisions
1.1. Checking of external links postponed
In the current {revision}, external links are not checked. These checks have been postponed to a later version.
1.2. HTML Parsing with jsoup
To check HTML, we parse it into an internal (DOM-like) representation. For this task we use the jsoup HTML parser, an open-source parser without external dependencies.
To quote from the jsoup website:
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
- Goals of this decision
  - Check HTML programmatically using an existing API that provides accessor and finder methods for the DOM tree of the file(s) to be checked.
- Decision criteria
  - Few dependencies, so that the HtmlSC binary stays as small as possible.
  - Accessor and finder methods to find images, links and link targets within the DOM tree.
- Alternatives
  - HttpUnit: a testing framework for web applications and websites. Its main focus is web testing, and it has a large number of dependencies.
  - jsoup: a plain HTML parser without any dependencies (!) and with a rich API to access all HTML elements in DOM-like syntax.
Find details on how HtmlSC implements HTML parsing in the HTML encapsulation concept.
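To illustrate the kind of accessor and finder methods meant by the decision criteria, here is a minimal sketch that uses jsoup's CSS-style selectors to collect image sources, link references and link targets. It assumes jsoup is on the classpath. The class and method names (`JsoupFinderSketch`, `imageSources`, `linkReferences`, `linkTargets`) are hypothetical and are not taken from the HtmlSC sources.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HtmlSC: shows jsoup finder methods only.
final class JsoupFinderSketch {

    // Parse an HTML string into jsoup's DOM-like Document.
    static Document parse(String html) {
        return Jsoup.parse(html);
    }

    // Values of the src attribute of all <img> elements that have one.
    static List<String> imageSources(Document doc) {
        return attributeValues(doc, "img[src]", "src");
    }

    // Values of the href attribute of all <a> elements that have one.
    static List<String> linkReferences(Document doc) {
        return attributeValues(doc, "a[href]", "href");
    }

    // Ids of all elements that carry an id, i.e. possible targets of "#..." links.
    static List<String> linkTargets(Document doc) {
        return attributeValues(doc, "[id]", "id");
    }

    // Select elements with a CSS query and collect one attribute from each.
    private static List<String> attributeValues(Document doc, String cssQuery, String attribute) {
        List<String> values = new ArrayList<>();
        for (Element element : doc.select(cssQuery)) {
            values.add(element.attr(attribute));
        }
        return values;
    }
}
```

A checker built on such helpers could, for example, compare every local `#...` reference returned by `linkReferences` against the ids returned by `linkTargets`. How HtmlSC actually does this is described in the HTML encapsulation concept, not here.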
1.3. String Similarity Checking with Jaro-Winkler-Distance
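The small java-string-similarity library (by Ralph Allen Rice) contains implementations of several similarity-calculation algorithms. As it is not available as a public binary, we use its sources instead, primarily net.ricecode.similarity.JaroWinklerStrategy and its test, net.ricecode.similarity.JaroWinklerStrategyTest.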
Note: The actual implementation of the similarity comparison has been postponed to a later release of HtmlSC.
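Since the comparison itself is deferred, the following is only a sketch of how a Jaro-Winkler similarity score can be computed in plain Java. It is not the code of the java-string-similarity library. The class name `JaroWinklerSketch`, its method names, and the constants (prefix scale 0.1, at most 4 prefix characters, the commonly used defaults) are assumptions made for illustration.

```java
// Hypothetical sketch of Jaro-Winkler similarity; scores range from 0.0 to 1.0.
final class JaroWinklerSketch {

    private static final double PREFIX_SCALE = 0.1; // commonly used default
    private static final int MAX_PREFIX = 4;        // commonly used default

    // Plain Jaro similarity of two strings.
    static double jaro(String a, String b) {
        if (a.equals(b)) {
            return 1.0;
        }
        int lenA = a.length();
        int lenB = b.length();
        if (lenA == 0 || lenB == 0) {
            return 0.0;
        }
        // Characters match only if they are equal and not farther apart than this window.
        int window = Math.max(0, Math.max(lenA, lenB) / 2 - 1);
        boolean[] matchedA = new boolean[lenA];
        boolean[] matchedB = new boolean[lenB];
        int matches = 0;
        for (int i = 0; i < lenA; i++) {
            int from = Math.max(0, i - window);
            int to = Math.min(lenB - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (!matchedB[j] && a.charAt(i) == b.charAt(j)) {
                    matchedA[i] = true;
                    matchedB[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) {
            return 0.0;
        }
        // Count matched characters that appear in a different order in the two strings.
        int outOfOrder = 0;
        int j = 0;
        for (int i = 0; i < lenA; i++) {
            if (!matchedA[i]) {
                continue;
            }
            while (!matchedB[j]) {
                j++;
            }
            if (a.charAt(i) != b.charAt(j)) {
                outOfOrder++;
            }
            j++;
        }
        double m = matches;
        double transpositions = outOfOrder / 2.0;
        return (m / lenA + m / lenB + (m - transpositions) / m) / 3.0;
    }

    // Jaro-Winkler: boosts the Jaro score for strings sharing a common prefix.
    static double similarity(String a, String b) {
        double jaro = jaro(a, b);
        int limit = Math.min(MAX_PREFIX, Math.min(a.length(), b.length()));
        int prefix = 0;
        while (prefix < limit && a.charAt(prefix) == b.charAt(prefix)) {
            prefix++;
        }
        return jaro + prefix * PREFIX_SCALE * (1.0 - jaro);
    }
}
```

With this sketch, `JaroWinklerSketch.similarity("MARTHA", "MARHTA")` evaluates to roughly 0.961, identical strings score 1.0, and strings without any common characters score 0.0.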