Package org.jsoup
Class Jsoup
java.lang.Object
org.jsoup.Jsoup
public class Jsoup extends Object
The core public access point to the jsoup functionality.
- Author:
- Jonathan Hedley
-
Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| static String | clean(String bodyHtml, String baseUri, Whitelist whitelist) | Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes. |
| static String | clean(String bodyHtml, String baseUri, Whitelist whitelist, Document.OutputSettings outputSettings) | Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes. |
| static String | clean(String bodyHtml, Whitelist whitelist) | Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes. |
| static Connection | connect(String url) | Creates a new Connection to a URL. |
| static boolean | isValid(String bodyHtml, Whitelist whitelist) | Test if the input body HTML has only tags and attributes allowed by the Whitelist. |
| static Document | parse(File in, String charsetName) | Parse the contents of a file as HTML. |
| static Document | parse(File in, String charsetName, String baseUri) | Parse the contents of a file as HTML. |
| static Document | parse(InputStream in, String charsetName, String baseUri) | Read an input stream, and parse it to a Document. |
| static Document | parse(InputStream in, String charsetName, String baseUri, Parser parser) | Read an input stream, and parse it to a Document. |
| static Document | parse(String html) | Parse HTML into a Document. |
| static Document | parse(String html, String baseUri) | Parse HTML into a Document. |
| static Document | parse(String html, String baseUri, Parser parser) | Parse HTML into a Document, using the provided Parser. |
| static Document | parse(URL url, int timeoutMillis) | Fetch a URL, and parse it as HTML. |
| static Document | parseBodyFragment(String bodyHtml) | Parse a fragment of HTML, with the assumption that it forms the body of the HTML. |
| static Document | parseBodyFragment(String bodyHtml, String baseUri) | Parse a fragment of HTML, with the assumption that it forms the body of the HTML. |
-
Method Details
-
parse
public static Document parse(String html, String baseUri)
Parse HTML into a Document. The parser will make a sensible, balanced document tree out of any HTML.
- Parameters:
html - HTML to parse
baseUri - The URL where the HTML was retrieved from. Used to resolve relative URLs to absolute URLs when they occur before the HTML declares a <base href> tag.
- Returns:
- sane HTML
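To illustrate how the base URI is used, here is a minimal sketch that parses a fragment and asks the first link for its absolute URL. The class and method names (`BaseUriExample`, `firstAbsoluteLink`) are hypothetical and exist only for this example; `Element.absUrl` is the jsoup call assumed to do the resolution.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical example class, not part of jsoup.
final class BaseUriExample {
    // Parses the HTML against the given base URI and returns the first
    // link's absolute URL, or an empty string if there is no link.
    static String firstAbsoluteLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a[href]").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<p>See the <a href=\"/docs/\">docs</a>";
        // The relative href is resolved against the base URI.
        System.out.println(firstAbsoluteLink(html, "http://example.com/"));
    }
}
```

Note that the input here has an unclosed `<p>` tag; the parser is expected to balance it, as described above.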
-
parse
public static Document parse(String html, String baseUri, Parser parser)
Parse HTML into a Document, using the provided Parser. You can provide an alternate parser, such as a simple XML (non-HTML) parser.
- Parameters:
html - HTML to parse
baseUri - The URL where the HTML was retrieved from. Used to resolve relative URLs to absolute URLs when they occur before the HTML declares a <base href> tag.
parser - alternate parser to use.
- Returns:
- sane HTML
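As a sketch of the alternate-parser case, the following passes `Parser.xmlParser()` so the input is treated as XML rather than HTML. The class name `XmlParseExample`, the helper `parseXml`, and the sample feed are assumptions made for illustration; the empty base URI simply means relative URLs are left unresolved.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

// Hypothetical example class, not part of jsoup.
final class XmlParseExample {
    // Parses the input with the XML parser instead of the default HTML parser.
    static Document parseXml(String xml) {
        return Jsoup.parse(xml, "", Parser.xmlParser());
    }

    public static void main(String[] args) {
        Document doc = parseXml("<feed><entry><title>One</title></entry></feed>");
        // The XML parser does not wrap the content in html/head/body elements.
        System.out.println(doc.select("entry > title").text());
        System.out.println(doc.select("body").isEmpty());
    }
}
```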
-
parse
public static Document parse(String html)
Parse HTML into a Document. As no base URI is specified, absolute URL detection relies on the HTML including a <base href> tag.
- Parameters:
html - HTML to parse
- Returns:
- sane HTML
- See Also:
parse(String, String)
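The effect of omitting the base URI can be sketched as follows: a relative link resolves only when the markup itself declares a `<base href>`. The class `BaseHrefExample` and its helper are hypothetical names, and the sample markup is invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical example class, not part of jsoup.
final class BaseHrefExample {
    // Returns the absolute URL of the first link, or "" when it cannot be resolved.
    static String firstLink(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.select("a[href]").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String withBase = "<html><head><base href=\"http://example.com/\"></head>"
                + "<body><a href=\"page.html\">x</a></body></html>";
        String withoutBase = "<a href=\"page.html\">x</a>";
        System.out.println(firstLink(withBase));              // resolved via <base href>
        System.out.println(firstLink(withoutBase).isEmpty()); // nothing to resolve against
    }
}
```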
-
connect
public static Connection connect(String url)
Creates a new Connection to a URL. Use to fetch and parse an HTML page.
Use examples:
Document doc = Jsoup.connect("http://example.com").userAgent("Mozilla").data("name", "jsoup").get();
Document doc = Jsoup.connect("http://example.com").cookie("auth", "token").post();
- Parameters:
url - URL to connect to. The protocol must be http or https.
- Returns:
- the connection. You can add data, cookies, and headers; set the user-agent, referrer, method; and then execute.
-
parse
public static Document parse(File in, String charsetName, String baseUri) throws IOException
Parse the contents of a file as HTML.
- Parameters:
in - file to load HTML from
charsetName - (optional) character set of file contents. Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do).
baseUri - The URL where the HTML was retrieved from, to resolve relative links against.
- Returns:
- sane HTML
- Throws:
IOException - if the file could not be found, or read, or if the charsetName is invalid.
-
parse
public static Document parse(File in, String charsetName) throws IOException
Parse the contents of a file as HTML. The location of the file is used as the base URI to qualify relative URLs.
- Parameters:
in - file to load HTML from
charsetName - (optional) character set of file contents. Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do).
- Returns:
- sane HTML
- Throws:
IOException - if the file could not be found, or read, or if the charsetName is invalid.
- See Also:
parse(File, String, String)
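A minimal sketch of parsing a file, assuming the file is UTF-8 encoded. The class `FileParseExample`, its helpers, and the temporary file are hypothetical and exist only to make the example self-contained; `demo()` wraps the checked `IOException` purely for convenience.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical example class, not part of jsoup.
final class FileParseExample {
    // Parses the file as UTF-8; the file location serves as the base URI.
    static String titleOf(File file) throws IOException {
        Document doc = Jsoup.parse(file, "UTF-8");
        return doc.title();
    }

    // Writes sample HTML to a temporary file, parses it, and returns the title.
    static String demo() {
        try {
            Path path = Files.createTempFile("jsoup-example", ".html");
            Files.write(path, "<title>Saved page</title><p>Body".getBytes(StandardCharsets.UTF_8));
            File file = path.toFile();
            file.deleteOnExit();
            return titleOf(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

To resolve links against the page's original address instead of the file location, the three-argument overload parse(File, String, String) takes an explicit base URI.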
-
parse
public static Document parse(InputStream in, String charsetName, String baseUri) throws IOException
Read an input stream, and parse it to a Document.
- Parameters:
in - input stream to read. Make sure to close it after parsing.
charsetName - (optional) character set of file contents. Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do).
baseUri - The URL where the HTML was retrieved from, to resolve relative links against.
- Returns:
- sane HTML
- Throws:
IOException - if the file could not be found, or read, or if the charsetName is invalid.
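Because the caller is responsible for closing the stream, a try-with-resources block is one way to handle it. The sketch below assumes an in-memory UTF-8 byte array as the source; `StreamParseExample` and `parseBytes` are hypothetical names, and the `IOException` is rethrown unchecked only to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical example class, not part of jsoup.
final class StreamParseExample {
    // Parses UTF-8 bytes from a stream and closes the stream afterwards.
    static Document parseBytes(byte[] bytes, String baseUri) {
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            return Jsoup.parse(in, "UTF-8", baseUri);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = "<title>Hi</title><p>Hello".getBytes(StandardCharsets.UTF_8);
        Document doc = parseBytes(bytes, "http://example.com/");
        System.out.println(doc.title());
    }
}
```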
-
parse
public static Document parse(InputStream in, String charsetName, String baseUri, Parser parser) throws IOException
Read an input stream, and parse it to a Document. You can provide an alternate parser, such as a simple XML (non-HTML) parser.
- Parameters:
in - input stream to read. Make sure to close it after parsing.
charsetName - (optional) character set of file contents. Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do).
baseUri - The URL where the HTML was retrieved from, to resolve relative links against.
parser - alternate parser to use.
- Returns:
- sane HTML
- Throws:
IOException - if the file could not be found, or read, or if the charsetName is invalid.
-
parseBodyFragment
public static Document parseBodyFragment(String bodyHtml, String baseUri)
Parse a fragment of HTML, with the assumption that it forms the body of the HTML.
- Parameters:
bodyHtml - body HTML fragment
baseUri - URL to resolve relative URLs against.
- Returns:
- sane HTML document
- See Also:
Document.body()
-
parseBodyFragment
public static Document parseBodyFragment(String bodyHtml)
Parse a fragment of HTML, with the assumption that it forms the body of the HTML.
- Parameters:
bodyHtml - body HTML fragment
- Returns:
- sane HTML document
- See Also:
Document.body()
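A short sketch of fragment parsing: the fragment is placed inside the body of a new document, and Document.body() gives access to the result. `BodyFragmentExample` and `topLevelElements` are hypothetical names chosen for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical example class, not part of jsoup.
final class BodyFragmentExample {
    // Counts the element children that the fragment produces inside <body>.
    static int topLevelElements(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        return doc.body().children().size();
    }

    public static void main(String[] args) {
        // Two unclosed paragraphs become two sibling <p> elements in the body.
        System.out.println(topLevelElements("<p>One<p>Two"));
    }
}
```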
-
parse
public static Document parse(URL url, int timeoutMillis) throws IOException
Fetch a URL, and parse it as HTML. Provided for compatibility; in most cases use connect(String) instead.
The encoding character set is determined by the content-type header or http-equiv meta tag, or falls back to UTF-8.
- Parameters:
url - URL to fetch (with a GET). The protocol must be http or https.
timeoutMillis - Connection and read timeout, in milliseconds. If exceeded, IOException is thrown.
- Returns:
- The parsed HTML.
- Throws:
MalformedURLException - if the request URL is not a HTTP or HTTPS URL, or is otherwise malformed
HttpStatusException - if the response is not OK and HTTP response errors are not ignored
UnsupportedMimeTypeException - if the response mime type is not supported and those errors are not ignored
SocketTimeoutException - if the connection times out
IOException - if a connection or read error occurs
- See Also:
connect(String)
-
clean
public static String clean(String bodyHtml, String baseUri, Whitelist whitelist)
Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes.
- Parameters:
bodyHtml - input untrusted HTML (body fragment)
baseUri - URL to resolve relative URLs against
whitelist - white-list of permitted HTML elements
- Returns:
- safe HTML (body fragment)
- See Also:
Cleaner.clean(Document)
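As a sketch of how the base URI interacts with cleaning, the example below cleans a relative link using `Whitelist.basic()`. It assumes a jsoup release that still ships `org.jsoup.safety.Whitelist` (later releases rename the class to `Safelist`) and assumes the default whitelist setting in which relative links are not preserved, so the link is expected to come back in absolute form. `CleanWithBaseUriExample` is a hypothetical name.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

// Hypothetical example class, not part of jsoup.
final class CleanWithBaseUriExample {
    // Cleans untrusted HTML, resolving relative URLs against the base URI.
    static String cleanAgainst(String untrustedHtml, String baseUri) {
        return Jsoup.clean(untrustedHtml, baseUri, Whitelist.basic());
    }

    public static void main(String[] args) {
        String untrusted = "<a href=\"/foo\" onclick=\"evil()\">Link</a>";
        // The onclick attribute is not whitelisted and should be dropped.
        System.out.println(cleanAgainst(untrusted, "http://example.com/"));
    }
}
```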
-
clean
public static String clean(String bodyHtml, Whitelist whitelist)
Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes.
- Parameters:
bodyHtml - input untrusted HTML (body fragment)
whitelist - white-list of permitted HTML elements
- Returns:
- safe HTML (body fragment)
- See Also:
Cleaner.clean(Document)
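A minimal sketch of the two-argument form, again assuming a jsoup release that still provides `org.jsoup.safety.Whitelist`. The class name `CleanExample` and the sample input are illustrative only; `Whitelist.basic()` is one of the library's preset whitelists.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

// Hypothetical example class, not part of jsoup.
final class CleanExample {
    // Filters untrusted HTML through the basic whitelist of text-level tags.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Whitelist.basic());
    }

    public static void main(String[] args) {
        String untrusted =
                "<p><a href=\"http://example.com/\" onclick=\"stealCookies()\">Link</a></p>"
                + "<script>alert(1)</script>";
        // Script elements and event-handler attributes are not in the whitelist.
        System.out.println(sanitize(untrusted));
    }
}
```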
-
clean
public static String clean(String bodyHtml, String baseUri, Whitelist whitelist, Document.OutputSettings outputSettings)
Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes.
The HTML is treated as a body fragment; it's expected the cleaned HTML will be used within the body of an existing document. If you want to clean full documents, use Cleaner.clean(Document) instead, and add structural tags (html, head, body, etc.) to the whitelist.
- Parameters:
bodyHtml - input untrusted HTML (body fragment)
baseUri - URL to resolve relative URLs against
whitelist - white-list of permitted HTML elements
outputSettings - document output settings; use to control pretty-printing and entity escape modes
- Returns:
- safe HTML (body fragment)
- See Also:
Cleaner.clean(Document)
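The output settings parameter can be sketched with pretty-printing turned off, so the cleaned fragment is serialized without added line breaks. `CompactCleanExample` is a hypothetical name, the empty base URI is an assumption made for brevity, and the example assumes a jsoup release that still provides `org.jsoup.safety.Whitelist`.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist;

// Hypothetical example class, not part of jsoup.
final class CompactCleanExample {
    // Cleans the fragment and serializes it without pretty-printing.
    static String cleanCompact(String untrustedHtml) {
        Document.OutputSettings settings = new Document.OutputSettings().prettyPrint(false);
        return Jsoup.clean(untrustedHtml, "", Whitelist.basic(), settings);
    }

    public static void main(String[] args) {
        System.out.println(cleanCompact("<p>One</p><p>Two</p><script>alert(1)</script>"));
    }
}
```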
-
isValid
public static boolean isValid(String bodyHtml, Whitelist whitelist)
Test if the input body HTML has only tags and attributes allowed by the Whitelist. Useful for form validation.
The input HTML should still be run through the cleaner to set up enforced attributes, and to tidy the output.
Assumes the HTML is a body fragment (i.e. will be used in an existing HTML document body).
- Parameters:
bodyHtml - HTML to test
whitelist - whitelist to test against
- Returns:
- true if no tags or attributes were removed; false otherwise
- See Also:
clean(String, org.jsoup.safety.Whitelist)
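A sketch of the form-validation use: reject input that contains anything outside the whitelist. `ValidationExample` and `accepts` are hypothetical names, and the example assumes a jsoup release that still provides `org.jsoup.safety.Whitelist`.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

// Hypothetical example class, not part of jsoup.
final class ValidationExample {
    // Returns true when the fragment uses only tags and attributes in the basic whitelist.
    static boolean accepts(String bodyHtml) {
        return Jsoup.isValid(bodyHtml, Whitelist.basic());
    }

    public static void main(String[] args) {
        System.out.println(accepts("<p>Hello <b>there</b></p>"));   // only whitelisted tags
        System.out.println(accepts("<p onclick=\"x()\">Hi</p>"));   // attribute not whitelisted
        System.out.println(accepts("<script>alert(1)</script>"));   // tag not whitelisted
    }
}
```

Input that passes this check would still go through clean(String, Whitelist) before output, since the cleaner also applies enforced attributes.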
-