Q:
XPath querying an HTML page in Java: how to get the text
I'm working with XPath. I know how to parse an HTML page and retrieve information, but now I'm stuck: how can I get this whole text?
What I tried (this approach is wrong):
String text = document.evaluate("//div[@class='foto']",document, XPathConstants.NODESET);
Any help, please?
A:
XPath 1.0 is a query language for XML, so it should be used on a well-formed XML document. It will not work reliably on arbitrary HTML, which is often not well-formed XML.
If you want to use it on HTML, use another library. For example, xhtmlquery provides HTML parsing. I have no experience with it, but it may work for you.
Otherwise, use something that turns the HTML into well-formed XML that XPath 1.0 can query, such as JTidy, and then run the query with an XPath/XSLT processor such as Xalan.
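Once the page is well-formed XML, the standard javax.xml.xpath API can return the text directly. Here is a minimal sketch, assuming the markup is already well-formed; the sample string, the class name DivTextExample and the helper evaluateAsString are made up for illustration. Note that evaluate is called on an XPath object, not on the document, and that XPathConstants.STRING (rather than NODESET) yields a String.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DivTextExample {
    // Hypothetical sample markup; it is well-formed, so the JDK XML parser accepts it.
    static final String XHTML =
        "<html><body><div class='foto'>fotogaleria <b>2013</b></div></body></html>";

    // Parses the XML and returns the string-value of the first node matched by the expression.
    static String evaluateAsString(String xml, String expression) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document = builder.parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // With STRING, the result is the concatenated text of the element and its descendants.
            return (String) xpath.evaluate(expression, document, XPathConstants.STRING);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(evaluateAsString(XHTML, "//div[@class='foto']"));
    }
}
```

With XPathConstants.NODESET the same call returns an org.w3c.dom.NodeList instead, which has to be cast and iterated.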
A:
If your element is within a div, use this XPath:
//div[. = 'fotogaleria']/div/text()
If not, you can try this:
//text()[contains(., 'fotogaleria')]
You could also use text() if the text sits directly inside an element and is not nested any deeper.
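To try text() expressions like these from Java, the query can be evaluated as a node set and each text node read in turn. This is only a sketch under the assumption that the input is well-formed; the sample markup, the class name TextNodeExample and the helper textNodes are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TextNodeExample {
    // Hypothetical sample markup with one direct text child and one nested element.
    static final String XHTML =
        "<html><body><div class='foto'>fotogaleria<span>nested</span></div><p>other</p></body></html>";

    // Returns the value of every node matched by the expression.
    static List<String> textNodes(String xml, String expression) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, document, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeValue());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Text nodes anywhere in the document that contain the word.
        System.out.println(textNodes(XHTML, "//text()[contains(., 'fotogaleria')]"));
        // Only the text directly inside the div, not the nested span.
        System.out.println(textNodes(XHTML, "//div[@class='foto']/text()"));
    }
}
```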
A:
If it were an XML document, you could use an XML pull parser. Something like this:
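// Assumes the XmlPull API (org.xmlpull.v1) and a well-formed input file.
XmlPullParserFactory factory = XmlPullParserFactory.newInstance();
XmlPullParser xpp = factory.newPullParser();
try (Reader reader = new FileReader("some.html")) {
    xpp.setInput(reader);
    int eventType = xpp.getEventType();
    // Advance event by event until the end of the document.
    while (eventType != XmlPullParser.END_DOCUMENT) {
        if (eventType == XmlPullParser.START_TAG && xpp.getName().equals("img")) {
            // Do your stuff
            // and extract the element content by iterating over its attributes
            for (int i = 0; i < xpp.getAttributeCount(); i++) {
                String name = xpp.getAttributeName(i);
                String value = xpp.getAttributeValue(i);
                // process each attribute here
            }
        }
        eventType = xpp.next();
    }
}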
<T> T asText(final HtmlPage page, final String cssSelector) {
    final HtmlTextNodeList textNodes = ((BodyContent) page.getBody()).getHtmlTextNodeListBySelector(cssSelector);
    if (textNodes == null) {
        return null;
    } else {
        return (T) textNodes.asText();
    }
}

private static HtmlText extractText(final Tag html, final String cssSelector) throws CSSException, IOException {
    final HtmlElement htmlElement = html.getHtmlElementById(cssSelector);
    return HtmlText.valueOf(htmlElement.asText());
}
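The pull-parsing loop at the start of this answer can also be tried without extra dependencies using the JDK's built-in StAX API (javax.xml.stream), which is a different pull parser from the one named above. The following is a minimal sketch, assuming well-formed input; the sample markup, the class name ImgAttributeExample and the helper firstImgAttributes are hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ImgAttributeExample {
    // Hypothetical sample markup; the img element is self-closed so the document is well-formed.
    static final String XHTML =
        "<html><body><div class='foto'><img src='a.jpg' alt='first'/></div></body></html>";

    // Pulls events until the first img start tag and collects its attributes.
    static Map<String, String> firstImgAttributes(String xml) {
        Map<String, String> attributes = new LinkedHashMap<>();
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT
                            && reader.getLocalName().equals("img")) {
                        for (int i = 0; i < reader.getAttributeCount(); i++) {
                            attributes.put(reader.getAttributeLocalName(i),
                                           reader.getAttributeValue(i));
                        }
                        break; // stop after the first img element
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Input is not well-formed XML", e);
        }
        return attributes;
    }

    public static void main(String[] args) {
        System.out.println(firstImgAttributes(XHTML));
    }
}
```

Like any XML pull parser, this fails on HTML that is not well-formed, so real-world pages would still need cleaning up first.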