I have often needed to extract the content of an HTML page, either to index that content or to implement something like full-text search. I did this with the HttpClient component (Apache Commons HttpClient 3.x, package org.apache.commons.httpclient).

A simple example of using HttpClient

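import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.PostMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

public String getContentForUrl(String url) throws Exception {
  HttpClient client = new HttpClient();
  PostMethod method = new PostMethod(); // GetMethod is the usual choice for a plain page fetch
  method.setURI(new org.apache.commons.httpclient.URI(url, true)); // true: the URL is already escaped
  // retry up to 3 times, but do not resend a request that was already fully sent
  method.getParams().setParameter(HttpMethodParams.RETRY_HANDLER, new DefaultHttpMethodRetryHandler(3, false));
  try {
    client.executeMethod(method);
    return method.getResponseBodyAsString();
  } finally {
    method.releaseConnection(); // always hand the connection back
  }
}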
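
For indexing or full-text search, the markup in the returned string usually has to be reduced to plain text first. Below is a minimal sketch of that step using only the JDK. The class HtmlText and its method toPlainText are hypothetical names, not part of HttpClient, and the regex-based cleanup is an assumption that only holds for reasonably well-formed pages; a real HTML parser is the safer choice for anything else.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of HttpClient): reduces an HTML string to plain text.
final class HtmlText {
    // script and style elements carry no indexable text, so drop them together with their bodies
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String toPlainText(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = COMMENT.matcher(text).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        // Decode only a handful of common entities; this list is far from complete.
        // "&amp;" goes last so that "&amp;lt;" is decoded once, to "&lt;".
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
        // Collapse the runs of whitespace left behind by the removed markup.
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }
}
```

The result of getContentForUrl(url) can be passed straight to HtmlText.toPlainText(...) before handing the text to the indexer.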

An example of using HttpClient through a proxy server

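import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.UsernamePasswordCredentials;
import org.apache.commons.httpclient.auth.AuthScope;
import org.apache.commons.httpclient.methods.PostMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

// Pass proxyHost == null to connect directly, without a proxy.
public String getContentForUrl(String url, String proxyHost, int proxyPort,
    String proxyUser, String proxyPassword) throws Exception {
  HttpClient client = new HttpClient();
  PostMethod method = new PostMethod();

  method.setURI(new org.apache.commons.httpclient.URI(url, true));
  method.getParams().setParameter(HttpMethodParams.RETRY_HANDLER, new DefaultHttpMethodRetryHandler(3, false));

  // is a proxy being used?
  if (proxyHost != null) {
    client.getHostConfiguration().setProxy(proxyHost, proxyPort);
    AuthScope authScope = new AuthScope(proxyHost, proxyPort);
    UsernamePasswordCredentials proxyCredentials = new UsernamePasswordCredentials(proxyUser, proxyPassword);
    client.getState().setProxyCredentials(authScope, proxyCredentials);
    // send the credentials up front instead of waiting for an authentication challenge
    client.getParams().setAuthenticationPreemptive(true);
  }

  try {
    client.executeMethod(method);
    return method.getResponseBodyAsString();
  } finally {
    method.releaseConnection(); // always hand the connection back
  }
}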
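
On Java 11 or later, the same proxy setup can also be sketched with the JDK's built-in java.net.http.HttpClient. This is a swapped-in alternative, not the Commons HttpClient API used above, and the class name JdkProxyFetcher and its method names are hypothetical. One difference to keep in mind: here the credentials are supplied when the proxy asks for them, not preemptively.

```java
import java.net.Authenticator;
import java.net.InetSocketAddress;
import java.net.PasswordAuthentication;
import java.net.ProxySelector;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper built on the JDK HTTP client (Java 11+), not on Commons HttpClient.
final class JdkProxyFetcher {

    // Builds a client; pass proxyHost == null to connect directly.
    static HttpClient newClient(String proxyHost, int proxyPort, String user, String password) {
        HttpClient.Builder builder = HttpClient.newBuilder();
        if (proxyHost != null) {
            // Route all HTTP and HTTPS requests through the given proxy.
            builder.proxy(ProxySelector.of(new InetSocketAddress(proxyHost, proxyPort)));
            builder.authenticator(new Authenticator() {
                @Override
                protected PasswordAuthentication getPasswordAuthentication() {
                    // Answer only proxy challenges, never origin-server challenges.
                    if (getRequestorType() == RequestorType.PROXY) {
                        return new PasswordAuthentication(user, password.toCharArray());
                    }
                    return null;
                }
            });
        }
        return builder.build();
    }

    // Fetches the page with a GET request and returns the body as a string.
    static String getContentForUrl(HttpClient client, String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

A client built once with newClient(...) can be reused for many getContentForUrl(client, url) calls, which avoids repeating the proxy configuration for every page.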