Java Mailing List Archive

http://www.junlu.com/


[iText-questions] Extract URL

anishhp

2007-06-21


I have problems extracting embedded URLs and explicit URLs from a PDF.

To read embedded URLs I modified PdfReader.java so that each string token is checked for an "http" (or "www") prefix; if it matches, the string is recorded:

==================================================
case PRTokeniser.TK_STRING:
    PdfString str = new PdfString(tokens.getStringValue(), null).setHexWriting(tokens.isHexString());
    if (strings != null) {
        strings.add(str);
        // Compare case-insensitively; strings shorter than 10 characters are skipped.
        String content_url = str.toString().toLowerCase();
        if (content_url.length() >= 10
                && (content_url.startsWith("http") || content_url.startsWith("www"))) {
            // Record "count, url - page no - page".
            itext_buffer.append(url_count + ", ").append(str).append(" - page no - " + pagenumber).append("\n");
            url_count = url_count + 1;
        }
    }
    return str;
case PRTokeniser.TK_NAME:
    // Track the page number by counting occurrences of the name "Page".
    if (tokens.getStringValue() != null) {
        if (tokens.getStringValue().equals("Page"))
            pagenumber = pagenumber + 1;
    }
    return new PdfName(tokens.getStringValue(), false);
==================================================

To extract explicit URLs I read the document page by page and extract the URLs:
=========================================================
for (int k = 1; k <= pages; k++) {
    sb = new StringBuffer();
    arraydata = pdfreader.getPageContent(k);
    if (arraydata == null)
        return;
    str = new String(arraydata);
    Paragraph paragraph = new Paragraph(str);
    document.add(paragraph);
    // Part of the next line was lost in the archive (everything between "i" and "= 8").
    // The closing "end while" comment below indicates that it opened a while loop over
    // the words of the page, each held in `string`, followed by a length check.
    for(i=0;i= 8) {
            if ((string.indexOf('.') != -1) && ((string.indexOf('h') != -1) || (string.indexOf('w') != -1))) {
                // The loop header was also damaged in the archive; its bound is reconstructed
                // from the surviving checks. startsWith(prefix, j) replaces the original
                // charAt(j+1..j+3) comparisons, whose length() >= (j+3) guard could still
                // read one character past the end of the word.
                for (int j = 0; j < string.length(); j++) {
                    if (string.startsWith("http", j) || string.startsWith("www", j)) {
                        string = string.substring(j);
                        urls_buffer.append(urls_count + ", ").append(string).append(" - page no - " + k).append("\n");
                        urls_count = urls_count + 1;
                        break;
                    }
                } // end for
            } // end if
        } // end if
    } // end while
} // end for
=========================================================

My problem is that I cannot get the URLs out of the PDF in sequence. First the whole document is read through PdfReader, and the embedded URLs are extracted and stored in a string. Then the PDF is read page by page and the explicit URLs are extracted.

Can you suggest a better solution to this problem? How can I extract both kinds of URL together?

Regards,
Anish
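One possible direction, sketched here under assumptions rather than as a confirmed iText solution: embedded links are normally stored per page as link annotations (a /URI action in the page's /Annots array), so both kinds of URL can in principle be collected page by page and reported together. The sketch below covers only the collecting and ordering part. The class `UrlCollector` and its methods are hypothetical names, not part of iText, the regular expression is a deliberately crude replacement for the character-by-character checks above, and how the link targets and the page text are obtained from the reader is left open.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not an iText class): gathers URLs per page so that
// embedded and explicit URLs end up in one report, ordered by page.
class UrlCollector {
    // Crude pattern: anything starting with http://, https:// or www.
    // (case-insensitive) up to whitespace, a bracket or a quote.
    private static final Pattern URL = Pattern.compile(
            "\\b(?:https?://|www\\.)[^\\s()<>\"]+", Pattern.CASE_INSENSITIVE);

    // Page number -> URLs found on that page, in insertion order.
    // TreeMap keeps the pages sorted regardless of scan order.
    private final Map<Integer, List<String>> byPage = new TreeMap<>();

    // Returns every URL-like substring of text, left to right.
    static List<String> findUrls(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Scans one piece of text (a link target or the text of a page)
    // and files any URLs under the given page number.
    void scan(int page, String text) {
        List<String> urls = findUrls(text);
        if (!urls.isEmpty()) {
            byPage.computeIfAbsent(page, k -> new ArrayList<>()).addAll(urls);
        }
    }

    // Builds "count, url - page no - page" lines, numbered across all pages.
    String report() {
        StringBuilder sb = new StringBuilder();
        int count = 1;
        for (Map.Entry<Integer, List<String>> e : byPage.entrySet()) {
            for (String url : e.getValue()) {
                sb.append(count++).append(", ").append(url)
                  .append(" - page no - ").append(e.getKey()).append('\n');
            }
        }
        return sb.toString();
    }
}
```

Calling `scan(k, ...)` once with each link target of page k and once with the text of page k, then `report()` at the end, would give a single list in page order. Within a page, URLs appear in the order they were scanned, and a URL followed directly by a full stop or comma keeps that trailing character with this pattern.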

Sent from the iText - General mailing list archive at Nabble.com.
_______________________________________________
iText-questions mailing list
iText-questions@(protected)
https://lists.sourceforge.net/lists/listinfo/itext-questions