Re: parse website _cliptext for name, email, web URL, phone, cellphone, street, city, zip

From: gerry (germ_at_hotmail.com)
Date: 02/21/05


Date: Mon, 21 Feb 2005 14:55:33 -0500


> 1)
> http://www.jupiterarea.com/memship_roster.html

wow - now that is one absolute piece of crap web page.
every member address section is wrapped in 200+ ( just a guess ) FONT tags !!!!
no wonder it takes this page so long to load.
and there are maybe 6 or more ( unintentionally ? ) different formatting
styles used.

anyway, I didn't look through all that code, but I have done some html
parsing ( lottery results ). if all you are doing is parsing data from the
html text, look into using CHRTRAN and STRTRAN to get rid of any unnecessary
noise from the data, and STREXTRACT to 'chunk up' the data and then zero
in on the parts that you want.
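
for anyone who hasn't used these three functions, here is a tiny sketch of what each one does. the sample string and the variable names are made up for illustration, they are not taken from the roster page:

```foxpro
* made-up sample string, just to show what each function returns
LOCAL lcRaw, lcClean, lcCell
lcRaw = "<td><B>12</B></td>" + CHR(9) + "<td><B>27</B></td>"

* CHRTRAN maps single characters: here each tab becomes a space
lcClean = CHRTRAN(m.lcRaw, CHR(9), " ")

* STRTRAN replaces whole substrings: drop the bold tags
lcClean = STRTRAN(m.lcClean, "<B>", "")
lcClean = STRTRAN(m.lcClean, "</B>", "")

* STREXTRACT returns the text between two delimiters;
* the 4th parameter picks which occurrence to return
lcCell = STREXTRACT(m.lcClean, "<td>", "</td>", 2)
? m.lcCell   && 27
```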

here is an actual code snippet that parses lottery results from a retrieved
html page. maybe i got lucky with having to parse much cleaner data, but
this code is much much simpler and shorter than what you seem to be using :

 * data is in 1st table following FORM
 m.htm=STREXTRACT(m.htm,'<TABLE WIDTH=100% BORDER=1 BGCOLOR="#F5F5F5">',"</table>",1,3)
 IF empty(m.htm) THEN
  WAIT WINDOW m.fname+CHR(10)+"Result table not found !!"
  SET STEP ON
 ENDIF
 m.htm=STRTRAN(m.htm,"<B>","",1,999,3)
 m.htm=STRTRAN(m.htm,"</B>","",1,999,3)

 * tabs become spaces; CR and LF are dropped (the replacement string is one char long)
 m.htm=CHRTRAN(m.htm,CHR(9)+CHR(10)+CHR(13)," ")
 * collapse runs of spaces down to a single space
 DO WHILE "  " $ m.htm
  m.htm=STRTRAN(m.htm,"  "," ")
 ENDDO
 m.htm=STRTRAN(m.htm,"> <","><")

 * extract data one table row at a time
 LOCAL rcnt , rw , datafound
 LOCAL dt,n1,n2,n3,n4,n5,n6,b
 m.datafound=.f.
 FOR m.rcnt=1 TO 999999

  * get next row in table
  m.rw=STREXTRACT(m.htm,"<tr>","</tr>",m.rcnt,1)
  IF len(m.rw)=0 then
   EXIT
  ENDIF

  * are we interested in this row ?
  IF LEFT(m.rw,41) == '<TD VALIGN=MIDDLE ROWSPAN=5 CLASS="body">' THEN
   * parse out fields
   m.dt=STREXTRACT(m.rw,">","<",1,1)
   m.n1=STREXTRACT(m.rw,">","<",3,1)
   m.n2=STREXTRACT(m.rw,">","<",5,1)
   m.n3=STREXTRACT(m.rw,">","<",7,1)
   m.n4=STREXTRACT(m.rw,">","<",9,1)
   m.n5=STREXTRACT(m.rw,">","<",11,1)
   m.n6=STREXTRACT(m.rw,">","<",13,1)
   m.b=STREXTRACT(m.rw,">","<",15,1)

   IF !SEEK( m.dt , "lotto649" , "dt" ) THEN
    APPEND BLANK
   ENDIF
   GATHER memvar
  ENDIF

 NEXT
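
for the roster page itself, STRTRAN alone may not be enough: it only removes fixed strings, and I am assuming ( I have not checked ) that those FONT tags carry attributes that differ from tag to tag. a rough sketch of a tag stripper that uses only AT, SUBSTR and STUFF follows. StripTags is a made-up helper name and the sample string is invented, so treat it as a starting point and not as tested code for that page:

```foxpro
* hypothetical usage with an invented sample string
? ALLTRIM(StripTags('<FONT SIZE=2 FACE="Arial">John Doe</FONT>'))   && John Doe

* StripTags is a made-up helper, not a built-in function.
* it replaces every <...> run with a single space so that the
* words on either side of a tag do not run together.
FUNCTION StripTags(tcHtml)
 LOCAL lcOut, lnOpen, lnLen
 lcOut = m.tcHtml
 lnOpen = AT("<", m.lcOut)
 DO WHILE m.lnOpen > 0
  * length of the tag, counted from "<" up to and including ">"
  lnLen = AT(">", SUBSTR(m.lcOut, m.lnOpen))
  IF m.lnLen = 0
   * unterminated tag: leave the rest of the text alone
   EXIT
  ENDIF
  lcOut = STUFF(m.lcOut, m.lnOpen, m.lnLen, " ")
  lnOpen = AT("<", m.lcOut)
 ENDDO
 RETURN m.lcOut
ENDFUNC
```

this will trip over a ">" inside an attribute value, and it leaves runs of spaces behind, so a space-collapsing loop like the one in the snippet above would still be needed afterwards.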

"JJ" <jjyg@adelphia.net> wrote in message
news:vhck11l7bsvs19rtem7auf19eolke4e7kd@4ax.com...
> On Fri, 18 Feb 2005 23:33:39 GMT, "darrell" <someone@somewhere.com>
> wrote:
>
> >If you post some sites, I can post a little code showing a methodology to
> >scrape the page using IE.
>
> 1)
> http://www.jupiterarea.com/memship_roster.html
>
> 2)
> In this example you must search on a category to get to the address
> page:
>
> http://www.npbchamber.com/index.php?category=chamber&section=membership&sub=directory
>
> 3)
> http://www.mmsi-ecom.com/rmls/roster/roster.pl?Param=rmls-roster&pass=firmSearch&FIRM=Coldwell&CITY=&ZIP=&SORT=FIRMNAME&MAX=40&WEBPAGEONLY=
>
> 4)
> http://www.switchboard.com/
> (Find a person, business, or do a reverse lookup on a phone number to
> get address, zip, phone)
>
> >BTW/ What's the program's ultimate purpose?
>
> Data collection and cleanup.
>
> Eventually this data will be imported into something like ACT!,
> Goldmine, MSCRM for purposes of marketing mortgages to realtors.
>
> I'm not much of a FoxPro programmer but grew up on dBase and still
> find it easier to manipulate this data, eliminate duplicate records,
> merge records from several sources, add details like cellphone,
> email, website, home address from various web sites in to Visual
> FoxPro 6 Browse Windows, using _cliptext, my mouse, the function keys
> etc.
>
> I had it pretty automated but am rewriting it in a procedure file.
>
> I have a program that opens the database several times in browse
> windows and sets relations since I bounce back and forth in the way I
> look at the data.
>
> I use the function keys like this:
>
> ***********************
> *keylabset.prg
> on key label CTRL+W ? chr(7)
> on key label CTRL+Q do appgatcomp
> on key label rightmouse do cleanup && was sbphone.prg
> on key label f1 do cleanup && sbphone
> on key label f2 do scatter
> on key label f3 do gather
> on key label f4 do appgat
> on key label f5 do scatcomp
> on key label f6 do gatcomp
> on key label f7 do appgatcomp
> on key label f8 browse last nowait
> *on key label f8 do browslave
> on key label f9 do scatrealhome
> on key label f10 do gatrealhome
> *on key label f11 do realtolast
> on key label f11 do lasttoreal
> on key label f12 do dedup
> return
> **********************
>
> Then I'm building this Procedure file bookproc.prg which will be used
> when I right click on a particular record in one of the browse windows
> after copying data from a web site to the clipboard
> **************************
> * bookproc.prg procedure file called by setup
> *********
> PROCEDURE CleanUp
> clear
> *set memowidth to 100
> public dirty, split, CutSource
> public VarName, mname, mphone, mcell, memail, mweb, mfax, mcompany, ;
>  mstreet, mcity, mzip
> store "" to dirty, split, CutSource, mname, mphone, mcell, memail, ;
>  mweb, mfax, mcompany, mstreet, mcity, mzip
> dirty=alltrim(_cliptext)
> x = ""
> FOR Ctr = 1 TO 31
> If not Ctr=13
> && ignore/keep carriage returns
> x = x + CHR(Ctr)
> && x keeps growing to include all offending characters
> Endif
> NEXT
> && next must be same as EndFor
> dirty = CHRTRAN(dirty,x,"")
> && actual replacement takes place HERE
> do ShowVar with "dirty"
> For Ctr = 1 to Memlines(dirty)
> test=mline(dirty,ctr)
> Do SplitLine
> Endfor
> do ShowVar with "split"
> For Ctr = 1 to Memlines(split)
> test=mline(split,ctr)
> Do Paste
> Endfor
> RETURN
> *********
> PROCEDURE LastToReal
> parameters plast
> && experimental code
> If rectype="R" and rlast=" " and real =" " && *
> clipx=alltrim(clipx)
> repl rlast with plast
> && with clipx
> keyboard '{tab}' && if rlast is highlighted field replacement doesn't finish till you tab out of the field
> keyboard '{backtab}'
> repl real with trim(substr(rlast,at(", ",rlast)+2))+" "+ ;
>  substr(rlast,1,at(",",rlast)-1)
> If not "Jr." $real and not "Sr." $real
> repl real with strtran(real,".","")
> Endif
> Endif
> ? CHR(7)
> return
> *********
> PROCEDURE Email
> parameters pmail
> && experimental code
> pmail=strtran(lower(strtran(strtran(pmail," ",""),":","")),"email","")
> *pmail=strtran(pmail,"mailto","")
> Do Case
> Case empty(remail)
> repl remail with pmail
> && clipx
> Case alltrim(remail) $pmail
> && clipx
> repl remail with pmail
> && clipx
> Case pmail $remail
> && clipx
> set bell to uhoh
> ? chr(7)
> Wait
> Wait Window pmail+" already in remail" timeout 5 && clipx
> set bell to chimes
> Case len(pmail+" "+trim(remail2nd))<=Fsize('remail2nd')
> && clipx
> repl remail2nd with pmail+" "+trim(remail2nd)
> && clipx
> Case len(pmail+" "+trim(remail))<=Fsize('remail')
> && clipx
> repl remail with pmail+" "+trim(remail)
> && clipx
> Otherwise
> set bell to uhoh
> ? chr(7)
> set bell to chimes
> _cliptext=pmail
> wait
> wait window " pmail =>_cliptext too big CTRL+V to paste "+_cliptext
> EndCase
> *********
> Procedure Fax
> parameters pfax
> pfax=strtran(strtran(strtran(strtran(strtran(lower(pfax),"fax",""),":",""),")","-"),"(")," ","")
> do NotCoded with "Fax, memlines(split)=" +ltrim(str(memlines(split)))
> *********
> PROCEDURE NotCoded
> parameter pfrom
> do UhOh
> Wait Window "Not Coded, Procedure="+pfrom
> Return
> *********
> PROCEDURE Paste
> Do Case
> Case memlines(split)=1
>
> Do Case
> Case "@" $split or "email" $lower(split)
> do Email with split
> Case ", " $split
> do LastToReal with split
> Case "fax"$lower(split)
> do Fax with split
> Case "cell" $lower(split) or "mobile"$lower(split)
> do PhoneCell with split
> Case "Other: " $split && Other rphone from M=MMSI
> do PhoneOther with split
> Case ("Sorry" $split or "people" $split) and not empty(rzip) && S: no people rzip or people wrong street
> If empty(rphone)
> repl rphone with "NoS@add"
> Else
> repl remail with trim(remail) +" NoS@add"
> Endif
> ? chr(7)
> Case "match" $split and not empty(rphone)
> && S: nobody found @ rphone
> repl remail with trim(remail)+" NoSadd4Ph"
> ? chr(7)
> Case "person" $split and empty(rphone) and empty(rzip) && #rphone #radd First Last Failed
> repl rphone with "NoS4Nm"
> ? chr(7)
> Case "Inquiry" $split or "Quick" $split
> repl remail with trim(remail)+" NoM", recsource with ;
>  strtran(recsource,"M","") && get rid of M cause not on file at MMSI
> ? chr(7)
> Otherwise
> do NotCoded with "Paste, memlines(split)=1"
> Endcase
>
> Case memlines(split)=2
>
> Do Case
> Case ", "$mline(split,1) and "@"$mline(split,2)
> do LastToReal with mline(split,1)
> do email with mline(split,2)
> Otherwise
> do NotCoded with "Paste, memlines(split)=2, Otherwise"
> Endcase
>
> Case memlines(split)=3
> do NotCoded with "Paste, memlines(split)=3"
> Case memlines(split)=4
> do NotCoded with "Paste, memlines(split)=4"
> Case memlines(split)=5
> do NotCoded with "Paste, memlines(split)=5"
> Case memlines(split)=6
> do NotCoded with "Paste, memlines(split)=6"
> Case memlines(split)=7
> do NotCoded with "Paste, memlines(split)=7"
> Case memlines(split)=8
> do NotCoded with "Paste, memlines(split)=8"
> Otherwise
> do NotCoded with "Paste, Otherwise3"
> EndCase
> *********
> PROCEDURE PhoneCell
> parameters Pcell
> pcell=strtran(strtran(strtran(strtran(strtran(pcell," ",""),".","-"),":",""),"(",""),")","-")
> pcell=strtran(strtran(strtran(pcell,"cellular",""),"mobile",""),"cell","")
> repl Rcellphone with pcell
> If not "W" $recsource
> repl recsource with trim(recsource)+"W"
> Endif
> ? chr(7)
> *********
>
> PROCEDURE PhoneOther
> parameter pother
> If empty(rphone) and empty(rphsource)
> repl rphsource with "M", rphone with substr(split,at("Other: ",pother)+7,12)
> Else
> repl remail2nd with ltrim(trim(remail2nd)+" M"+substr(split,at("Other: ",pother)+7,12))
> Endif
> If not "M"$recsource
> repl recsource with "M"+recsource
> Endif
> ? chr(7)
> *********
> PROCEDURE ShowVar
> && do ShowVarjj with "dirty" so you can display pshow
> parameters pshow
> ?
> ? "CutSource: ", Cutsource
> ? pshow
> local pval
> pval = evaluate(pshow)
> For Ctr = 1 to Memlines(pval)
> ? ctr, mline(pval,ctr)
> EndFor
> ENDPROC
> *********
> PROCEDURE ShowVarjj
> && do ShowVarjj with dirty (no quotes around dirty)
> parameters pshow
> ?
> ? "CutSource: ", Cutsource
> For Ctr = 1 to Memlines(pshow)
> ? ctr, mline(pshow,ctr)
> EndFor
> ?
> *********
> PROCEDURE SplitLine
> && to further parse lines containing more than one field
> Do Case
> Case ", FL " $test and "Phn: "$test && West Palm Bch, FL 33401-7918 Phn: 561-832-4663
> test=strtran(test,", FL ",chr(13)+"FL"+chr(13))
> test =strtran(test," Phn: ",chr(13)+"Phn: ")
> CutSource="MFirm"
> Case ", FL " $test
> test=strtran(test,", FL ",chr(13)+"FL"+chr(13))
> CutSource="Address"
> Case "@"$test and "Office: "$test
> test=strtran(test," Office: ",chr(13)+"Office: ")
> CutSource="Mmember"
> EndCase
> If not ''=split
> split=split+chr(13)+test
> Else
> split=test
> Endif
> *********
> PROCEDURE UhOh
> set bell to UhOh
> ? chr(7)
> set bell to chimes
> *********
>
>
>
> *************************
> John "J.J." Jackson


