#!/bin/9lua
-- under construction

p9=require("p9")
gen=require("gen")
dir=require("dir")
dav=require("dav")
net=require("net")
xio=require("xio")
http=require("http")

dprint = gen.printf
split = str.split
basename = gen.basename
readf = gen.readf
options = gen.options
getlen = http.getlen
test = dir.test
sub = string.sub
gsub = string.gsub
byte = string.byte
char = string.char
match = string.match
lower = string.lower
find = string.find
format = string.format
getheader = http.getheader
getfields = http.getfields
xferbody = http.xferbody
readchunkstr = http.readchunkstr
subfield = http.subfield
aread = http.aread

function hdial(url,head)
	-- url: http://host:port/path
	local netaddr,dir,ctl,data,path,port,addr,proto,host,fd
	local s,t
	gsub(url,"^(%w+)://([^/]+)/?(.*)",function(p,a,b)
		proto = p
		addr = a
		path = "/"..b
	end)
	if addr == nil then
		error("url:"..url)
	end
	s = split(addr,":")
	host = s[1]
	port = s[2]
	if port == nil then
		port = pmap[proto]
	end
	if port == nil then
		error("protocol")
	end
	netaddr = "tcp!"..host.."!"..port
	print("connecting to: "..netaddr)
	head._host = host
	head._port = port
	head._addr = addr
	head._path = path
	head._proto = proto
	if proto == "http" then
		dir,ctl = net.dial(netaddr)
		-- print("DBG: hdial:", dir,ctl)
		data = xio.open(dir.."/data","rw")
		data.ctl = ctl
		return data,head
		-- p9.close(data.ctl) -- don't forget this one!
	end
	if proto == "https" then
		fd = p9.popen("tlsclient "..netaddr,"rw")
		data = xio.open()
		data.fd = fd
		return data,head
	end
	error("unsupported protocol")
end

function mkgetreq(head)
	local req,path,host
	path = head._path
	host = head._host
	req = "GET "..path.." HTTP/1.1\r\n" ..
		"Host: "..host.."\r\n"
	-- keys beginning with "_" are internal bookkeeping, not request headers
	for k,v in pairs(head) do
		if sub(k,1,1) ~= "_" then
			req = req..k..": "..v .. "\r\n"
		end
	end
	return req.."\r\n"
end

function mkpostreq(head,postdata)
	local req, path,host
	path = head._path
	host = head._host
	req = "POST "..path.." HTTP/1.1\r\n" ..
		"Host: "..host.."\r\n"
	head["Content-Length"] = #postdata
	for k,v in pairs(head) do
		if sub(k,1,1) ~= "_" then
			req = req..k..": "..v .. "\r\n"
		end
	end
	return req.."\r\n"..postdata
end

function getchunk(fd0)
	-- return: real data (the chunk payloads, without size lines and CRLFs)
	local s,m,r,u
	local n = 0
	s = ""
	repeat
		u,r = areadln(fd0) -- chunk-size line, in hex
		m = tonumber(u,16)
		if m == nil then
			error("chunk")
		end
		if m > 0 then
			u = areadn(fd0,m+2) -- include trailing CRLF
			if sub(u,-1) ~= "\n" then
				error("chunk")
			end
			s = s..sub(u,1,-3) -- keep the payload, drop the CRLF
			n = n + #u -2
		end
	until m == 0
	-- read the line that follows the last chunk
	u,r = areadln(fd0)
	return s
end

function getbody(fd0,len)
	-- we return the xfer data
	-- dprint("getbody")
	local u
	local n = 0
	local s = ""
	local size=16*1024
	if len == nil then
		-- no length given: read until EOF
		u = aread(fd0,size)
		while u do
			n = n + #u
			s = s..u
			u = aread(fd0,size)
		end
		return s
	end
	if type(len) == "string" then
		if len ~= "chunked" then
			error("Transfer-Encoding")
		end
		-- dprint("--- Chunked ---")
		return s..getchunk(fd0)
	end
	if len > 0 then
		-- the transfer data might be big
		local m = len
		while m > size do
			u = areadn(fd0,size) -- #u == size
			s = s..u
			n = n + #u
			m = m - #u
		end
		u = areadn(fd0,m)
		s = s..u
		n = n + #u
		if len ~= n then
			-- this case happens in the response from HTTP/1.1 servers with buggy header
			-- no "Content-Length" with content of size n > 0
			error("Content-Length")
		end
	end
	return s
end

function hbody(data,hd)
	local len,f,n,p,r,b,host,path,s,suffix
	path = hd._path
	host = hd._host
	len = getlen(hd)
	print("DBG hbody len",len)
	s = getbody(data,len)
	-- dbg("#s",#s)
	if sub(hd._status,1,1) ~= "2" then
		return nil
	end
	if hd["Content-Disposition"] then
		b = basename(path)
		p = b.."/"..match(hd["Content-Disposition"],'filename="([^"]+)"')
	else
		if hd["Content-Encoding"] == "gzip" then
			suffix = "html.gz"
		else
			suffix = "html"
		end
		p = path
		if sub(p,-1) == "/" then
			p = p.."index."..suffix
		end
	end
	p = gsub("hgot/"..host..p,"//+","/")
	b = basename(p)
	dir.mkdirs(b)
	f,r = xio.open(p,"w+")
	if r then
		error(r)
	end
	f:write(s)
	f:close()
	-- print("saved to: "..p)
	return p
end

function nummon(s)
	local t
	t =
		{["Jan"]=1,["Feb"]=2,["Mar"]=3,["Apr"]=4,["May"]=5,["Jun"]=6,
		["Jul"]=7,["Aug"]=8,["Sep"]=9,["Oct"]=10,["Nov"]=11,["Dec"]=12}
	return t[s]
end

function ns_to_isodate(s)
	-- convert Netscape style "Wdy, DD-Mon-YY HH:MM:SS GMT" to ISO style: 2011-03-12T01:40:19Z
	-- Note that we have many variants!
	-- google style: "expires=Sun, 13-Mar-2011 13:21:04 GMT" -- pure netscape style
	-- Yahoo style: "expires=Thu, 13 Mar 2008 13:14:20 GMT"
	-- amazon style: "expires=Fri Mar 20 07:00:00 2009 GMT"
	local t
	gsub(s,"%w+, (%d+)[- ](%w+)[- ](%d+) (%d+):(%d+):(%d+) GMT",function(day,mon,year,hour,min,sec)
		t = format("%04d-%02d-%02dT%02d:%02d:%02dZ",year,nummon(mon),day,hour,min,sec)
	end)
	if t then
		return t
	end
	gsub(s,"%w+ (%w+) (%d+) (%d+):(%d+):(%d+) (%d+) GMT",function(mon,day,hour,min,sec,year)
		t = format("%04d-%02d-%02dT%02d:%02d:%02dZ",year,nummon(mon),day,hour,min,sec)
	end)
	return t
end

function decookie(c,host,convert)
	-- decompose single cookie
	-- Ref RFC2109, RFC2965
	-- RFC2965 says: "the Netscape-style cookie MUST be discarded"
	-- but don't mind
	local m,n0,n1,d,t
	n0,n1 = find(c,";",1,true)
	m = match(sub(c,1,n0-1),"^ *(.+)")
	t = subfield(sub(c,n1+1))
	t.cookie = m
	-- we regulate date format to ISO style: yyyy-mm-ddThh:mm:ssZ
	-- ISO style is much easier to handle expire
	-- original format is: Wdy, DD-Mon-YY HH:MM:SS GMT
	if convert and t.expires then
		t.expires = ns_to_isodate(t.expires)
	end
	if t.path == nil then
		t.path = path
	end
	-- NOTE: we need security check to t.domain
	-- i.e., t.domain must match the accessing host
	if t.domain == nil then
		t.domain = host
	end
	d = t.domain
	if sub(host,-#d) ~= d then -- bad cookie
		print("WARNING: bad cookie")
		return nil
	end
	-- accept cookie only of d:
	if match(d,"%.[^.]+%....*$") or match(d,"%.[^.]+%.[^.]+%...$") then
		return t
	end
	return nil
end

function save_cookie(c,host)
	-- save cookie to cookie file
	local dom,f,cookie,t,ct,lines
	if c == nil then
		return nil
	end
	lines = split(c,"\n")
	t = ""
	for i,v in ipairs(lines) do
		ct =
			decookie(v,host,true)
		if ct then -- decookie returns nil for a rejected cookie
			dom = ct.domain
			if dom == nil then
				dom = host
			end
			if dom ~= host and sub(dom,1,1) ~= "." then
				dom = "."..dom
			end
			cookie = ct.cookie
			if ct.expires then
				c = cookie.."; "
				ct.cookie = nil
				for k,v in pairs(ct) do
					c = c..k.."="..v .. "; "
				end
				f = io.open(cookies.."/"..dom,"a")
				c = gsub(c,"; $","")
				f:write(c.."\n")
				f:close()
				print("cookie: saved to: "..cookies.."/"..dom)
			end
			-- print("save_cookie:cookie:",cookie)
			t = t..cookie.."; "
		end
	end
	t = gsub(t,"; $","")
	return t
end

function get_cookie(host,path)
	-- get relevant cookies from cookie file
	-- example: we have two cookie files for yahoo, i.e., ".yahoo.com" and "www.yahoo.com"
	-- we have multiple cookie lines in each file
	local function read_cookie(f)
		print("cookie: reading from: "..f)
		local d,t,u,c,p
		d = dav.date("iso",os.time())
		-- we have duplicated or redundant cookies in the file
		t = {}
		for line in io.lines(f) do
			c = decookie(line,host)
			if c then -- skip lines that decookie rejects
				p = c.path
				if c.expires and c.expires > d and sub(path,1,#p) == p then
					t[c.cookie] = true
				end
			end
		end
		u = ""
		for k,v in pairs(t) do
			u = u .. k .. "; "
		end
		u = gsub(u,"; $","")
		return u
	end
	local b,e,s,t,u
	s = host
	t = ""
	-- walk from the full host name up through its parent domains
	while s do
		--print(s)
		if test("-e", cookies.."/"..s) then
			u = read_cookie(cookies.."/"..s)
			if u ~= "" then
				-- separate the cookies of one file from those of the next
				if t ~= "" then
					t = t .. "; "
				end
				t = t .. u
			end
		end
		b,e = find(s,".", 2, true)
		if b then
			s = sub(s,b)
		else
			s = nil
		end
	end
	if t ~= "" then
		return t
	else
		return nil
	end
end

function add_cookie(cookie, x)
	local k,v
	local t = {}
	if cookie == nil then
		cookie = ""
	end
	if x == nil then
		x = ""
	end
	cookie = cookie .. ";"..x
	-- print("add_cookie:cookie:"..cookie)
	-- collect the cookies as table keys so that duplicates collapse
	gsub(cookie,"([^%s;]+)",function(a)
		t[a] = true
	end)
	cookie = ""
	for k,v in pairs(t) do
		-- print("k="..k)
		cookie = cookie .. k ..
			"; "
	end
	cookie = gsub(cookie,"; $","")
	if cookie == "" then
		return nil
	end
	return cookie
end

-- hget1: get single page
function hget1(url,data,head,post)
	-- url: http://host:port/path
	local req,resh,hd,cookie,saved
	--print("DBG: hget",url)
	if data == nil or data.fd == nil then
		data,head = hdial(url,head)
	end
	if data == nil then
		print("#ERROR: dial:",url,head._host,head._path)
		return url,data,head
	end
	-- host and path are globals: decookie reads path as the default cookie path
	host = head._host
	path = head._path
	print("DBG: hget1 path:",path)
	head.Cookie = add_cookie(head.Cookie,get_cookie(host,path))
	if post then
		head["Content-Type"] = "application/x-www-form-urlencoded"
		req = mkpostreq(head,post)
	else
		req = mkgetreq(head)
	end
	print(req) -- basic info, request header
	data:write(req)
	resh = getheader(data) -- response header
	print(resh) -- basic info, response header
	hd = getfields(resh)
	-- this is the place for authentication
	-- print(hd._status) -- 401
	-- print(hd._reason) -- Unauthorized
	hd._path = head._path
	hd._host = head._host
	print("DBG: hget1 _status:",hd._status)
	if sub(hd._status,1,1) == "3" then
		-- redirect: follow Location; url becomes nil if the header is
		-- missing, which ends the loop in hget
		url = hd["Location"]
	end
	-- we need to sweep out the body part even on error
	-- hbody returns nil in that case
	saved = hbody(data,hd)
	if saved then
		if match(saved,"/index%.html%.gz") then
			saved = sub(saved,1,-4)
			os.execute("/bin/gunzip -c "..saved..".gz>"..saved)
		end
	end
	head._saved = saved
	print("saved to:",saved)
	if hd["Connection"] == "close" or sub(hd._status,1,1) ~= "2" then
		-- some hosts redirect us to other host without "CONNECTION: close"
		-- the example: "www.google.com" -> "www.google.co.jp"
		if data.ctl then
			p9.close(data.ctl)
		end
		data:close()
		data = nil
	end
	head.Cookie = add_cookie(head.Cookie,save_cookie(hd["Set-Cookie"],host))
	-- print("cookie: passed: ",head.Cookie)
	head.Referer = url
	head._status = hd._status
	print("DBG: hget1 data:",data)
	return url,data,head
end

function hget(url,data,head,post)
	local p = head._path
	-- repeat the request while the server keeps redirecting us
	repeat
		url,data,head = hget1(url,data,head,post)
	until url == nil or sub(head._status,1,1) ~= "3"
	return
		url,data,head
end

-- wget: get references that url does not begin with "%w+://",
-- that is in the same host
function wget(url,data,head,recur)
	local u,ext,st1,n,m,p,q,s,saved
	print("DBG: wget url path",url,head._path)
	url,data,head = hget(url,data,head)
	st1 = sub(head._status,1,1)
	print("DBG: wget st1",st1)
	if st1 ~= "2" then
		return url,data,head
	end
	saved = head._saved
	print("DBG: saved",saved)
	if not match(saved,".+%.html") then
		return url,data,head
	end
	s = readf(saved)
	p = head._path
	s = gsub(s,"<!%-%-.-%-%->","") -- chop comment
	s = gsub(s,"]%.-","")
	gsub(s, "<([^>]+)>",function(t)
		gsub(t,'href="([^"]+)"', function(a)
			print("DBG wget href",url,p,a)
			if sub(a,1,1) == "#" or match(a,"%w+://") then
				return url,data,head
			end
			print("DBG: wget a:",a)
			if sub(a,1,1) == "/" then
				if sub(a,1,#base) == base then
					head._path = a
				else
					return url,data,head
				end
			else
				head._path = gsub(basename(p).."/"..a,"(//+)","/")
			end
			q = head._path
			ext = match(a,".+%.(%w+)$")
			if ext then
				ext = lower(ext)
			end
			if ext == nil or htmext[ext] then
				if wgot[q] == nil then
					-- mark before recursing so that pages linking to
					-- each other are not fetched endlessly
					wgot[q] = true
					wget(url,data,head,recur)
				end
			else
				if hgot[q] == nil then
					hget1(url,data,head)
					hgot[q] = true
				end
			end
		end)
		gsub(t,'src="([^"]+)"',function(a)
			print("DBG wget src:",url,a)
			if match(a,"%w+://") then
				return url,data,head
			end
			if sub(a,1,1) == "/" then
				head._path = a
			else
				head._path = gsub(basename(p).."/"..a,"(//+)","/")
			end
			q = head._path
			if hgot[q] == nil then
				hget1(url,data,head)
				hgot[q] = true
			end
		end)
	end)
	return url,data,head
end

hgot = {}
wgot = {}
http.timeout = 30
htmext = {["html"] = true, ["htm"] = true}
pmap = {["http"]="80",["https"]="443"}
cookies = "cookies" -- cookie dir.
ua1 = "Plan9/phget"
ua2 = "User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en; rv:1.8.1.14) Gecko/20080512 Camino/1.6.1 (like Firefox/2.0.0.14)"
ua3 = "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; ja-JP-mac; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3"

head = {
	["User-Agent"] = ua3,
	["Content-Type"] = "text/html; charset=UTF-8",
	["Accept"] = "text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5",
	["Accept-Encoding"] = "gzip,deflate",
	["Accept-Language"] = "ja,en-US;q=0.9,en;q=0.8,fr;q=0.7,de;q=0.6,es;q=0.5,it;q=0.4,nl;q=0.3,sv;q=0.2,nb;q=0.1",
	["Accept-Charset"] = "ISO-8859-1,utf-8;q=0.7,*;q=0.7",
	["Connection"] = "keep-alive",
	["Content-Length"] = "0",
}

arg,opt,r = argopt(arg,"rw:f")
if r or #arg == 0 then
	print("usage: phget [-rw] [-f file] url")
	os.exit()
end
if opt and opt.f then
	file = opt.f
	dofile(file)
	os.exit()
end
url = arg[1]
if opt and opt.r then
	recur = true
end
base = match(url,"http.?://[^/]+(.*)")
if base == "" then
	base = "/"
end
if opt and opt.w then
	wget(url,data,head,recur)
	os.exit()
end
hget(url,data,head)