CGI Handler
LOCATION
$web/etc/handler
DESCRIPTION
The CGI handler was called the execution handler in older Pegasus manuals. The file “$web/etc/handler” is a simple table that maps patterns of requested URL paths to actions.
This mechanism is used to define the form of CGI files, SSI (Server Side Includes) and the auto-indexing service for specific directories.
Path in URI
The URL syntax of RFC 2616 is
    http_URL = "http:" "//" host [ ":" port ] [ abs_path [ "?" query ]]
For the URL
    http://host:port/path/to/document?query
abs_path is /path/to/document, and the httpd looks for the document at
    /doc/path/to/document
If the URL is to a user (say alice), that is,
    http://host:port/~alice/path/to/document?query
abs_path is /~alice/path/to/document.
Alice's documents are in a name space separate from that of the real host documents.
The httpd will look for the document at
    /doc/path/to/document
in that name space. We call /path/to/document (the path below /doc) the request path, and denote it by $request.
If the request path ends with “/”, then Pegasus internally appends index.html. We call the resulting path the effective request path.
This does not mean that the two URLs
    http://host/path/to/foo/
    http://host/path/to/foo/index.html
are equivalent: they are not when /path/to/foo is not a directory (that is, when it is a file or does not exist).
Configuration
The following is the content of my configuration (http://plan9.aichi-u.ac.jp).
    # path                 mimetype    hctl  execpath       arg ...
    /netlib/*/index.html   text/html   0     /bin/ftp2html
    /printenv/*            text/plain  0     /bin/printenv  $target
    *.http                 -           1     $target
    *.cgi                  text/html   +     $target
    *.html                 text/html   0     $target
Fig.1: CGI handler of Pegasus.
The first field is a path pattern, the second field is the default MIME type, the third field is the level of control the script has over the HTTP headers, and the fourth field is the path to a script. The fourth field may be followed by arguments to the script.
Path patterns are compared with the effective request path.
The comparison proceeds from the top line down and stops at the first pattern that matches.
In a path pattern, the directory separator “/” is not special. (Therefore this pattern matching is not the same as that of the shell.) There is one exception: the pattern “/*/” also matches “/”. Therefore the pattern
    /netlib/*/README
matches
    /netlib/README
as well as /netlib/cmd/rit/README, for example.
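The lookup described above can be sketched in Python. This is only an illustration, not the Pegasus implementation: it assumes that “*” is the only wildcard, that fields are separated by white space, and that a pattern must match the whole effective request path. The names pattern_to_regex, parse_handler and find_handler are hypothetical.

```python
import re

# Copy of the table in Fig.1; the parsing rules below are assumptions.
HANDLER_TABLE = """\
# path                 mimetype    hctl  execpath       arg ...
/netlib/*/index.html   text/html   0     /bin/ftp2html
/printenv/*            text/plain  0     /bin/printenv  $target
*.http                 -           1     $target
*.cgi                  text/html   +     $target
*.html                 text/html   0     $target
"""


def pattern_to_regex(pattern):
    """Translate a handler pattern into a compiled regular expression.

    "*" matches any run of characters, "/" included, and the
    sequence "/*/" may also match a single "/".
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("/*/", i):
            parts.append("/(?:.*/)?")  # "/", or "/anything/"
            i += 3
        elif pattern[i] == "*":
            parts.append(".*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts), re.DOTALL)


def parse_handler(text):
    """Return a list of (pattern, mimetype, hctl, command) tuples."""
    rules = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comments
        pattern, mimetype, hctl = fields[:3]
        rules.append((pattern, mimetype, hctl, fields[3:]))
    return rules


def find_handler(rules, erpath):
    """Return the first rule matching the effective request path, or None."""
    for rule in rules:
        if pattern_to_regex(rule[0]).fullmatch(erpath):
            return rule
    return None


rules = parse_handler(HANDLER_TABLE)
print(find_handler(rules, "/netlib/index.html"))
print(find_handler(rules, "/foo/bar.cgi"))
```

Because the search stops at the first match, the order of lines in the table matters: a request for /netlib/index.html is caught by the /netlib line before the general *.html line is tried.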
The second field gives the default value of the HTTP header “Content-Type”. If the field is “-”, the script must set the header.
The third field, named “hctl”, takes the values ‘1’, ‘+’ and ‘0’, which give the level of control the script has over the HTTP headers. The meanings are
    1   full control by the script
    +   partial control by the script
    0   no control by the script
If ‘1’ is specified, the script is responsible for writing all HTTP headers; such a script is called non-parsed CGI in CGI/1.1. The HTTP headers must be separated from the body by a single blank line: a line that contains only the “\n” code.
If ‘0’ is specified, the script must not write HTTP headers. The headers are provided by httpd. The output style should be
    <!DOCTYPE html>
    <html>
    ...
    </html>
In Fig.1, the third line, which starts with /printenv, is combined with the script below.
    #!/bin/rc
    rfork e
    echo 'ARGUMENT'
    for(x in $*)
        echo $x
    echo
    echo 'ENVIRONMENT'
    for(x in `{ls -p /env}){
        if(test -r /env/$x)
            echo $x `{cat /env/$x}
    }
Fig.2: /bin/printenv
This script may be useful when writing CGI scripts under Pegasus.
If ‘+’ is specified, the script may write HTTP headers in compliance with CGI/1.1. The typical output style is
    Content-Type: text/html
    Status: 200 OK

    <!DOCTYPE html>
    <html>
    ...
    </html>
Fig.3: script example.
Note that:
- Pegasus CGI with hctl="+" and mimetype="-" corresponds to Apache CGI. However, they are not the same: in this case Pegasus CGI is much more capable than Apache CGI.
- You need not separate the headers by “\r\n”. Unix-style “\n” is OK. Pegasus takes care of the separators.
- If some headers are absent from the script output, Pegasus automatically adds them if necessary.
- If “Content-Type” is absent, the default value in the mimetype field is supplied.
- If some headers are present in the script output, those headers are used in the response. The exception is “Content-Length”.
- Even if “Content-Length” is present in the script output, the value is ignored and replaced by a computed one.
- An empty header block is allowed. In that case the output must start with the empty line that separates the header block from the body.
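The header rules in the list above can be modelled as follows. This is a sketch of the described behaviour, not code taken from Pegasus; the function names are hypothetical, and treating a “Status” header as the source of the status line is an assumption based on CGI/1.1.

```python
def split_headers(output):
    """Split raw script output (bytes) into a header list and the body."""
    headers = []
    pos = 0
    while True:
        end = output.find(b"\n", pos)
        if end < 0:
            raise ValueError("no blank line after the header block")
        line = output[pos:end].rstrip(b"\r")  # accept "\n" and "\r\n"
        pos = end + 1
        if not line:
            return headers, output[pos:]  # blank line: the body follows
        name, sep, value = line.partition(b":")
        if not sep:
            raise ValueError("malformed header line: %r" % line)
        headers.append((name.strip().decode("ascii"),
                        value.strip().decode("ascii")))


def merge_cgi_output(output, default_mimetype):
    """Return (status, headers, body) for a script run with hctl="+"."""
    headers, body = split_headers(output)
    status = "200 OK"
    merged = []
    seen = set()
    for name, value in headers:
        key = name.lower()
        if key == "status":
            status = value  # assumption: becomes the status line
        elif key != "content-length":
            merged.append((name, value))  # headers from the script are kept
            seen.add(key)
    if "content-type" not in seen and default_mimetype != "-":
        merged.append(("Content-Type", default_mimetype))
    # A Content-Length written by the script is always recomputed.
    merged.append(("Content-Length", str(len(body))))
    return status, merged, body


raw = b"Content-Length: 999\n\n<html>...</html>\n"
print(merge_cgi_output(raw, "text/html"))
```

An output that begins with an empty line (an empty header block) simply yields the default Content-Type and the computed Content-Length.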
NB: Pegasus has a bug in the handling of the default mimetype for hctl="+". That is, Fig.3 fails if mimetype="text/html", but works if mimetype="-". The script should work even if mimetype="text/html". This bug will be fixed in the next release (Pegasus 2.8a).
Another example is shown below.
    Set-Cookie: cookie=something; expires=Sun, 6-Aug-2006 11:43:57 GMT; domain=ar.aichi-u.ac.jp; path=/test4; secure

    <html>
    <head>
    <title>Cookie sample</title>
    </head>
    <body>
    ...
    </body>
    </html>
The reserved word $target, in or after the fourth field, denotes the absolute path (in the httpd name space) of the requested document. That is, $target is the effective request path prefixed with “/doc”.
The fourth field is the path to the executable program that handles the request. Note that $target in the fourth field means that the file at the effective request path is itself the executable program.
The second line in Fig.1, which begins with /netlib, is for
    http://plan9.aichi-u.ac.jp/netlib/
Note 1: In the old days, this directory was used for FTP service.
Other servers such as Apache have an option to show a directory index if index.html is absent. ftp2html does the same but also much more: if a README file is present then its content is shown, and if an INDEX file is present then its content is shown with the appropriate action tag attached to each index label.
Enhanced control “*”
To support WebDAV, the symbol “*” was introduced in the third field of the handler for scripts that must handle all methods (Pegasus 2.4).
Thus, with the following configuration
    /dav    - * /bin/foo
    /dav/*  - * /bin/foo
requests of every method to http://host/dav and to anything below it are handled by /bin/foo.
The meaning of the symbol “*” is the same as that of “+”, except that the script must handle all methods.
The meanings of the other symbols (“0”, “1” and “+”) are unchanged. Only requests with the HEAD, GET and POST methods go to these scripts. Other requests are handled by Pegasus and are rejected, except for OPTIONS. You need not handle the HEAD method in these scripts, because that request is handled by Pegasus.
In summary, the differences in meaning among the symbols in the third field are listed in the following table.
CGI type | limited methods | all methods |
---|---|---|
simple cgi | 0 | |
cgi/1.1 | + | * |
non-parsed cgi | 1 | |
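The routing summarized in the table can be written as a small decision function. This is a sketch of one reading of the text above, not Pegasus code; the name route_request and the three result labels are hypothetical.

```python
# Methods that reach a script whose hctl is "0", "1" or "+".
SCRIPT_METHODS = {"GET", "HEAD", "POST"}


def route_request(hctl, method):
    """Return who deals with the request: "script", "server" or "reject"."""
    if hctl == "*":
        return "script"  # the script must handle every method itself
    if hctl not in ("0", "1", "+"):
        raise ValueError("unknown hctl value: %r" % hctl)
    if method in SCRIPT_METHODS:
        # HEAD is listed here, but the script need not treat it
        # specially: the server takes care of HEAD semantics.
        return "script"
    if method == "OPTIONS":
        return "server"  # answered by the server, not rejected
    return "reject"


for hctl, method in [("+", "GET"), ("0", "PUT"), ("1", "OPTIONS"), ("*", "PROPFIND")]:
    print(hctl, method, route_request(hctl, method))
```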
Files that begin with “.”
Dot files (files whose names begin with “.”) have been specified as “accessible only via CGI”. Now this specification applies only to the GET, HEAD and POST methods.
WebDAV must be able to handle all files, including dot files. Therefore “*” in the third field of the handler also means that dot files are accepted.
Access to dot files from Mac OS X clients is annoying and makes the client respond sluggishly. How can this access be prevented? You will find some tips on the topic at the following URI:
    http://lists.apple.com/archives/Spotlight-dev/2006/Jun/msg00008.html
I don't know how to prevent access to resource forks (files whose names begin with “._”).
Ramfs
A ramdisk is always provided to the script, and it vanishes automatically as soon as the script finishes or is terminated.
A special file “...” is used internally to compute the Content-Length of the CGI output.
You need not compute Content-Length in your CGI program for HTTP/1.1.
X-CGI-Pass
An extended CGI header, “X-CGI-Pass”, has been added. The header is really useful for scripts because it enables them to pass the request to the host server.
Writing code to answer GET requests is bothersome. Why must we write that code? Servers already have the ability to answer such requests!
example
    if(! ~ $request */){
        echo X-CGI-Pass: /doc$request
        echo
        exit
    }
The specification
The CGI header
    X-CGI-Pass: /baz
passes the request to the server, with /baz as the file to be served.
If “/baz” is equal to $target, you may omit the name:
    X-CGI-Pass:
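The same idea as the rc example can be sketched in Python, for a script configured with hctl “+” or “*”. It relies on the environment variable request listed under ENVIRONMENT VARIABLES; the function names and the placeholder response for directory requests are hypothetical.

```python
import os
import sys


def pass_header(request):
    """Return the header block that hands the request back to the server,
    or None when the script should produce the response itself."""
    if request.endswith("/"):
        return None  # directory request: the script answers it
    # A request for a plain file: let the server send /doc$request.
    return "X-CGI-Pass: /doc%s\n\n" % request


def main(environ=os.environ, out=sys.stdout):
    header = pass_header(environ.get("request", "/"))
    if header is not None:
        out.write(header)
        return
    # Placeholder body, standing in for whatever the script really does.
    out.write("Content-Type: text/plain\n\n")
    out.write("response generated by the script\n")


if __name__ == "__main__":
    main()
```

Keeping the decision in a separate function makes it possible to check the header text without running the script under the server.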
Comparison with Apache CGI
If “text/html” is specified for mimetype and the hctl value is ‘0’, then the format of the CGI output is:
    <!DOCTYPE html>
    <html>
    ...
    </html>
That is, do not start with “Content-Type:” as Apache requires:
    Content-Type: text/html

    <!DOCTYPE html>
    <html>
    ...
    </html>
Apache-type CGI is also supported. In Fig.1, the line for files with the suffix “cgi” configures CGI/1.1 for those files.
Error handling in CGI program
If “text/html” is specified for “mimetype”, Pegasus automatically sends the HTTP headers to the client. The response status then follows these rules:
- “200 OK” is sent if no exit status is given.
  The connection is kept.
- If the exit status has the form of an HTTP response status, Pegasus passes it to the client.
  The connection is kept if the code number is 100 to 299; otherwise the connection is closed.
- If the exit status does not have the form of an HTTP response status, Pegasus sends “500 Internal Error” and closes the connection.
The handling of the connection can be given explicitly by writing “keep” or “close” after “#”:
    exit '403 Forbidden # keep'
Both stdout and stderr are passed to the client.
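The exit status rules can be modelled as a small function. This is a sketch under assumptions, not Pegasus code: it assumes that a valid status has the form of a three-digit code, a space and a reason phrase, and the name interpret_exit_status is hypothetical.

```python
import re

# Assumed form of a valid status: three digits, a space, a reason phrase.
STATUS_RE = re.compile(r"(\d{3}) (\S.*)")


def interpret_exit_status(status):
    """Return (status_line, keep_connection) for an exit status string."""
    if not status:
        return "200 OK", True  # no exit status given
    text, _, directive = status.partition("#")
    text = text.strip()
    match = STATUS_RE.fullmatch(text)
    if match is None:
        return "500 Internal Error", False
    code = int(match.group(1))
    keep = 100 <= code <= 299  # default connection handling
    directive = directive.strip()
    if directive == "keep":
        keep = True
    elif directive == "close":
        keep = False
    return text, keep


for s in ["", "403 Forbidden", "403 Forbidden # keep", "oops"]:
    print(repr(s), interpret_exit_status(s))
```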
ENVIRONMENT VARIABLES
Pegasus has many environment variables. However, most of them are only experimental. The stable variables are the following:
    AUTH_TYPE
    CONTENT_LENGTH
    CONTENT_TYPE
    GATEWAY_INTERFACE
    PATH_INFO
    PATH_TRANSLATED
    QS_name            # the name is the name part in QUERY_STRING (see Note 1)
    QUERY_STRING
    REMOTE_ADDR
    REMOTE_HOST
    REMOTE_USER
    REQUEST_METHOD
    REQUEST_URI
    REQUEST_USER
    SCRIPT_NAME
    SERVER_NAME
    SERVER_PORT
    SERVER_PROTOCOL
    SERVER_SOFTWARE
    HTTP_URI
    HTTP_SCHEME
    HTTP_HOST
    HTTP_REFERER
    HTTP_USER_AGENT
    HTTP_HEADER
Additionally we have
    request   # requested path (see Note 2)
    home      # /doc
    query     # same as QUERY_STRING
    target    # requested path from document root (see Note 3)
    name      # basename of target
    hpid      # pid of httpd that invoked the current script
Note 1: The query string is automatically decoded by the httpd. For example, the query
    members&children&name=alice&age=16
is decoded into
    QS_=(members children)
    QS_name=alice
    QS_age=16
Note 2: The path in request might end with “/” if it is a directory. On the other hand, target is the file that is effectively requested. In the notation of rc, target is
    target = $request             # request to a file
    target = $request/index.html  # request to a directory
Note 3: The name “target” as an environment variable is confusing because the same name is used in the handler with a different meaning. Therefore this name should become obsolete in the future.
Note 4: Environment variables starting with “HTTP_” are generated from the key:val pairs in the HTTP request header. Keys are case insensitive. The current RFC states that a key may contain any printable ASCII character except “:”. However, allowing special characters is a potential risk in handling incoming requests. Note that all keys currently registered with IANA consist only of alphanumeric characters and ‘-’. Therefore, in generating environment variables, Pegasus-2.9 allows only keys of the IANA form, converts them to uppercase and, in addition, converts ‘-’ to ‘_’. The latter translation makes keys easy to handle in shell scripts. This conversion rule may or may not change in the future.
The current working directory of the invoked CGI program is the directory where the target is located.
Other environment variables might be discarded or renamed in the future.
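The conversions described in Note 1 and Note 4 can be sketched as follows. This is an illustration only: the function names are hypothetical, rc lists are modelled as Python lists, and decoding “+” as a space in the query string is an assumption.

```python
import re
from urllib.parse import unquote_plus

# Keys of the "IANA form" in Note 4: alphanumeric characters and "-".
IANA_KEY = re.compile(r"[A-Za-z0-9-]+")


def decode_query(query):
    """Return a dict of QS_* variables; each value is a list, like an rc list."""
    variables = {}
    for item in query.split("&"):
        if not item:
            continue
        name, sep, value = item.partition("=")
        if sep:
            key = "QS_" + unquote_plus(name)
            variables.setdefault(key, []).append(unquote_plus(value))
        else:
            # A bare word has no name part and is collected under "QS_".
            variables.setdefault("QS_", []).append(unquote_plus(item))
    return variables


def header_variable(key):
    """Return the HTTP_* variable name for a request header key, or None."""
    if not IANA_KEY.fullmatch(key):
        return None  # keys with other characters are not exported
    return "HTTP_" + key.upper().replace("-", "_")


print(decode_query("members&children&name=alice&age=16"))
print(header_variable("User-Agent"))
```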
INTERNAL FLOW
    erpath=$request
    if(test -d $erpath){
        if(! ~ $erpath */){
            redirect $erpath/
            # which means we begin from the first by substituting
            # request=$erpath/
        }
    }
    if(~ $erpath */)
        erpath=$erpath^index.html
    access_check $erpath
    handler $erpath
    send /doc$erpath
The first field of the handler is compared with $erpath.
$target in the handler is /doc$erpath.
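The first part of the flow, up to the effective request path, can be restated in Python for readers unfamiliar with rc. This is a sketch only: resolve and target_of are hypothetical names, and the directory test is passed in as a function so that the example does not touch the file system.

```python
def resolve(request, is_directory):
    """Return ("redirect", path) or ("serve", effective_request_path)."""
    if is_directory(request) and not request.endswith("/"):
        # The client is asked to retry with "/" appended.
        return "redirect", request + "/"
    erpath = request
    if erpath.endswith("/"):
        erpath += "index.html"
    return "serve", erpath


def target_of(erpath):
    """Return $target as used in the handler: the path below /doc."""
    return "/doc" + erpath


def fake_is_directory(path):
    # Stand-in for a real directory test; only /path/to/foo is a directory.
    return path.rstrip("/") == "/path/to/foo"


print(resolve("/path/to/foo", fake_is_directory))
print(resolve("/path/to/foo/", fake_is_directory))
```

The access check and the handler lookup would then be applied to the effective request path returned in the "serve" case.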
CGI/1.1
If script name is in URI
Let “foo” be an executable file. The values of the related variables are shown below for the requests
    http://host/foo/?bar
    http://host/~alice/foo/?bar
item | request to host document | request to user's document | decoded? | specified by |
---|---|---|---|---|
HTTP URI | http://host/foo/?bar | http://host/~alice/foo/?bar | | HTTP/1.1 |
$HTTP_SCHEME | http | http | NO | Pegasus |
$HTTP_HOST | host | host | NO | Apache |
$REQUEST_URI | /foo/?bar | /~alice/foo/?bar | NO | Apache |
$REQUEST_USER | | alice | YES | Pegasus |
$PATH_INFO | / | / | YES | CGI/1.1 |
$PATH_TRANSLATED | /doc/ | /doc/ | YES | CGI/1.1 |
$SCRIPT_NAME | /foo | /~alice/foo | YES | CGI/1.1 |
$QUERY_STRING | bar | bar | NO | Apache |
$request | /foo/ | /foo/ | YES | Pegasus |
- The requested directory need not exist.
- The first field in the handler is compared with $request.
- If the request is to a directory and the URI does not end with “/”, then Pegasus redirects the client to the URI with “/” appended.
If script name is not in URI
The CGI handler, or execution handler, of Pegasus is powerful. For example, we can configure it like this:
    /foo/* - + /bin/baz
With this configuration, a request to Pegasus such as
    http://host/foo/?bar
contains no script name in the URI.
What, then, should the values of these environment variables be? The answer is unclear.
The CGI/1.1 specification says that the concatenation $SCRIPT_NAME$PATH_INFO must be the decoded path part of the URI.
Therefore the values are assigned as shown below.
item | request to host document | request to user's document | decoded? | specified by |
---|---|---|---|---|
HTTP URI | http://host/foo/?bar | http://host/~alice/foo/?bar | | HTTP/1.1 |
$PATH_INFO | /foo/ | /~alice/foo/ | YES | CGI/1.1 |
$PATH_TRANSLATED | /doc/foo/ | /doc/foo/ | YES | CGI/1.1 |
$SCRIPT_NAME | | | YES | CGI/1.1 |
Handling of POST data
POSTed data is first received by the server from the client; Content-Length is checked by the server while it receives the data. The server then passes the data to the CGI program on stdin.
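On the script side, reading the POSTed data then amounts to reading CONTENT_LENGTH bytes from stdin. A minimal sketch, assuming a Python CGI program; the name read_post_data is hypothetical, and an in-memory stream stands in for stdin so that the example runs anywhere.

```python
import io


def read_post_data(environ, stream):
    """Read exactly CONTENT_LENGTH bytes of POST data from a binary stream."""
    try:
        length = int(environ.get("CONTENT_LENGTH", "0"))
    except ValueError:
        length = 0  # treat a malformed value as "no data"
    if length <= 0:
        return b""
    data = stream.read(length)
    if len(data) != length:
        raise IOError("expected %d bytes of POST data, got %d"
                      % (length, len(data)))
    return data


# Demonstration with an in-memory stream in place of stdin.
demo = read_post_data({"CONTENT_LENGTH": "10"}, io.BytesIO(b"name=alice"))
print(demo)
```

In a real script the call would be read_post_data(os.environ, sys.stdin.buffer), with os and sys imported.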
CGI TIMEOUT
Global Setting
A timeout is defined to prevent buggy programs from waiting for data for too long. The value can be specified in /sys/lib/httpd.conf. The default is 5 seconds. I think this value is sufficient because the data is already held by the server.
For Each CGI
Some CGI programs take a long time to complete their task, and the time depends on the program. Therefore I enabled dynamic resetting of the CGI timeout for each CGI program.
An environment variable “hpid” was introduced for this purpose.
It holds the pid of the Pegasus process in service.
example 1
At the start of the service, execute
    echo -n timeout 180 > /proc/$hpid/note
example 2
The example below, written in Python, is extracted from a script on my server.
    import os

    def settimeout(n):
        # Ask the httpd that invoked this script to set the CGI timeout to n seconds.
        note = "/proc/%d/note" % hpid
        try:
            f = open(note, "w")
        except IOError:
            # open() raises an exception on failure; it never returns None
            print("unable to open %s" % note)
            print("timeout is not set")
            return
        f.write("timeout %d" % n)
        f.close()

    e = os.environ
    hpid = 0
    if "hpid" in e:
        hpid = int(e["hpid"])
    if hpid:
        settimeout(180)
    # heavily loaded tasks continue here