Friday, May 22, 2009

Regular expressions in mod_rewrite

Text
. Any single character
[chars] Character class: any one character of "chars"
[^chars] Character class: any one character not in "chars"
text1|text2 Alternation: text1 or text2

Quantifiers
? 0 or 1 occurrence of the preceding text
* 0 or more occurrences of the preceding text
+ 1 or more occurrences of the preceding text

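Grouping
(text) Grouping of text
(commonly used to set the boundary of an alternation, or to create backreferences:
in a RewriteRule the Nth group can be referenced as $N)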


Anchors
^ Start-of-line anchor
$ End-of-line anchor

Escaping
\c Escape the given character c
(for instance to match the literal characters ".[]()" and so on)
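
As an illustration only, the constructs listed above can be tried out with Python's re module, which treats these basic constructs the same way. The sample paths are made up, and \1 in Python plays the role that $1 plays in a RewriteRule substitution.

```python
import re

# Each entry: (pattern, subject, whether a match is expected)
CASES = [
    (r"^/a.c$", "/abc", True),                  # "." matches any single character
    (r"^/[uge]/", "/g/staff", True),            # character class
    (r"^/[^/]+$", "/a/b", False),               # negated class: no further "/" allowed
    (r"^/(foo|bar)$", "/bar", True),            # alternation bounded by a group
    (r"^/foo/?$", "/foo", True),                # "?": 0 or 1 occurrence
    (r"^/fo*$", "/f", True),                    # "*": 0 or more occurrences
    (r"^/fo+$", "/f", False),                   # "+": at least 1 occurrence
    (r"^/index\.html$", "/indexXhtml", False),  # an escaped dot is literal
]


def check(pattern, subject):
    """Return True when the pattern matches somewhere in the subject."""
    return re.search(pattern, subject) is not None


def backref_demo():
    """Reuse captured groups in the replacement, like $1 and $2 in a rule."""
    return re.sub(r"^/~([^/]+)/?(.*)", r"/u/\1/\2", "/~quux/foo.html")
```

Calling backref_demo() shows how the two groups are carried over into the rewritten path.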

18 Nov 2007, 16:40
Apache: Apache mod_rewrite rewrite redirect
by 曹宇偉

Apache URL Rewriting Guide


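Introduction to mod_rewrite
The Apache module mod_rewrite is a powerful but very complex module for manipulating URLs. With it you can handle nearly every URL-related problem; the price is the time it takes to understand its complex architecture. Beginners rarely grasp mod_rewrite straight away, and even Apache experts sometimes discover new uses for it.

In other words, mod_rewrite tends either to put you off for good after a first failed attempt or to win you over because of its power. This guide presents a number of proven examples for you to explore, rather than answering individual questions in FAQ style.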

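Practical Solutions
Many solutions are collected here, and more remain to be discovered; take the time to learn how mod_rewrite works from them.

Note: Because every server is configured differently, you may have to adapt the examples below before testing them, for example by adding the [PT] flag when mod_alias and mod_userdir are also in use, or by rewriting a ruleset for a .htaccess file instead of the main configuration file. Try to understand how each example works rather than memorising it.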


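URL Layout
Canonical URLs
Description:

On some web servers a resource has more than one URL. Usually one of them is the canonical URL (the one that is actually published), while the others are shortcuts or internal-only URLs. Whichever URL the user supplies with the request, they should end up seeing only the canonical one.

Solution:

We redirect every non-canonical URL to the canonical one. The example below replaces the non-canonical "/~user" with the canonical "/u/user" and adds a trailing "/" where it is missing.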

RewriteRule ^/~([^/]+)/?(.*) /u/$1/$2 [R]
RewriteRule ^/([uge])/([^/]+)$ /$1/$2/ [R]
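
A rough model of what these two rules do to a path, written in Python as a sketch: it applies each pattern in order and ignores the redirect flag handling, and the function name canonicalize is made up for the illustration.

```python
import re

# (pattern, replacement) pairs mirroring the two RewriteRules above;
# \1 and \2 stand in for $1 and $2.
RULES = [
    (r"^/~([^/]+)/?(.*)", r"/u/\1/\2"),
    (r"^/([uge])/([^/]+)$", r"/\1/\2/"),
]


def canonicalize(path):
    """Apply each rule in order to the current path and return the result."""
    for pattern, replacement in RULES:
        path = re.sub(pattern, replacement, path, count=1)
    return path
```

For instance, canonicalize("/~quux") and canonicalize("/u/quux") both end up at "/u/quux/".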


Canonical Hostnames
Description:

(omitted)

Solution:

# for sites running on a port other than 80
RewriteCond %{HTTP_HOST} !^fully\.qualified\.domain\.name [NC]
RewriteCond %{HTTP_HOST} !^$
RewriteCond %{SERVER_PORT} !^80$
RewriteRule ^/(.*) http://fully.qualified.domain.name:%{SERVER_PORT}/$1 [L,R]
# and for a site running on port 80
RewriteCond %{HTTP_HOST} !^fully\.qualified\.domain\.name [NC]
RewriteCond %{HTTP_HOST} !^$
RewriteRule ^/(.*) http://fully.qualified.domain.name/$1 [L,R]


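Moved DocumentRoot
Description:

The URL "/" usually maps onto the DocumentRoot, but the DocumentRoot is not always meant to be a single top-level directory; it may be just one of several data pools. On our intranet, for example, there are /e/www/ (the home of the WWW pages), /e/sww/ (the home of the intranet pages) and so on. Because all the web data lives under /e/www/, we have to make sure that inlined images inside it are still found correctly.

Solution:

We simply redirect "/" to "/e/www/". mod_rewrite handles this more cleanly than mod_alias: a URL alias only compares URL prefixes, whereas a redirect may involve another server and so need a different prefix (the prefix being already fixed by the DocumentRoot). mod_rewrite is therefore the better solution: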

RewriteEngine on
RewriteRule ^/$ /e/www/ [R]


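Trailing Slash Problem
Description:

Every webmaster has suffered from the trailing slash problem. If the slash is missing from a URL that refers to a directory, the server returns an error, because it takes /~quux/foo to mean a file called foo rather than a directory. In most cases Apache corrects this by itself, but sometimes you have to do it yourself, for example after several rounds of URL rewriting that end at a CGI script.

Solution:

The most obvious fix is to let Apache add the trailing "/" automatically. This must be an external redirect so that the browser requests subsequent files correctly. With an internal rewrite only the directory page itself would work; images included in it by relative URL would not be found. For instance, a request for image.gif from /~quux/foo/index.html would become /~quux/image.gif.

So we should use the following: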

RewriteEngine on
RewriteBase /~quux/
RewriteRule ^foo$ foo/ [R]


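The same approach also works from a .htaccess file, where it can cover every directory at once; keep in mind that per-directory settings take precedence over the main configuration file.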

RewriteEngine on
RewriteBase /~quux/
RewriteCond %{REQUEST_FILENAME} -d
RewriteRule ^(.+[^/])$ $1/ [R]


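Web Cluster through a Homogeneous URL Layout
Description:

All web servers in the cluster should share the same URL layout: whichever host receives a request, the user gets the same page, so URLs become independent of the server itself. The goal is a consistent, server-independent layout in which no URL has to name the physical target server; the cluster itself sends the request to the correct remote host.

Solution:

First, the servers need external map files that record where the site's users, groups and other entities live. They have the following format: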

user1 server_of_user1
user2 server_of_user2
: :
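We store these in files named map.xxx-to-host. Then we instruct every server to redirect URLs of the forms

/u/user/anypath
/g/group/anypath
/e/entity/anypath

to

http://physical-host/u/user/anypath
http://physical-host/g/group/anypath
http://physical-host/e/entity/anypath

When a server receives a URL that is not valid locally, the following ruleset uses the map files to send it to the right host (a URL with no entry in the map is redirected to server0):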

RewriteEngine on

RewriteMap user-to-host txt:/path/to/map.user-to-host
RewriteMap group-to-host txt:/path/to/map.group-to-host
RewriteMap entity-to-host txt:/path/to/map.entity-to-host

RewriteRule ^/u/([^/]+)/?(.*) http://${user-to-host:$1|server0}/u/$1/$2
RewriteRule ^/g/([^/]+)/?(.*) http://${group-to-host:$1|server0}/g/$1/$2
RewriteRule ^/e/([^/]+)/?(.*) http://${entity-to-host:$1|server0}/e/$1/$2

RewriteRule ^/([uge])/([^/]+)/?$ /$1/$2/.www/
RewriteRule ^/([uge])/([^/]+)/([^.]+.+) /$1/$2/.www/$3
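
The map lookup with a fallback, written ${user-to-host:$1|server0} in the rules above, can be pictured with the following Python sketch. The parser is a simplification of the txt: map format, and MAP_TEXT, parse_txt_map and physical_url are names invented for the illustration.

```python
# Hypothetical contents of map.user-to-host: one "key value" pair per line.
MAP_TEXT = """\
# comment lines are skipped
user1 server_of_user1
user2 server_of_user2
"""


def parse_txt_map(text):
    """Turn whitespace-separated key/value lines into a dictionary."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, value = line.split(None, 1)
        table[key] = value.strip()
    return table


def physical_url(table, user, rest):
    """Model ${user-to-host:user|server0}: fall back to server0 when unmapped."""
    host = table.get(user, "server0")
    return "http://%s/u/%s/%s" % (host, user, rest)
```

A user without an entry, such as physical_url(table, "nobody", "a.html"), is sent to server0.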


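Moving Home Directories to a New Web Server
Description:

Many webmasters face the following problem: during an upgrade, all user home directories are moved from the old server to a new one.

Solution:

mod_rewrite solves this easily: on the old server, redirect every /~user/anypath URL to http://newserver/~user/anypath.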

RewriteEngine on
RewriteRule ^/~(.+) http://newserver/~$1 [R,L]


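Structured Home Directories
Description:

Hosts with many users usually organise the home directories in a structured way: each one sits under a parent directory named after the first letter of the user name. For example, /~foo/anypath is /home/f/foo/.www/anypath, and /~bar/anypath is /home/b/bar/.www/anypath.

Solution:

The following ruleset maps the URLs directly onto the file system.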

RewriteEngine on
RewriteRule ^/~(([a-z])[a-z0-9]+)(.*) /home/$2/$1/.www$3
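
The nested groups in this rule are easy to misread, so here is the same substitution as a small Python sketch (home_path is a made-up name). The outer group captures the whole user name and the inner group its first letter; as written, the pattern needs a user name of at least two characters.

```python
import re

# Outer group 1 = user name, inner group 2 = its first letter, group 3 = rest.
PATTERN = re.compile(r"^/~(([a-z])[a-z0-9]+)(.*)")


def home_path(url_path):
    """Map /~user/anypath to /home/<first letter>/<user>/.www/anypath."""
    return PATTERN.sub(r"/home/\2/\1/.www\3", url_path, count=1)
```

Paths that do not match, such as those with an upper-case user name, are returned unchanged.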


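Reorganising the File System
Description:

This is a hard-core example: RewriteRules are used to present an entire directory tree on the web without changing the existing directory structure. Background: net.sw is an archive of freely available Unix software, stored in the following structure: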

drwxrwxr-x 2 netsw users 512 Aug 3 18:39 Audio/
drwxrwxr-x 2 netsw users 512 Jul 9 14:37 Benchmark/
drwxrwxr-x 12 netsw users 512 Jul 9 00:34 Crypto/
drwxrwxr-x 5 netsw users 512 Jul 9 00:41 Database/
drwxrwxr-x 4 netsw users 512 Jul 30 19:25 Dicts/
drwxrwxr-x 10 netsw users 512 Jul 9 01:54 Graphic/
drwxrwxr-x 5 netsw users 512 Jul 9 01:58 Hackers/
drwxrwxr-x 8 netsw users 512 Jul 9 03:19 InfoSys/
drwxrwxr-x 3 netsw users 512 Jul 9 03:21 Math/
drwxrwxr-x 3 netsw users 512 Jul 9 03:24 Misc/
drwxrwxr-x 9 netsw users 512 Aug 1 16:33 Network/
drwxrwxr-x 2 netsw users 512 Jul 9 05:53 Office/
drwxrwxr-x 7 netsw users 512 Jul 9 09:24 SoftEng/
drwxrwxr-x 7 netsw users 512 Jul 9 12:17 System/
drwxrwxr-x 12 netsw users 512 Aug 3 20:15 Typesetting/
drwxrwxr-x 10 netsw users 512 Jul 9 14:08 X11/
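We want to make this archive public and present its directory hierarchy directly, but we do not want to change the existing structure to do so. Because the archive will also be available through FTP, we do not want to put any web pages or CGI scripts into it.

Solution:

The solution has two parts. The first is a set of CGI scripts that generate the directory pages. In this example the scripts and the archive are placed under /e/netsw/.www/: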

-rw-r--r-- 1 netsw users 1318 Aug 1 18:10 .wwwacl
drwxr-xr-x 18 netsw users 512 Aug 5 15:51 DATA/
-rw-rw-rw- 1 netsw users 372982 Aug 5 16:35 LOGFILE
-rw-r--r-- 1 netsw users 659 Aug 4 09:27 TODO
-rw-r--r-- 1 netsw users 5697 Aug 1 18:01 netsw-about.html
-rwxr-xr-x 1 netsw users 579 Aug 2 10:33 netsw-access.pl
-rwxr-xr-x 1 netsw users 1532 Aug 1 17:35 netsw-changes.cgi
-rwxr-xr-x 1 netsw users 2866 Aug 5 14:49 netsw-home.cgi
drwxr-xr-x 2 netsw users 512 Jul 8 23:47 netsw-img/
-rwxr-xr-x 1 netsw users 24050 Aug 5 15:49 netsw-lsdir.cgi
-rwxr-xr-x 1 netsw users 1589 Aug 3 18:43 netsw-search.cgi
-rwxr-xr-x 1 netsw users 1885 Aug 1 17:41 netsw-tree.cgi
-rw-r--r-- 1 netsw users 234 Jul 30 16:35 netsw-unlimit.lst
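The DATA/ subdirectory holds the archive described above; the software in net.sw is updated automatically by rdist. The second part ties the archive together with the new CGI scripts and pages: we want to hide DATA/ and run the right CGI script for each requested URL. First, the URL /net.sw/ is rewritten to /e/netsw: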

RewriteRule ^net.sw$ net.sw/ [R]
RewriteRule ^net.sw/(.*)$ e/netsw/$1


The first rule only adds the missing trailing "/" to the URL; the second does the actual rewriting. Then the following configuration goes into /e/netsw/.www/.wwwacl:

Options ExecCGI FollowSymLinks Includes MultiViews

RewriteEngine on

# we are reached via /net.sw/ prefix
RewriteBase /net.sw/

# first we rewrite the root dir to
# the handling cgi script
RewriteRule ^$ netsw-home.cgi [L]
RewriteRule ^index\.html$ netsw-home.cgi [L]

# strip out the subdirs when
# the browser requests us from perdir pages
RewriteRule ^.+/(netsw-[^/]+/.+)$ $1 [L]

# and now break the rewriting for local files
RewriteRule ^netsw-home\.cgi.* - [L]
RewriteRule ^netsw-changes\.cgi.* - [L]
RewriteRule ^netsw-search\.cgi.* - [L]
RewriteRule ^netsw-tree\.cgi$ - [L]
RewriteRule ^netsw-about\.html$ - [L]
RewriteRule ^netsw-img/.*$ - [L]

# anything else is a subdir which gets handled
# by another cgi script
RewriteRule !^netsw-lsdir\.cgi.* - [C]
RewriteRule (.*) netsw-lsdir.cgi/$1


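Hints:

1. Note the L (last) flag and the no-substitution field ('-') in the fourth part.

2. Note the ! (not) character and the C (chain) flag on the first rule of the last part.

3. Note the catch-all pattern in the last rule.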

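From NCSA imagemap to Apache mod_imap
Description:

Many people want a smooth transition from the old NCSA server to Apache, including from the old NCSA imagemap program to Apache's mod_imap. The problem is that many hyperlinks still point at the imagemap program: the old maps are referenced as /cgi-bin/imagemap/path/to/page.map, whereas under Apache the reference is just /path/to/page.map.

Solution:

We only need to strip the "/cgi-bin/imagemap" prefix: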

RewriteEngine on
RewriteRule ^/cgi-bin/imagemap(.*) $1 [PT]


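Searching for Pages in More Than One Directory
Description:

Sometimes Apache has to look for a page in more than one directory, and MultiViews cannot do this.

Solution:

See the following ruleset.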

RewriteEngine on

# first try to find it in custom/…
# …and if found stop and be happy:
RewriteCond /your/docroot/dir1/%{REQUEST_FILENAME} -f
RewriteRule ^(.+) /your/docroot/dir1/$1 [L]

# second try to find it in pub/…
# …and if found stop and be happy:
RewriteCond /your/docroot/dir2/%{REQUEST_FILENAME} -f
RewriteRule ^(.+) /your/docroot/dir2/$1 [L]

# else go on for other Alias or ScriptAlias directives,
# etc.
RewriteRule ^(.+) - [PT]


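Setting Environment Variables According to the URL
Description:

Information can be passed between pages by a CGI script, but you would rather carry it in the URL than use CGI.

Solution:

The following rule strips a variable and its value out of the URL and stores them in a custom environment variable that XSSI or CGI can read later. For example, /foo/S=java/bar/ is rewritten to /foo/bar/ and "java" is stored in the environment variable STATUS.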

RewriteEngine on
RewriteRule ^(.*)/S=([^/]+)/(.*) $1/$3 [E=STATUS:$2]
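
To make the effect of the three groups concrete, the rule can be modelled in Python as below. This is a sketch only: strip_status is an invented name, and the returned dictionary stands in for the environment variable set by the E flag.

```python
import re

RULE = re.compile(r"^(.*)/S=([^/]+)/(.*)")


def strip_status(path):
    """Return the cleaned path and the variables that would be set."""
    match = RULE.search(path)
    if match is None:
        return path, {}
    cleaned = match.group(1) + "/" + match.group(3)   # $1/$3
    return cleaned, {"STATUS": match.group(2)}        # E=STATUS:$2
```

A path without an S= segment is returned untouched with an empty dictionary.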


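Virtual User Hosts
Description:

You want requests for www.username.host.domain.com to be mapped straight to the file system on the basis of DNS records alone, without using Apache's virtual host feature.

Solution:

This works only for HTTP/1.1 requests, which carry a Host header. Using that header, http://www.username.host.com/anypath is rewritten to /home/username/anypath: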

RewriteEngine on
RewriteCond %{HTTP_HOST} ^www\.[^.]+\.host\.com$
RewriteRule ^(.+) %{HTTP_HOST}$1 [C]
RewriteRule ^www\.([^.]+)\.host\.com(.*) /home/$1$2
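
The chained pair of rules first prepends the host name to the path and then takes the combined string apart again. A minimal Python sketch of that idea follows; map_user_host is a made-up name, and a port number in the Host header is not handled.

```python
import re


def map_user_host(host, path):
    """Model the condition plus the two chained rules shown above."""
    # RewriteCond: only hosts of the form www.<user>.host.com qualify
    if re.search(r"^www\.[^.]+\.host\.com$", host) is None:
        return path
    combined = host + path                    # first rule: %{HTTP_HOST}$1
    match = re.search(r"^www\.([^.]+)\.host\.com(.*)", combined)
    return "/home/" + match.group(1) + match.group(2)
```

Requests for any other host name pass through unchanged.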


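Redirecting Home Directories for Remote Clients
Description:

When the client's host is not inside our own domain ourdomain.com, requests for home directory URLs are redirected to www.somewhere.com.

Solution:

See the following ruleset: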

RewriteEngine on
RewriteCond %{REMOTE_HOST} !^.+\.ourdomain\.com$
RewriteRule ^(/~.+) http://www.somewhere.com/$1 [R,L]


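Redirecting Failed Requests to Another Web Server
Description:

This is a frequently asked question. The most obvious approach is ErrorDocument plus a CGI script that produces the target URL, but mod_rewrite can do it too (although less efficiently than the CGI script).

Solution:

Note again that the CGI script is the more efficient solution; the advantages of mod_rewrite are that it is safer and easier to set up: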

RewriteEngine on
RewriteCond /your/docroot/%{REQUEST_FILENAME} !-f
RewriteRule ^(.+) http://webserverB.dom/$1


The example above only works for pages inside the DocumentRoot. The following variant removes that limitation:

RewriteEngine on
RewriteCond %{REQUEST_URI} !-U
RewriteRule ^(.+) http://webserverB.dom/$1


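This variant uses mod_rewrite's URL look-ahead feature (-U), so it works safely for every kind of URL. It slows the server down, however, because each check costs an internal subrequest: avoid it on a slow host and use it only where there is CPU to spare.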

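Extended Redirection
Description:

Sometimes we want to redirect to URLs that contain special characters, such as "url#anchor". Apache normally escapes these characters with its uri_escape() function, so such a redirect cannot be issued directly with mod_rewrite.

Solution:

We hand the redirect to an NPH-CGI script (NPH = non-parseable headers), because its output is not escaped. First we introduce a new URL scheme, xredirect:, with the following rule: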

RewriteRule ^xredirect:(.+) /path/to/nph-xredirect.cgi/$1 \
[T=application/x-httpd-cgi,L]


This forces every URL prefixed with xredirect: through nph-xredirect.cgi, which looks like this:

#!/path/to/perl
##
## nph-xredirect.cgi — NPH/CGI script for extended redirects
## Copyright (c) 1997 Ralf S. Engelschall, All Rights Reserved.
##

$| = 1;
$url = $ENV{'PATH_INFO'};

print "HTTP/1.0 302 Moved Temporarily\n";
print "Server: $ENV{'SERVER_SOFTWARE'}\n";
print "Location: $url\n";
print "Content-type: text/html\n";
print "\n";
print "<html>\n";
print "<head>\n";
print "<title>302 Moved Temporarily (EXTENDED)</title>\n";
print "</head>\n";
print "<body>\n";
print "<h1>Moved Temporarily (EXTENDED)</h1>\n";
print "The document has moved <a HREF=\"$url\">here</a>.<p>\n";
print "</body>\n";
print "</html>\n";

##EOF##


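In this way every kind of URL can be redirected through the CGI script, including those that mod_rewrite cannot handle directly. For example, a URL can be redirected to a news server: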

RewriteRule ^anyurl xredirect:news:newsgroup


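Note: you do not need to add [R] or [R,L] to such a rule, because the xredirect: prefix is expanded later by the rule shown earlier.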

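Archive Access Multiplexer
Description:

If you have ever visited http://www.perl.com/CPAN (CPAN = Comprehensive Perl Archive Network), you were redirected to one of several FTP servers close to your own location. This is really an FTP access multiplexing service. CPAN implements it with CGI; here we use mod_rewrite.

Solution:

Since mod_rewrite 3.0.0 the "ftp:" scheme can be used in redirects. The client's location is derived from the top-level domain of its host name, and a map file relates top-level domains to FTP servers.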

RewriteEngine on
RewriteMap multiplex txt:/path/to/map.cxan
RewriteRule ^/CxAN/(.*) %{REMOTE_HOST}::$1 [C]
RewriteRule ^.+\.([a-zA-Z]+)::(.*)$ ${multiplex:$1|ftp.default.dom}$2 [R,L]


##
## map.cxan — Multiplexing Map for CxAN
##

de ftp://ftp.cxan.de/CxAN/
uk ftp://ftp.cxan.uk/CxAN/
com ftp://ftp.cxan.com/CxAN/
:
##EOF##
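
The two chained rules and the map can be followed step by step in this Python sketch. It is illustrative only: multiplex is an invented name, the dictionary copies the three map entries listed above, and a client known only by IP address is simply left without a redirect.

```python
import re

# The three entries of map.cxan shown above
CXAN_MAP = {
    "de": "ftp://ftp.cxan.de/CxAN/",
    "uk": "ftp://ftp.cxan.uk/CxAN/",
    "com": "ftp://ftp.cxan.com/CxAN/",
}


def multiplex(remote_host, path):
    """Return the redirect target, or None when the rules do not apply."""
    first = re.search(r"^/CxAN/(.*)", path)
    if first is None:
        return None
    chained = remote_host + "::" + first.group(1)   # %{REMOTE_HOST}::$1
    second = re.search(r"^.+\.([a-zA-Z]+)::(.*)$", chained)
    if second is None:
        return None
    base = CXAN_MAP.get(second.group(1), "ftp.default.dom")
    return base + second.group(2)
```

The "::" separator is only a marker that lets the second pattern split host name and path apart again.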


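Time-Dependent Rewriting
Description:

Many webmasters still use CGI scripts to send a URL to different pages depending on the time of day.

Solution:

mod_rewrite provides a set of variables whose names begin with TIME_. Comparing them as strings decides which page is served: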

RewriteEngine on
RewriteCond %{TIME_HOUR}%{TIME_MIN} >0700
RewriteCond %{TIME_HOUR}%{TIME_MIN} <1900
RewriteRule ^foo\.html$ foo.day.html
RewriteRule ^foo\.html$ foo.night.html


This serves foo.day.html under the URL foo.html between 07:00 and 19:00, and foo.night.html the rest of the time.
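
Because the > and < conditions compare strings rather than numbers, the zero-padded HHMM form matters. The Python sketch below (page_for is a made-up name) mirrors the two conditions and shows that the boundary minutes 07:00 and 19:00 themselves fall on the night side.

```python
def page_for(hour, minute):
    """Pick the page the way the two string comparisons above would."""
    stamp = "%02d%02d" % (hour, minute)      # e.g. 7:05 -> "0705"
    if stamp > "0700" and stamp < "1900":    # both comparisons are strict
        return "foo.day.html"
    return "foo.night.html"
```

The comparison works only because every value has the same width, so string order and time order agree.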

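Keeping Old Document URLs Working
Description:

After a document's file extension has been changed, how can the old URL still reach the renamed file?

Solution:

Rewrite the old URL with mod_rewrite: if the file with the new extension exists, the request is mapped to it; otherwise it is mapped back to the original name.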

# backward compatibility ruleset for
# rewriting document.html to document.phtml
# when and only when document.phtml exists
# but no longer document.html
RewriteEngine on
RewriteBase /~quux/
# parse out basename, but remember the fact
RewriteRule ^(.*)\.html$ $1 [C,E=WasHTML:yes]
# rewrite to document.phtml if exists
RewriteCond %{REQUEST_FILENAME}.phtml -f
RewriteRule ^(.*)$ $1.phtml [S=1]
# else reverse the previous basename cutout
RewriteCond %{ENV:WasHTML} ^yes$
RewriteRule ^(.*)$ $1.html
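
The decision made by this ruleset can be summarised in a few lines of Python. This is a sketch of the logic only, not of the rule engine: resolve is an invented name, and the exists argument stands in for the -f file test.

```python
def resolve(name, exists):
    """Prefer document.phtml when it exists, else keep document.html."""
    if not name.endswith(".html"):
        return name                      # the first rule does not match
    base = name[: -len(".html")]         # strip the extension, as $1 does
    if exists(base + ".phtml"):
        return base + ".phtml"
    return base + ".html"                # restore the original name
```

For example, with files = {"a.phtml", "b.html"}, resolve("a.html", files.__contains__) picks the .phtml variant while "b.html" stays as it is.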


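Content Handling
From Old to New File Name (File System)
Description:

Suppose we have renamed foo.html to bar.html and want the old URL to keep working, without even telling users about the new URL.

Solution:

Map the old URL internally to the new file: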

RewriteEngine on
RewriteBase /~quux/
RewriteRule ^foo\.html$ bar.html


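From Old to New File Name (URL)
Description:

As in the previous example, we have renamed foo.html to bar.html, but this time we want to send the user to the new URL, so that the address shown in the browser changes.

Solution:

Force a redirect to the new URL: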

RewriteEngine on
RewriteBase /~quux/
RewriteRule ^foo\.html$ bar.html [R]


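Browser-Dependent Content
Description:

A good page should cater for every kind of browser; for example, the full version goes to Netscape while a text-only version goes to Lynx.

Solution:

Content negotiation (mod_negotiation) cannot be used here because browsers do not describe their type in the form it needs, so we act on the "User-Agent" header instead. If the User-Agent begins with "Mozilla/3", foo.html is rewritten to foo.NS.html; for "Lynx" or "Mozilla" version 1 or 2 it becomes foo.20.html; every other browser gets foo.32.html: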

RewriteCond %{HTTP_USER_AGENT} ^Mozilla/3.*
RewriteRule ^foo\.html$ foo.NS.html [L]

RewriteCond %{HTTP_USER_AGENT} ^Lynx/.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/[12].*
RewriteRule ^foo\.html$ foo.20.html [L]

RewriteRule ^foo\.html$ foo.32.html [L]


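Dynamic Mirror
Description:

You want pages from a remote host to appear inside your own web space. For an FTP server you could run the mirror program to keep a local copy up to date, and for a web server webcopy does the same job over HTTP. This has a drawback: the local copy is only refreshed when webcopy runs. It is better to fetch the current data from the remote source at the moment it is requested.

Solution:

Use the Proxy Throughput feature (flag [P]) to map a remote page, or even a whole remote site, into our own namespace.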

RewriteEngine on
RewriteBase /~quux/
RewriteRule ^hotsheet/(.*)$ http://www.tstimpreso.com/hotsheet/$1 [P]


RewriteEngine on
RewriteBase /~quux/
RewriteRule ^usa-news\.html$ http://www.quux-corp.com/news/index.html [P]


Reverse Dynamic Mirror
Description:

(omitted)

Solution:

RewriteEngine on
RewriteCond /mirror/of/remotesite/$1 -U
RewriteRule ^http://www\.remotesite\.com/(.*)$ /mirror/of/remotesite/$1


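Retrieving Missing Data from the Intranet
Description:

For security we run two web servers. The first (www.quux-corp.dom) is public; the second is for internal use behind a firewall, and all data and site maintenance go through it. We now want the external server to reach through the firewall and fetch the latest files from the internal server.

Solution:

Only the external server may fetch data from the internal one; the firewall rejects every other direct request. First, configure the firewall: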

ALLOW Host www.quux-corp.dom Port >1024 --> Host www2.quux-corp.dom Port 80
DENY Host * Port * --> Host www2.quux-corp.dom Port 80


Translate these rules into the syntax of your own firewall, then let mod_rewrite fetch the missing data through proxy throughput:

RewriteRule ^/~([^/]+)/?(.*) /home/$1/.www/$2
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^/home/([^/]+)/.www/?(.*) http://www2.quux-corp.dom/~$1/pub/$2 [P]


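Load Balancing
Description:

We want to spread the load evenly across the six servers www[0-5].foo.com.

Solution:

There are many ways to do this. DNS is the usual one, so the DNS approaches come first, followed by the mod_rewrite one.

1. DNS Round-Robin

The simplest method is BIND's round-robin feature, e.g.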

www0 IN A 1.2.3.1
www1 IN A 1.2.3.2
www2 IN A 1.2.3.3
www3 IN A 1.2.3.4
www4 IN A 1.2.3.5
www5 IN A 1.2.3.6


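Then add the following entries:

www IN CNAME www0.foo.com.
    IN CNAME www1.foo.com.
    IN CNAME www2.foo.com.
    IN CNAME www3.foo.com.
    IN CNAME www4.foo.com.
    IN CNAME www5.foo.com.


From a DNS point of view this setup is of course wrong, but it exploits BIND's round-robin behaviour: each time BIND resolves www.foo.com it hands out www0-www5 in rotation, which spreads clients over the servers. It is not a perfect scheme, because other name servers cache the result, so once www has been resolved to a particular wwwX.foo.com many clients are sent to that same server; overall, though, the load evens out.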

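2. DNS Load Balancing

lbnamed, available at http://www.stanford.edu/~schemers/docs/lbnamed/lbnamed.html, is a dedicated program that uses the name server to distribute requests across servers. It is a sophisticated DNS load balancer written in Perl 5 with some auxiliary tools.

3. Proxy Throughput Round-Robin

This variant uses mod_rewrite and its proxy throughput feature. First, a DNS entry makes www0.foo.com the real www.foo.com: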

www IN CNAME www0.foo.com.


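Then www0.foo.com is turned into a dedicated proxy server that spreads the requests over the five other servers (www1-www5). This is done with the script lb.pl and the following mod_rewrite rules: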

RewriteEngine on
RewriteMap lb prg:/path/to/lb.pl
RewriteRule ^/(.+)$ ${lb:$1} [P,L]


The code of lb.pl:

#!/path/to/perl
##
## lb.pl — load balancing script
##

$| = 1;

$name = "www"; # the hostname base
$first = 1; # the first server (not 0 here, because 0 is myself)
$last = 5; # the last server in the round-robin
$domain = "foo.dom"; # the domainname

$cnt = 0;
while (<STDIN>) {
$cnt = (($cnt+1) % ($last+1-$first));
$server = sprintf("%s%d.%s", $name, $cnt+$first, $domain);
print "http://$server/$_";
}

##EOF##


Note that www0.foo.com still receives every request, but all it does is pass them on; SSI, CGI and ePerl requests are all executed on the other servers, so the overall load is much reduced.
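
For readers who do not use Perl, the counter arithmetic in lb.pl is reproduced below in Python as a sketch. The name balancer is invented; the default arguments follow the values in the script, and the stdin loop is left out.

```python
def balancer(name="www", first=1, last=5, domain="foo.dom"):
    """Return a function that yields backend URLs in round-robin order."""
    cnt = 0

    def next_url(path):
        nonlocal cnt
        # same update as in lb.pl: advance, then wrap around the pool size
        cnt = (cnt + 1) % (last + 1 - first)
        return "http://%s%d.%s/%s" % (name, cnt + first, domain, path)

    return next_url
```

Because the counter is advanced before it is used, the first request goes to www2 and the cycle then runs www3, www4, www5, www1 and back to www2.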

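4. Hardware/TCP Round-Robin

Hardware solutions exist as well: Cisco's LocalDirector balances requests at the TCP/IP level, with the logic built directly into the circuit board. Hardware solutions usually cost a lot of money but give the best performance.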

Reverse Proxy
Description:

(omitted)

Solution:

##
## apache-rproxy.conf — Apache configuration for Reverse Proxy Usage
##

# server type
ServerType standalone
Port 8000
MinSpareServers 16
StartServers 16
MaxSpareServers 16
MaxClients 16
MaxRequestsPerChild 100

# server operation parameters
KeepAlive on
MaxKeepAliveRequests 100
KeepAliveTimeout 15
Timeout 400
IdentityCheck off
HostnameLookups off

# paths to runtime files
PidFile /path/to/apache-rproxy.pid
LockFile /path/to/apache-rproxy.lock
ErrorLog /path/to/apache-rproxy.elog
CustomLog /path/to/apache-rproxy.dlog "%{%v/%T}t %h -> %{SERVER}e URL: %U"

# unused paths
ServerRoot /tmp
DocumentRoot /tmp
CacheRoot /tmp
RewriteLog /dev/null
TransferLog /dev/null
TypesConfig /dev/null
AccessConfig /dev/null
ResourceConfig /dev/null

# speed up and secure processing
<Directory />
Options -FollowSymLinks -SymLinksIfOwnerMatch
AllowOverride None
</Directory>

# the status page for monitoring the reverse proxy
<Location /apache-rproxy-status>
SetHandler server-status
</Location>

# enable the URL rewriting engine
RewriteEngine on
RewriteLogLevel 0

# define a rewriting map with value-lists where
# mod_rewrite randomly chooses a particular value
RewriteMap server rnd:/path/to/apache-rproxy.conf-servers

# make sure the status page is handled locally
# and make sure no one uses our proxy except ourself
RewriteRule ^/apache-rproxy-status.* - [L]
RewriteRule ^(http|ftp)://.* - [F]

# now choose the possible servers for particular URL types
RewriteRule ^/(.*\.(cgi|shtml))$ to://${server:dynamic}/$1 [S=1]
RewriteRule ^/(.*)$ to://${server:static}/$1

# and delegate the generated URL by passing it
# through the proxy module
RewriteRule ^to://([^/]+)/(.*) http://$1/$2 [E=SERVER:$1,P,L]

# and make really sure all other stuff is forbidden
# when it should survive the above rules…
RewriteRule .* - [F]

# enable the Proxy module without caching
ProxyRequests on
NoCache *

# setup URL reverse mapping for redirect reponses
ProxyPassReverse / http://www1.foo.dom/
ProxyPassReverse / http://www2.foo.dom/
ProxyPassReverse / http://www3.foo.dom/
ProxyPassReverse / http://www4.foo.dom/
ProxyPassReverse / http://www5.foo.dom/
ProxyPassReverse / http://www6.foo.dom/


##
## apache-rproxy.conf-servers — Apache/mod_rewrite selection table
##

# list of backend servers which serve static
# pages (HTML files and Images, etc.)
static www1.foo.dom|www2.foo.dom|www3.foo.dom|www4.foo.dom

# list of backend servers which serve dynamically
# generated page (CGI programs or mod_perl scripts)
dynamic www5.foo.dom|www6.foo.dom
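
How the rnd: map and the two selection rules cooperate can be sketched in Python as follows. This is only a model: pick_backend is a made-up name, the dictionary repeats the two map lines above, and random.choice stands in for the random pick among the "|"-separated values.

```python
import random
import re

# Same content as apache-rproxy.conf-servers above
SERVERS = {
    "static": "www1.foo.dom|www2.foo.dom|www3.foo.dom|www4.foo.dom",
    "dynamic": "www5.foo.dom|www6.foo.dom",
}


def pick_backend(path, rng=random):
    """Choose a backend at random from the pool that fits the URL type."""
    is_dynamic = re.search(r"^/(.*\.(cgi|shtml))$", path) is not None
    pool = SERVERS["dynamic" if is_dynamic else "static"]
    return rng.choice(pool.split("|"))
```

Unlike the round-robin script, successive picks are independent, so the spread across servers is even only on average.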


New MIME Type, New Service
Description:

There are a lot of nifty CGI programs on the net, but they are awkward to use, so many webmasters do not bother with them. Even Apache's Action handler feature for MIME types is appropriate only when the CGI programs do not need special URLs (that is, PATH_INFO and QUERY_STRING) as input. First, let us configure a new file type with the extension .scgi (for secure CGI), which will be processed by the popular cgiwrap program. The problem is that with a Homogeneous URL Layout (see above), a file inside a user's home directory has the URL /u/user/foo/bar.scgi, whereas cgiwrap needs the URL in the form /~user/foo/bar.scgi/. The following rule solves the problem:

RewriteRule ^/[uge]/([^/]+)/\.www/(.+)\.scgi(.*) …
… /internal/cgi/user/cgiwrap/~$1/$2.scgi$3 [NS,T=application/x-http-cgi]


Or assume we have some more nifty programs: wwwlog (which displays the access.log for a URL subtree) and wwwidx (which runs Glimpse on a URL subtree). We have to tell these programs which URL area they should operate on. Usually this is awkward, because they are always requested from within that area: typically we would run the wwwidx program from within /u/user/foo/ via a hyperlink to

/internal/cgi/user/wwwidx?i=/u/user/foo/
which is ugly, because both the location of the area and the location of the CGI are hard-coded in the hyperlink. When the area has to be reorganised, we spend a lot of time changing the various hyperlinks.

Solution:

The solution here is to provide a special new URL format which automatically leads to the proper CGI invocation. We configure the following:

RewriteRule ^/([uge])/([^/]+)(/?.*)/\* /internal/cgi/user/wwwidx?i=/$1/$2$3/
RewriteRule ^/([uge])/([^/]+)(/?.*):log /internal/cgi/user/wwwlog?f=/$1/$2$3


Now the hyperlink to search at /u/user/foo/ reads only

HREF="*"
which internally gets automatically transformed to

/internal/cgi/user/wwwidx?i=/u/user/foo/
The same approach leads to an invocation for the access log CGI program when the hyperlink :log gets used.

From Static to Dynamic
Description:

How can we transform a static page foo.html into a dynamic variant foo.cgi in a seamless way, i.e. without the browser or user noticing?

Solution:

We just rewrite the URL to the CGI script and force the correct MIME type so that it really gets run as a CGI script. This way a request to /~quux/foo.html internally leads to the invocation of /~quux/foo.cgi.

RewriteEngine on
RewriteBase /~quux/
RewriteRule ^foo\.html$ foo.cgi [T=application/x-httpd-cgi]


On-the-fly Content-Regeneration
Description:

Here comes a really esoteric feature: dynamically generated but statically served pages, i.e. pages should be delivered as pure static pages (read from the filesystem and just passed through), but they have to be generated dynamically by the web server if missing. This way you can have CGI-generated pages which are statically served unless someone (or a cron job) removes the static contents. Then the contents get refreshed.

Solution:

This is done via the following ruleset:

RewriteCond %{REQUEST_FILENAME} !-s
RewriteRule ^page\.html$ page.cgi [T=application/x-httpd-cgi,L]


Here a request to page.html leads to an internal run of the corresponding page.cgi if page.html is still missing or has a file size of zero. The trick is that page.cgi is an ordinary CGI script which, in addition to writing to STDOUT, writes its output to the file page.html. Once it has run, the server sends out the data of page.html. When the webmaster wants to force a refresh of the contents, he just removes page.html (usually done by a cron job).

Document With Autorefresh
Description:

Wouldn't it be nice while creating a complex webpage if the webbrowser would automatically refresh the page every time we write a new version from within our editor? Impossible?

Solution:

No! We just combine the MIME multipart feature, the webserver NPH feature and the URL manipulation power of mod_rewrite. First, we establish a new URL feature: Adding just :refresh to any URL causes this to be refreshed every time it gets updated on the filesystem.

RewriteRule ^(/[uge]/[^/]+/?.*):refresh /internal/cgi/apache/nph-refresh?f=$1


Now when we reference the URL

/u/foo/bar/page.html:refresh
this leads to the internal invocation of the URL

/internal/cgi/apache/nph-refresh?f=/u/foo/bar/page.html
The only missing part is the NPH-CGI script. Although one would usually say "left as an exercise to the reader" ;-) I will provide this, too.

#!/sw/bin/perl
##
## nph-refresh — NPH/CGI script for auto refreshing pages
## Copyright (c) 1997 Ralf S. Engelschall, All Rights Reserved.
##
$| = 1;

# split the QUERY_STRING variable
@pairs = split(/&/, $ENV{'QUERY_STRING'});
foreach $pair (@pairs) {
($name, $value) = split(/=/, $pair);
$name =~ tr/A-Z/a-z/;
$name = 'QS_' . $name;
$value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
eval "\$$name = \"$value\"";
}
$QS_s = 1 if ($QS_s eq '');
$QS_n = 3600 if ($QS_n eq '');
if ($QS_f eq '') {
print "HTTP/1.0 200 OK\n";
print "Content-type: text/html\n\n";
print "<b>ERROR</b>: No file given\n";
exit(0);
}
if (! -f $QS_f) {
print "HTTP/1.0 200 OK\n";
print "Content-type: text/html\n\n";
print "<b>ERROR</b>: File $QS_f not found\n";
exit(0);
}
exit(0);
}

sub print_http_headers_multipart_begin {
print "HTTP/1.0 200 OK\n";
$bound = "ThisRandomString12345";
print "Content-type: multipart/x-mixed-replace;boundary=$bound\n";
&print_http_headers_multipart_next;
}

sub print_http_headers_multipart_next {
print "\n--$bound\n";
}

sub print_http_headers_multipart_end {
print "\n--$bound--\n";
}

sub displayhtml {
local($buffer) = @_;
$len = length($buffer);
print "Content-type: text/html\n";
print "Content-length: $len\n\n";
print $buffer;
}

sub readfile {
local($file) = @_;
local(*FP, $size, $buffer, $bytes);
($x, $x, $x, $x, $x, $x, $x, $size) = stat($file);
$size = sprintf("%d", $size);
open(FP, "<$file");
$bytes = sysread(FP, $buffer, $size);
close(FP);
return $buffer;
}

$buffer = &readfile($QS_f);
&print_http_headers_multipart_begin;
&displayhtml($buffer);

sub mystat {
local($file) = $_[0];
local($time);

($x, $x, $x, $x, $x, $x, $x, $x, $x, $mtime) = stat($file);
return $mtime;
}

$mtimeL = &mystat($QS_f);
$mtime = $mtime;
for ($n = 0; $n < $QS_n; $n++) {
while (1) {
$mtime = &mystat($QS_f);
if ($mtime ne $mtimeL) {
$mtimeL = $mtime;
sleep(2);
$buffer = &readfile($QS_f);
&print_http_headers_multipart_next;
&displayhtml($buffer);
sleep(5);
$mtimeL = &mystat($QS_f);
last;
}
sleep($QS_s);
}
}

&print_http_headers_multipart_end;

exit(0);

##EOF##
Mass Virtual Hosting
Description:

The <VirtualHost> feature of Apache is nice and works great when you have just a few dozen virtual hosts. But when you are an ISP and have hundreds of virtual hosts to provide, this feature is not the best choice.

Solution:

To provide this feature we look up the requested host name in a rewrite map and map the URL to the DocumentRoot listed there for that virtual host:

##
## vhost.map
##
www.vhost1.dom:80 /path/to/docroot/vhost1
www.vhost2.dom:80 /path/to/docroot/vhost2
:
www.vhostN.dom:80 /path/to/docroot/vhostN


##
## httpd.conf
##
:
# use the canonical hostname on redirects, etc.
UseCanonicalName on

:
# add the virtual host in front of the CLF-format
CustomLog /path/to/access_log "%{VHOST}e %h %l %u %t \"%r\" %>s %b"
:

# enable the rewriting engine in the main server
RewriteEngine on

# define two maps: one for fixing the URL and one which defines
# the available virtual hosts with their corresponding
# DocumentRoot.
RewriteMap lowercase int:tolower
RewriteMap vhost txt:/path/to/vhost.map

# Now do the actual virtual host mapping
# via a huge and complicated single rule:
#
# 1. make sure we don't map for common locations
RewriteCond %{REQUEST_URI} !^/commonurl1/.*
RewriteCond %{REQUEST_URI} !^/commonurl2/.*
:
RewriteCond %{REQUEST_URI} !^/commonurlN/.*
#
# 2. make sure we have a Host header, because
# currently our approach only supports
# virtual hosting through this header
RewriteCond %{HTTP_HOST} !^$
#
# 3. lowercase the hostname
RewriteCond ${lowercase:%{HTTP_HOST}|NONE} ^(.+)$
#
# 4. lookup this hostname in vhost.map and
# remember it only when it is a path
# (and not "NONE" from above)
RewriteCond ${vhost:%1} ^(/.*)$
#
# 5. finally we can map the URL to its docroot location
# and remember the virtual host for logging purposes
RewriteRule ^/(.*)$ %1/$1 [E=VHOST:${lowercase:%{HTTP_HOST}}]
:
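
The five numbered steps of this ruleset amount to a guarded dictionary lookup. The Python sketch below follows them one by one; it is an illustration under simplifying assumptions, with map_to_docroot as an invented name and the map reduced to the two sample entries of vhost.map.

```python
# Sample entries copied from vhost.map above
VHOST_MAP = {
    "www.vhost1.dom:80": "/path/to/docroot/vhost1",
    "www.vhost2.dom:80": "/path/to/docroot/vhost2",
}
COMMON_PREFIXES = ("/commonurl1/", "/commonurl2/")


def map_to_docroot(host_header, uri):
    """Return (mapped path, VHOST value) or (uri, None) when no rule fires."""
    if uri.startswith(COMMON_PREFIXES):       # 1. skip common locations
        return uri, None
    if host_header == "":                     # 2. a Host header is required
        return uri, None
    host = host_header.lower()                # 3. lowercase the host name
    docroot = VHOST_MAP.get(host, "")         # 4. look it up in the map
    if not docroot.startswith("/") or not uri.startswith("/"):
        return uri, None
    return docroot + "/" + uri[1:], host      # 5. %1/$1 and E=VHOST
```

Note that the lookup key is the Host header exactly as sent, so with these sample entries a header without the ":80" suffix finds no match.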


Access Restriction
Blocking of Robots
Description:

How can we block a really annoying robot from retrieving pages of a specific webarea? A /robots.txt file containing entries of the "Robot Exclusion Protocol" is typically not enough to get rid of such a robot.

Solution:

We use a ruleset which forbids the URLs of the webarea /~quux/foo/arc/ (perhaps a very deep directory indexed area where the robot traversal would create big server load). We have to make sure that we forbid access only to the particular robot, i.e. just forbidding the host where the robot runs is not enough. This would block users from this host, too. We accomplish this by also matching the User-Agent HTTP header information.

RewriteCond %{HTTP_USER_AGENT} ^NameOfBadRobot.*
RewriteCond %{REMOTE_ADDR} ^123\.45\.67\.[8-9]$
RewriteRule ^/~quux/foo/arc/.+ - [F]


Blocked Inline-Images
Description:

Assume we have under http://www.quux-corp.de/~quux/ some pages with inlined GIF graphics. These graphics are nice, so others directly incorporate them via hyperlinks to their pages. We don't like this practice because it adds useless traffic to our server.

Solution:

While we cannot 100% protect the images from inclusion, we can at least restrict the cases where the browser sends a HTTP Referer header.

RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://www.quux-corp.de/~quux/.*$ [NC]
RewriteRule .*\.gif$ - [F]


RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !.*/foo-with-gif\.html$
RewriteRule ^inlined-in-foo\.gif$ - [F]
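
The first of the two rulesets boils down to three checks, shown here as a Python sketch with an invented function name. As in the configuration, an empty Referer is always let through, which is why the protection cannot be complete.

```python
import re

OWN_PAGES = r"^http://www.quux-corp.de/~quux/.*$"


def is_blocked(path, referer):
    """Return True when the request for a GIF would be answered with [F]."""
    if re.search(r".*\.gif$", path) is None:
        return False                          # rule pattern: GIF files only
    if referer == "":
        return False                          # first condition: !^$
    # second condition: the Referer is not one of our own pages ([NC])
    return re.search(OWN_PAGES, referer, re.IGNORECASE) is None
```

Requests that arrive with a Referer from another site are the only ones refused.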


Host Deny
Description:

How can we forbid a list of externally configured hosts from using our server?

Solution:

For Apache >= 1.3b6:

RewriteEngine on
RewriteMap hosts-deny txt:/path/to/hosts.deny
RewriteCond ${hosts-deny:%{REMOTE_HOST}|NOT-FOUND} !=NOT-FOUND [OR]
RewriteCond ${hosts-deny:%{REMOTE_ADDR}|NOT-FOUND} !=NOT-FOUND
RewriteRule ^/.* - [F]


For Apache <= 1.3b6:

RewriteEngine on
RewriteMap hosts-deny txt:/path/to/hosts.deny
RewriteRule ^/(.*)$ ${hosts-deny:%{REMOTE_HOST}|NOT-FOUND}/$1
RewriteRule !^NOT-FOUND/.* - [F]
RewriteRule ^NOT-FOUND/(.*)$ ${hosts-deny:%{REMOTE_ADDR}|NOT-FOUND}/$1
RewriteRule !^NOT-FOUND/.* - [F]
RewriteRule ^NOT-FOUND/(.*)$ /$1


##
## hosts.deny
##
## ATTENTION! This is a map, not a list, even when we treat it as such.
## mod_rewrite parses it for key/value pairs, so at least a
## dummy value "-" must be present for each entry.
##

193.102.180.41 -
bsdti1.sdm.de -
192.76.162.40 -


URL-Restricted Proxy
Description:

How can we restrict the proxy to allow access to a configurable set of internet sites only? The site list is extracted from a prepared bookmarks file.

Solution:

We first have to make sure mod_rewrite is below(!) mod_proxy in the Configuration file when compiling the Apache webserver (or in the AddModule list of httpd.conf in the case of dynamically loaded modules), as it must get called _before_ mod_proxy.

For simplicity, we generate the site list as a textfile map (but see the mod_rewrite documentation for a conversion script to DBM format). A typical Netscape bookmarks file can be converted to a list of sites with a shell script like this:

#!/bin/sh
cat ${1:-~/.netscape/bookmarks.html} |
tr -d '\015' | tr '[A-Z]' '[a-z]' | grep href=\" |
sed -e '/href="file:/d;' -e '/href="news:/d;' \
-e 's|^.*href="[^:]*://\([^:/"]*\).*$|\1 OK|;' \
-e '/href="/s|^.*href="\([^:/"]*\).*$|\1 OK|;' |
sort -u


We redirect the resulting output into a text file called goodsites.txt. It now looks similar to this:

www.apache.org OK
xml.apache.org OK
jakarta.apache.org OK
perl.apache.org OK



We reference this site file within the configuration for the VirtualHost which is responsible for serving as a proxy (often not port 80, but 81, 8080 or 8008).

<VirtualHost *:8008>

RewriteEngine On
# Either use the (plaintext) allow list from goodsites.txt
RewriteMap ProxyAllow txt:/usr/local/apache/conf/goodsites.txt
# Or, for faster access, convert it to a DBM database:
#RewriteMap ProxyAllow dbm:/usr/local/apache/conf/goodsites
# Match lowercased hostnames
RewriteMap lowercase int:tolower
# Here we go:
# 1) first lowercase the site name and strip off a :port suffix
RewriteCond ${lowercase:%{HTTP_HOST}} ^([^:]*).*$
# 2) next look it up in the map file.
# "%1" refers to the previous regex.
# If the result is "OK", proxy access is granted.
RewriteCond ${ProxyAllow:%1|DENY} !^OK$ [NC]
# 3) Disallow proxy requests if the site was _not_ tagged "OK":
RewriteRule ^proxy: - [F]

</VirtualHost>


Proxy Deny
Description:

How can we forbid a certain host or even a user of a special host from using the Apache proxy?

Solution:

We first have to make sure mod_rewrite is below(!) mod_proxy in the Configuration file when compiling the Apache webserver. This way it gets called _before_ mod_proxy. Then we configure the following for a host-dependent deny…

RewriteCond %{REMOTE_HOST} ^badhost\.mydomain\.com$
RewriteRule !^http://[^/.]\.mydomain.com.* - [F]


…and this one for a user@host-dependent deny:

RewriteCond %{REMOTE_IDENT}@%{REMOTE_HOST} ^badguy@badhost\.mydomain\.com$
RewriteRule !^http://[^/.]\.mydomain.com.* - [F]


Special Authentication Variant
Description:

Sometimes a very special authentication is needed, for instance an authentication which checks for a set of explicitly configured users. Only these should receive access and without explicit prompting (which would occur when using the Basic Auth via mod_access).

Solution:

We use a list of rewrite conditions to exclude all except our friends:

RewriteCond %{REMOTE_IDENT}@%{REMOTE_HOST} !^friend1@client1.quux-corp\.com$
RewriteCond %{REMOTE_IDENT}@%{REMOTE_HOST} !^friend2@client2.quux-corp\.com$
RewriteCond %{REMOTE_IDENT}@%{REMOTE_HOST} !^friend3@client3.quux-corp\.com$
RewriteRule ^/~quux/only-for-friends/ - [F]


Referer-based Deflector
Description:

How can we program a flexible URL Deflector which acts on the "Referer" HTTP header and can be configured with as many referring pages as we like?

Solution:

Use the following really tricky ruleset…

RewriteMap deflector txt:/path/to/deflector.map

RewriteCond %{HTTP_REFERER} !=""
RewriteCond ${deflector:%{HTTP_REFERER}} ^-$
RewriteRule ^.* %{HTTP_REFERER} [R,L]

RewriteCond %{HTTP_REFERER} !=""
RewriteCond ${deflector:%{HTTP_REFERER}|NOT-FOUND} !=NOT-FOUND
RewriteRule ^.* ${deflector:%{HTTP_REFERER}} [R,L]


… in conjunction with a corresponding rewrite map:

##
## deflector.map
##

http://www.badguys.com/bad/index.html -
http://www.badguys.com/bad/index2.html -
http://www.badguys.com/bad/index3.html http://somewhere.com/


This automatically redirects the request back to the referring page (when "-" is used as the value in the map) or to a specific URL (when a URL is specified in the map as the second argument).
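
The two branches of the deflector can be followed in this short Python sketch. It is illustrative only: deflect is a made-up name and the dictionary repeats the three sample entries of deflector.map.

```python
# Sample entries copied from deflector.map above
DEFLECTOR = {
    "http://www.badguys.com/bad/index.html": "-",
    "http://www.badguys.com/bad/index2.html": "-",
    "http://www.badguys.com/bad/index3.html": "http://somewhere.com/",
}


def deflect(referer):
    """Return the redirect target for a request, or None to serve it normally."""
    if referer == "":
        return None                    # no Referer header: nothing to do
    target = DEFLECTOR.get(referer)
    if target is None:
        return None                    # referring page is not listed
    if target == "-":
        return referer                 # send the client back where it came from
    return target                      # send it to the configured URL
```

Adding a new referring page is then just a matter of adding a line to the map.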

Other
External Rewriting Engine
Description:

A FAQ: How can we solve the FOO/BAR/QUUX/etc. problem? There seems no solution by the use of mod_rewrite…

Solution:

Use an external rewrite map, i.e. a program which acts like a rewrite map. It is run once on startup of Apache, receives the requested URLs on STDIN and has to put the resulting (usually rewritten) URL on STDOUT (same order!).

RewriteEngine on
RewriteMap quux-map prg:/path/to/map.quux.pl
RewriteRule ^/~quux/(.*)$ /~quux/${quux-map:$1}


#!/path/to/perl

# disable buffered I/O which would lead
# to deadloops for the Apache server
$| = 1;

# read URLs one per line from stdin and
# generate substitution URL on stdout
while (<>) {
s|^foo/|bar/|;
print $_;
}


This is a demonstration-only example and just rewrites all URLs /~quux/foo/… to /~quux/bar/…. Actually you can program whatever you like. But notice that while such maps can be used also by an average user, only the system administrator can define it.
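
Since a program map only has to read lines from STDIN and answer on STDOUT, it does not have to be written in Perl. The following Python version of the same foo/ to bar/ demonstration is a sketch under that assumption; rewrite and serve are invented names, and a real script would end with a call to serve().

```python
import sys


def rewrite(key):
    """Replace a leading foo/ with bar/, like the s|^foo/|bar/| above."""
    if key.startswith("foo/"):
        return "bar/" + key[len("foo/"):]
    return key


def serve(stdin=sys.stdin, stdout=sys.stdout):
    """Answer one lookup per input line, in the same order."""
    for line in stdin:
        stdout.write(rewrite(line.rstrip("\n")) + "\n")
        stdout.flush()   # unbuffered replies, the counterpart of $| = 1
```

Flushing after every line matters for the same reason as in the Perl script: the server waits for each answer before it continues.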
