Add observability variables and provision for avoiding OCSP conflicts

[users/heiko/exim.git] / doc / doc-txt / experimental-spec.txt
diff --git a/doc/doc-txt/experimental-spec.txt b/doc/doc-txt/experimental-spec.txt

index 031c5f4c120eb38c6490629828130d753d063e24..c060a6c5a99856e2b7c77b9ccbbc767eaaced660 100644 (file)
--- a/doc/doc-txt/experimental-spec.txt
+++ b/doc/doc-txt/experimental-spec.txt
@@ -6,97 +6,6 @@ about experimental  features, all  of which  are unstable and
  liable to incompatible change.
  
  
-OCSP Stapling support
---------------------------------------------------------------
-
-X.509 PKI certificates expire and can be revoked; to handle this, the
-clients need some way to determine if a particular certificate, from a
-particular Certificate Authority (CA), is still valid.  There are three
-main ways to do so.
-
-The simplest way is to serve up a Certificate Revocation List (CRL) with
-an ordinary web-server, regenerating the CRL before it expires.  The
-downside is that clients have to periodically re-download a potentially
-huge file from every certificate authority it knows of.
-
-The way with most moving parts at query time is Online Certificate
-Status Protocol (OCSP), where the client verifies the certificate
-against an OCSP server run by the CA.  This lets the CA track all
-usage of the certs.  This requires running software with access to the
-private key of the CA, to sign the responses to the OCSP queries.  OCSP
-is based on HTTP and can be proxied accordingly.
-
-The only widespread OCSP server implementation (known to this writer)
-comes as part of OpenSSL and aborts on an invalid request, such as
-connecting to the port and then disconnecting.  This requires
-re-entering the passphrase each time some random client does this.
-
-The third way is OCSP Stapling; in this, the server using a certificate
-issued by the CA periodically requests an OCSP proof of validity from
-the OCSP server, then serves it up inline as part of the TLS
-negotiation.   This approach adds no extra round trips, does not let the
-CA track users, scales well with number of certs issued by the CA and is
-resilient to temporary OCSP server failures, as long as the server
-starts retrying to fetch an OCSP proof some time before its current
-proof expires.  The downside is that it requires server support.
-
-If Exim is built with EXPERIMENTAL_OCSP and it was built with OpenSSL,
-or with GnuTLS 3.1.3 or later, then it gains a new global option:
-"tls_ocsp_file".
-
-The file specified therein is expected to be in DER format, and contain
-an OCSP proof.  Exim will serve it as part of the TLS handshake.  This
-option will be re-expanded for SNI, if the tls_certificate option
-contains $tls_sni, as per other TLS options.
-
-Exim does not at this time implement any support for fetching a new OCSP
-proof.  The burden is on the administrator to handle this, outside of
-Exim.  The file specified should be replaced atomically, so that the
-contents are always valid.  Exim will expand the "tls_ocsp_file" option
-on each connection, so a new file will be handled transparently on the
-next connection.
-
-Under OpenSSL Exim will check for a valid next update timestamp in the
-OCSP proof; if not present, or if the proof has expired, it will be
-ignored.
-
-Also, given EXPERIMENTAL_OCSP, the smtp transport gains two options:
-- "hosts_require_ocsp"; a host-list for which an OCSP Stapling
-is requested and required for the connection to proceed.  The default
-value is empty.
-- "hosts_request_ocsp"; a host-list for which (additionally) an OCSP
-Stapling is requested (but not necessarily verified).  The default
-value is "*" meaning that requests are made unless configured
-otherwise.
-
-The host(s) should also be in "hosts_require_tls", and
-"tls_verify_certificates" configured for the transport.
-
-For the client to be able to verify the stapled OCSP the server must
-also supply, in its stapled information, any intermediate
-certificates for the chain leading to the OCSP proof from the signer
-of the server certificate.  There may be zero or one such. These
-intermediate certificates should be added to the server OCSP stapling
-file (named by tls_ocsp_file).
-
-Note that the proof only covers the terminal server certificate,
-not any of the chain from CA to it.
-
-At this point in time, we're gathering feedback on use, to determine if
-it's worth adding complexity to the Exim daemon to periodically re-fetch
-OCSP files and somehow handling multiple files.
-
-  A helper script "ocsp_fetch.pl" for fetching a proof from a CA
-  OCSP server is supplied.  The server URL may be included in the
-  server certificate, if the CA is helpful.
-
-  One failure mode seen was the OCSP Signer cert expiring before the end
-  of validity of the OCSP proof. The checking done by Exim/OpenSSL
-  noted this as invalid overall, but the re-fetch script did not.
-
-
-
-
  Brightmail AntiSpam (BMI) suppport
  --------------------------------------------------------------
  
@@ -1108,11 +1017,27 @@ an example, in my connect ACL, I have:
         logwrite = Internal Server Address: $received_ip_address:$received_port
  
  
-4. Runtime issues to be aware of:
+4. Recommended ACL additions:
     - Since the real connections are all coming from your proxy, and the
       per host connection tracking is done before Proxy Protocol is
       evaluated, smtp_accept_max_per_host must be set high enough to
       handle all of the parallel volume you expect per inbound proxy.
+   - With the smtp_accept_max_per_host set so high, you lose the ability
+     to protect your server from massive numbers of inbound connections
+     from one IP.  In order to prevent your server from being DOS'd, you
+     need to add a per connection ratelimit to your connect ACL.  I
+     suggest something like this:
+
+  # Set max number of connections per host
+  LIMIT   = 5
+  # Or do some kind of IP lookup in a flat file or database
+  # LIMIT = ${lookup{$sender_host_address}iplsearch{/etc/exim/proxy_limits}}
+
+  defer   message        = Too many connections from this IP right now
+          ratelimit      = LIMIT / 5s / per_conn / strict
+
+
+5. Runtime issues to be aware of:
     - The proxy has 3 seconds (hard-coded in the source code) to send the
       required Proxy Protocol header after it connects.  If it does not,
       the response to any commands will be:
@@ -1131,7 +1056,7 @@ an example, in my connect ACL, I have:
       mail programs from working because that would require mail from
       localhost to use Proxy Protocol.  Again, not advised!
  
-5. Example of a refused connection because the Proxy Protocol header was
+6. Example of a refused connection because the Proxy Protocol header was
  not sent from a host configured to use Proxy Protocol.  In the example,
  the 3 second timeout occurred (when a Proxy Protocol banner should have
  been sent), the banner was displayed to the user, but all commands are
@@ -1205,6 +1130,7 @@ in a router. Exim will then send the success DSN himself if requested as if
  the next hop does not support DSN.
  Adding it to a redirect router makes no difference.
  
+
  Certificate name checking
  --------------------------------------------------------------
  The X509 certificates used for TLS are supposed be verified
@@ -1223,6 +1149,140 @@ a single wildcard being the initial component of a 3-or-more
  component FQDN).
  
  
+DANE
+------------------------------------------------------------
+DNS-based Authentication of Named Entities, as applied
+to SMTP over TLS, provides assurance to a client that
+it is actually talking to the server it wants to rather
+than some attacker operating a Man In The Middle (MITM)
+operation.  The latter can terminate the TLS connection
+you make, and make another one to the server (so both
+you and the server still think you have an encrypted
+connection) and, if one of the "well known" set of
+Certificate Authorities has been suborned - something
+which *has* been seen already (2014), a verifiable
+certificate (if you're using normal root CAs, eg. the
+Mozilla set, as your trust anchors).
+
+What DANE does is replace the CAs with the DNS as the
+trust anchor.  The assurance is limited to a) the possibility
+that the DNS has been suborned, b) mistakes made by the
+admins of the target server.   The attack surface presented
+by (a) is thought to be smaller than that of the set
+of root CAs.
+
+DANE scales better than having to maintain (and
+side-channel communicate) copies of server certificates
+for every possible target server.  It also scales
+(slightly) better than having to maintain on an SMTP
+client a copy of the standard CAs bundle.  It also
+means not having to pay a CA for certificates.
+
+DANE requires a server operator to do three things:
+1) run DNSSEC.  This provides assurance to clients
+that DNS lookups they do for the server have not
+been tampered with.  The domain MX record applying
+to this server, its A record, its TLSA record and
+any associated CNAME records must all be covered by
+DNSSEC.
+2) add TLSA DNS records.  These say what the server
+certificate for a TLS connection should be.
+3) offer a server certificate, or certificate chain,
+in TLS connections which is traceable to the one
+defined by (one of?) the TSLA records
+
+There are no changes to Exim specific to server-side
+operation of DANE.
+
+The TLSA record for the server may have "certificate
+usage" of DANE_TA(2) or DANE_EE(3).  The latter specifies
+the End Entity directly, i.e. the certificate involved
+is that of the server (and should be the sole one transmitted
+during the TLS handshake); this is appropriate for a
+single system, using a self-signed certificate.
+  DANE_TA usage is effectively declaring a specific CA
+to be used; this might be a private CA or a public,
+well-known one.  A private CA at simplest is just
+a self-signed certificate which is used to sign
+cerver certificates, but running one securely does
+require careful arrangement.  If a private CA is used
+then either all clients must be primed with it, or
+(probably simpler) the server TLS handshake must transmit
+the entire certificate chain from CA to server-certificate.
+If a public CA is used then all clients must be primed with it
+(losing one advantage of DANE) - but the attack surface is
+reduced from all public CAs to that single CA.
+DANE_TA is commonly used for several services and/or
+servers, each having a TLSA query-domain CNAME record,
+all of which point to a single TLSA record.
+
+The TLSA record should have a Selector field of SPKI(1)
+and a Matching Type field of SHA2-512(2).
+
+At the time of writing, https://www.huque.com/bin/gen_tlsa
+is useful for quickly generating TLSA records; and commands like
+
+  openssl x509 -in -pubkey -noout <certificate.pem \
+  | openssl rsa -outform der -pubin 2>/dev/null \
+  | openssl sha512 \
+  | awk '{print $2}'
+
+are workable for 4th-field hashes.
+
+For use with the DANE_TA model, server certificates
+must have a correct name (SubjectName or SubjectAltName).
+
+The use of OCSP-stapling should be considered, allowing
+for fast revocation of certificates (which would otherwise
+be limited by the DNS TTL on the TLSA records).  However,
+this is likely to only be usable with DANE_TA.  NOTE: the
+default is to request OCSP for all hosts; the certificate
+chain in DANE_EE usage will be insufficient to validate
+the OCSP proof and verification will fail.  Either disable
+OCSP completely or use the (new) variable $tls_out_tlsa_usage
+like so:
+
+  hosts_request_ocsp = ${if or { {= {4}{$tls_out_tlsa_usage}} \
+                                {= {0}{$tls_out_tlsa_usage}} } \
+                         {*}{}}
+The variable is a bitfield with numbered bits set for TLSA
+record usage codes. The zero above means DANE was not in use,
+the four means that only DANE_TA usage TLSA records were
+found. If the definition of hosts_require_ocsp or
+hosts_request_ocsp includes the string "tls_out_tlsa_usage",
+they are re-expanded in time to control the OCSP request.
+
+[ All a bit complicated.  Should we make that definition
+the default?  Should we override the user's definition? ]
+
+
+For client-side DANE there are two new smtp transport options,
+hosts_try_dane and hosts_require_dane.  They do the obvious thing.
+[ should they be domain-based rather than host-based? ]
+
+DANE will only be usable if the target host has DNSSEC-secured
+MX, A and TLSA records.
+
+(TODO: specify when fallback happens vs. when the host is not used)
+
+If dane is in use the following transport options are ignored:
+  tls_verify_hosts
+  tls_try_verify_hosts
+  tls_verify_certificates
+  tls_crl
+  tls_verify_cert_hostnames
+
+Currently dnssec_request_domains must be active (need to think about that)
+and dnssec_require_domains is ignored.
+
+If verification was successful using DANE then the "CV" item
+in the delivery log line will show as "CV=dane".
+
+There is a new variable $tls_out_dane which will have "yes" if
+verification succeeded using DANE and "no" otherwise (only useful
+in combination with EXPERIMENTAL_TPDA), and a new variable
+$tls_out_tlsa_usage (detailed above).
+
  
  --------------------------------------------------------------
  End of file