[10:29:52] !log admin T219626 replace labtestcontrol2003 with cloudcontrol2001-dev in the clouddb2001-dev database (codfw1dev deployment)
[10:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:30:16] T219626: codfw1dev: bootstrap cloudcontrol servers in mitaka/stretch - https://phabricator.wikimedia.org/T219626
[10:37:00] !log admin T219626 replace 208.80.153.75 with 208.80.153.59 in the clouddb2001-dev database (codfw1dev deployment)
[10:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:37:03] T219626: codfw1dev: bootstrap cloudcontrol servers in mitaka/stretch - https://phabricator.wikimedia.org/T219626
[11:27:42] !log admin T219626 add DB grants for neutron and glance to clouddb2001-dev (codfw1dev)
[11:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[11:27:45] T219626: codfw1dev: bootstrap cloudcontrol servers in mitaka/stretch - https://phabricator.wikimedia.org/T219626
[12:07:57] !log admin rebooting cloudvirt200[123]-dev because deep changes in config
[12:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[17:00:56] !log tools moving tools-k8s-master-01 to eqiad1-r
[17:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:07:01] !log paws move paws-proxy-02 to point to tools-paws-worker-1006 for upcoming master move
[17:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[17:15:01] !log tools add paws outage announcement in configmap hub-config
[17:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:15:25] !log paws move paws-proxy-02 reload nginx
[17:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[17:30:55] !log tools moving tools-paws-master-01 to eqiad1-r
[17:35:17] !log tools.stashbot Test
[17:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL
[20:22:37] o/ Is this the right place to ask someone about SSL certificates on Beta Cluster machines?
[20:23:36] dbrant: here, or in -releng
[20:24:05] The answer to most deployment-prep questions is: ask Krenair or Tyler
[20:24:15] and Tyler (aka thcipriani) is usually in -releng
[20:24:26] hi dbrant
[20:26:10] heya
[20:26:19] So here's the issue: when trying to connect to certain beta servers from the Android app, we've started getting certificate validation errors, along the lines of not being able to determine revocation status. (even though the cert looks ok when browsing through Chrome)...
[20:26:55] interesting
[20:27:08] I'm not an expert in this stuff, but the only lead I've been able to track is the following: https://github.com/DmcSDK/cordova-plugin-mediaPicker/issues/69#issuecomment-469666841
[20:27:30] "OCSP Stapling renewal"
[20:28:02] well the first thing to know is the host doing TLS termination for beta, with the exception of upload, is deployment-cache-text05
[20:28:20] as with prod it's done by nginx and the config is at /etc/nginx/sites-enabled/unified
[20:28:42] which does contain
[20:28:42] ssl_stapling on;
[20:28:43] ssl_stapling_file /etc/acmecerts/unified/live/rsa-2048.client.ocsp;
[20:28:43] ssl_stapling_file /etc/acmecerts/unified/live/ec-prime256v1.client.ocsp;
[20:32:17] I'm not very familiar with how OCSP works
[20:35:49] well
[20:36:02] :/ nor am I, but it seems like the likely culprit.
[20:36:10] root@deployment-cache-text05:~# openssl ocsp -issuer /etc/acmecerts/unified/live/rsa-2048.chain.crt -cert /etc/acmecerts/unified/live/rsa-2048.crt -text -url http://ocsp.int-x3.letsencrypt.org
[20:36:10] ...
[20:36:22] Response verify OK
[20:36:23] This Update: Apr 16 09:00:00 2019 GMT
[20:36:23] Next Update: Apr 23 09:00:00 2019 GMT
[20:36:36] same for the ECDSA cert
[20:38:00] dbrant, you don't have anything hardcoded about checking revocation status against DigiCert/GlobalSign but not LetsEncrypt do you?
[20:38:13] no, nothing like that
[20:38:52] can you show the error message you're getting?
[20:39:31] IIRC Firefox cares a lot more about OCSP than Chrome but the site is working for me in FF
[20:41:15] it's basically the same stack trace as the github issue I mentioned: https://github.com/DmcSDK/cordova-plugin-mediaPicker/issues/69#issuecomment-469597770
[20:42:30] dbrant, is the clock on this device accurate?
[20:42:53] yes, it happens on any device.
[20:43:29] We did have a user experience OCSP verification issues on prod recently and it turned out to be because their clock was wildly wrong, plus this stack trace has an exception about 'Response is unreliable: its validity interval is out-of-date'
[20:43:33] ok
[20:45:03] did this problem just start occurring recently?
[20:46:27] roughly within the last two or three weeks, but may be a bit longer.
[20:47:21] well
[20:47:26] I did redo how this all works recently
[20:49:02] root@deployment-cache-text05:~# /usr/lib/nagios/plugins/check-fresh-files-in-dir.py -c 259500 -w 173100 -d /var/cache/ocsp -g "*.ocsp"
[20:49:02] OK
[20:49:46] that seems to be what prod would do to monitor this stuff
[20:49:46] !log tools change paws announcement in configmap hub-config back to a welcome message
[20:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[20:52:35] dbrant, is there any chance you could take the code you use to talk to wikis and test just the HTTP part against something in prod using LE certs?
[20:53:07] like https://librenms.wikimedia.org/ ? no need to log in, just curious about whether OCSP works there
[20:53:41] will do...
[20:54:04] that should help make it clear whether this is something specifically broken in deployment-prep or more generally about the new LE puppetisation
[20:54:10] is there a task for this?
[20:56:52] No task yet, just started to dig into it.
[20:57:11] ok
[20:57:24] OK, the request to the librenms site was quite successful.
[20:57:32] huh
[20:57:51] do you have something minimal to reproduce the problem with against beta?
[20:59:59] Well it would need to be using the Android app, and monitoring the logs coming from it.
[21:00:27] can we take the HTTP library you're using and make a tiny java program to dig into
[21:00:35] the issue
[21:00:56] or does it rely on android APIs?
[21:02:00] I believe it's independent of Android... Let me try to slap something together.
[21:08:03] it'd be nice to be able to dig into exactly what it's expecting vs. what it gets from us
[21:08:21] and then figure out what is wrong from there
[21:08:35] plus it'd be nice to be able to reproduce it on an ordinary system instead of android
[21:24:48] Krenair: Here we go -- https://github.com/dbrant/okhttpssltest
[21:27:04] let's see if I can figure out how to build this
[21:27:50] oh sorry, you should just be able to check out the repo and run "./gradlew build" and then "./gradlew run"
[21:29:06] after installing gradle ;)
[21:30:48] > Could not find tools.jar. Please check that /usr/local/java/jre1.8.0_112 contains a valid JDK installation.
[21:30:57] alex@alex-laptop:~/Development/Wikimedia/okhttpssltest (master)$ /usr/local/java/jre1.8.0_112/bin/java -version
[21:30:57] java version "1.8.0_112"
[21:31:43] that looks like a JRE, but do you have a JDK?
[21:32:34] no tools.jar or javac in there
[21:32:36] so am guessing no
[21:32:54] must have one somewhere as javac is in my path
[21:33:12] lrwxrwxrwx 1 root root 23 Feb 9 2017 /usr/bin/javac -> /etc/alternatives/javac
[21:33:18] lrwxrwxrwx 1 root root 43 Feb 9 2017 /etc/alternatives/javac -> /usr/lib/jvm/java-8-openjdk-amd64/bin/javac
[21:33:24] maybe I can make it use that install
[21:34:55] not immediately obvious how
[21:36:25] there we go
[21:36:38] made a gradle.properties file with org.gradle.java.home=/usr/lib/jvm/java-8-openjdk-amd64
[21:37:55] alright so I built it and ran './gradlew run'
[21:38:13] > Task :run
[21:38:13] Response successful!
[21:38:13] BUILD SUCCESSFUL in 1s
[21:38:13] 2 actionable tasks: 1 executed, 1 up-to-date
[21:43:27] dbrant, was that supposed to fail?
[22:32:47] Krenair: sorry, stepped away. Yes, it's supposed to fail. (it's failing for me)
[22:33:01] very interesting
[22:33:11] can you show the full output you get?
[22:34:26] https://pastebin.com/G8gTmRaX
[22:38:24] Curious -- when I run it from my MacBook, it works.
[22:49:49] it fails on my Windows desktop (and Android) but not Linux or macOS.
[22:55:19] !log admin cloudcontrol2003-dev: added `exit 0` to /etc/cron.hourly/keystone to stop cron spam on partially configured cluster
[22:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[23:03:51] dbrant, interesting
[23:04:01] this does not have anything about OCSP
[23:05:54] sounds like a root cert issue
[23:06:36] yes
[23:06:48] the thing is we do send the intermediate cert for browsers which do not trust LE
[23:06:56] 0 s:/CN=*.wikimedia.beta.wmflabs.org
[23:06:57] i:/C=US/O=Let's Encrypt/CN=Let's Encrypt Authority X3
[23:07:03] 1 s:/C=US/O=Let's Encrypt/CN=Let's Encrypt Authority X3
[23:07:03] i:/O=Digital Signature Trust Co./CN=DST Root CA X3
[23:16:54] I wonder if you're getting caught out by https://letsencrypt.org/certificates/#ocsp-signing-certificate dbrant
[23:19:38] if the OCSP responses are signed by the ISRG root, how are they supposed to be used by browsers which rely on the DST root
[23:21:19] the other question is why would it work with librenms but not beta, their certs are issued by the same CA, the same intermediate is provided...
[23:28:46] I kind of have a feeling someone in -traffic is going to be able to take one look and know what is wrong
[23:30:20] dbrant, did you test it against librenms in windows?
[23:35:47] Hmm, on Windows librenms does *not* work, either.
[23:38:49] https://letsencrypt.org/docs/certificate-compatibility/
[23:39:00] "The main determining factor for whether a platform can validate Let’s Encrypt certificates is whether that platform includes IdenTrust’s DST Root X3 certificate in its trust store. A secondary factor is whether the platform supports modern SHA-2 certificates, since all Let’s Encrypt certificates use SHA-2."
[23:45:57] So, I'm curious what has changed between now and when it worked previously...