[00:04:31] PROBLEM - puppet last run on mw2214 is CRITICAL: CRITICAL: puppet fail [00:16:42] PROBLEM - puppet last run on mw2116 is CRITICAL: CRITICAL: puppet fail [00:25:22] (03CR) 10RobH: [C: 032] robh on vacation, removing from paging [puppet] - 10https://gerrit.wikimedia.org/r/300591 (owner: 10RobH) [00:32:51] RECOVERY - puppet last run on mw2214 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [00:34:32] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [00:36:32] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [00:37:41] !log doing an emergency deploy of https://gerrit.wikimedia.org/r/#/c/300679 for T141160, creates dozens of new users per hour to be unattached on loginwiki which probably has weird consequences [00:37:42] T141160: Unattached accounts created on registration - https://phabricator.wikimedia.org/T141160 [00:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:38:02] (03PS1) 10Yuvipanda: k8s: Upgrade to 1.3.3 [puppet] - 10https://gerrit.wikimedia.org/r/300688 [00:38:12] (03PS3) 10Yuvipanda: tools: Alert when iowait info is missing as well [puppet] - 10https://gerrit.wikimedia.org/r/300573 (https://phabricator.wikimedia.org/T141017) [00:38:21] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Alert when iowait info is missing as well [puppet] - 10https://gerrit.wikimedia.org/r/300573 (https://phabricator.wikimedia.org/T141017) (owner: 10Yuvipanda) [00:38:47] (03PS2) 10Yuvipanda: k8s: Upgrade to 1.3.3 [puppet] - 10https://gerrit.wikimedia.org/r/300688 [00:38:56] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Upgrade to 1.3.3 [puppet] - 10https://gerrit.wikimedia.org/r/300688 (owner: 10Yuvipanda) [00:39:42] Question: I just tried connecting to bast4001 for the first time, and the key fingerprint I got doesn't match the one on wikitech (https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/bast4001.wikimedia.org). Is wikitech outdated?
[00:40:10] This is what I got when I tried to SSH: `ECDSA key fingerprint is SHA256:UUsRMiUK9CkPg8yMiEPAKjs1PEhKxQPT+xhi4xRnjks.` [00:40:26] (03PS2) 10Ppchelko: Change-Prop: Updates to error hangling [puppet] - 10https://gerrit.wikimedia.org/r/300681 [00:44:52] RECOVERY - puppet last run on mw2116 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:46:47] (03PS3) 10Legoktm: Change-Prop: Updates to error handling [puppet] - 10https://gerrit.wikimedia.org/r/300681 (owner: 10Ppchelko) [00:53:42] PROBLEM - puppet last run on db2061 is CRITICAL: CRITICAL: puppet fail [01:00:46] !log tgr@tin Synchronized php-1.28.0-wmf.11/extensions/CentralAuth/includes/CentralAuthPrimaryAuthenticationProvider.php: T141160 (duration: 00m 28s) [01:00:47] T141160: Unattached accounts created on registration - https://phabricator.wikimedia.org/T141160 [01:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:01:25] !log tgr@tin Synchronized php-1.28.0-wmf.11/extensions/CentralAuth/includes/CentralAuthHooks.php: T141160 (duration: 00m 27s) [01:01:27] T141160: Unattached accounts created on registration - https://phabricator.wikimedia.org/T141160 [01:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:02:06] !log tgr@tin Synchronized php-1.28.0-wmf.11/extensions/CentralAuth/includes/CentralAuthPlugin.php: T141160 (duration: 00m 29s) [01:02:07] T141160: Unattached accounts created on registration - https://phabricator.wikimedia.org/T141160 [01:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:16:33] neilpquinn: looks like it (wikitech outdated) [01:19:53] RECOVERY - puppet last run on db2061 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [01:25:57] alex@alex-laptop:~$ ssh-keyscan -v -t rsa,ecdsa,ed25519 -p 22 bast4001.wikimedia.org [01:25:57] alex@alex-laptop:~$ [01:25:57] strange [01:27:43] okay I think it might just be my connection [01:32:13] of course all the servers I can successfully run it on don't have OpenSSH 6.8+, so no SHA256 fingerprints [01:37:45] neilpquinn, https://wikitech.wikimedia.org/w/index.php?title=Help:SSH_Fingerprints/bast4001.wikimedia.org&diff=781106&oldid=192162 [01:38:22] not sure why the host keys changed then [01:38:36] only thing in SAL is robh taking it down to add a new HDD [01:39:24] Thanks for investigating (and updating the docs), Krenair! [01:40:25] (03PS1) 10Chad: Gerrit: Remove googlebot from banned IPs. They ain't so bad [puppet] - 10https://gerrit.wikimedia.org/r/300692 [01:42:40] ori: Got a second to look at that? ^ [01:46:10] looks sane and I saw the context for this in the backlog on another channel, but mark wouldn't be very happy with me if I merged that -- I was asked not to merge things that have to do with security and not to merge on weekends, and this meets both criteria... sorry. [01:47:50] Okie dokie no worries. 
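For reference on the bast4001 fingerprint check above: when the local OpenSSH is older than 6.8 and only prints MD5 fingerprints, the SHA256 form listed on wikitech can still be reproduced from an `ssh-keyscan` line. A minimal sketch, assuming standard `host keytype base64-blob` keyscan output (the script and its filtering are illustrative, not an existing tool):

```python
# Compute OpenSSH-style SHA256 and MD5 fingerprints from host key lines,
# e.g. the output of `ssh-keyscan -t rsa,ecdsa,ed25519 bast4001.wikimedia.org`.
# Assumes each input line looks like: "hostname keytype base64-blob".
import base64
import hashlib
import sys

def fingerprints(blob):
    # SHA256 fingerprint: unpadded base64 of the SHA-256 digest of the key blob.
    sha256 = base64.b64encode(hashlib.sha256(blob).digest()).decode().rstrip('=')
    # MD5 fingerprint: colon-separated hex digest, as older OpenSSH prints it.
    md5hex = hashlib.md5(blob).hexdigest()
    md5 = ':'.join(md5hex[i:i + 2] for i in range(0, len(md5hex), 2))
    return 'SHA256:' + sha256, 'MD5:' + md5

if __name__ == '__main__':
    for line in sys.stdin:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        host, keytype, key_b64 = line.split()[:3]
        sha256, md5 = fingerprints(base64.b64decode(key_b64))
        print(host, keytype, sha256, md5)
```

Fed the ECDSA keyscan line for bast4001, the SHA256 value should match the `SHA256:UUsRMiUK...` string quoted above if the host key is the one wikitech documents.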
[01:48:17] It's not a rush at all, just saw you around :) [01:48:45] In that case I'm gonna go start_weekend() [01:49:53] * ori waves [01:55:14] neilpquinn, turns out the whole thing was reinstalled too, which explains the change [01:55:42] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 236 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [01:56:05] neilpquinn, https://phabricator.wikimedia.org/T129180 [02:01:53] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 17 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [02:21:08] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.11) (duration: 08m 24s) [02:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:26:49] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Jul 23 02:26:49 UTC 2016 (duration 5m 41s) [02:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:54:01] RECOVERY - cassandra-c CQL 10.192.32.139:9042 on restbase2004 is OK: TCP OK - 0.038 second response time on port 9042 [03:23:51] PROBLEM - Labs LDAP on seaborgium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:24:11] PROBLEM - SSH on seaborgium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:24:21] PROBLEM - salt-minion processes on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:24:21] PROBLEM - Disk space on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:24:21] PROBLEM - LibreNMS HTTPS on netmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:24:32] PROBLEM - puppet last run on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:24:42] PROBLEM - configured eth on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:25:02] PROBLEM - DPKG on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:25:02] PROBLEM - dhclient process on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:25:03] PROBLEM - Check size of conntrack table on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:25:11] PROBLEM - Tool Labs instance distribution on labcontrol1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:25:13] PROBLEM - Getent speed check on labstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:25:22] PROBLEM - Tool Labs instance distribution on labcontrol1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:27:23] PROBLEM - puppetmaster https on labcontrol1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:29:02] RECOVERY - Getent speed check on labstore1002 is OK: OK: getent group returns within a second [03:31:11] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures [03:32:31] RECOVERY - LibreNMS HTTPS on netmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8685 bytes in 7.327 second response time [03:34:12] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 341 bytes in 0.014 second response time [03:34:12] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 341 bytes in 0.021 second response time [03:35:03] PROBLEM - Puppet catalogue fetch on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/labs-puppetmaster/eqiad - 341 bytes in 0.011 second response time [03:35:12] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 341 bytes in 0.043 second response time [03:35:32] PROBLEM - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 341 bytes in 0.017 second response time [03:35:32] PROBLEM - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/flannel - 341 bytes in 0.039 second response time [03:38:57] PROBLEM - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/self - 341 bytes in 0.049 second response time [03:39:30] is someone doing something w/ tools checker?^ [03:39:31] YuviPanda: ? [03:39:52] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Puppet has 1 failures [03:40:26] PROBLEM - Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/ldap - 341 bytes in 0.009 second response time [03:40:26] PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/showmount - 341 bytes in 0.008 second response time [03:40:31] Is Gerrit down for anyone else? Is this planned? [03:41:09] I'm getting "server unavailable" messages after a long time of it sitting at "Working..." 
in the top bar [03:41:21] PROBLEM - Puppet catalogue fetch on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/labs-puppetmaster/eqiad - 341 bytes in 0.039 second response time [03:41:21] PROBLEM - Redis set/get on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/redis - 341 bytes in 0.038 second response time [03:41:21] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 341 bytes in 0.044 second response time [03:41:51] PROBLEM - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 341 bytes in 0.005 second response time [03:43:43] PROBLEM - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/flannel - 341 bytes in 0.027 second response time [03:43:43] uh [03:44:06] YuviPanda: I think seaborgium.wikimedia.org is down? [03:44:19] or is it [03:44:32] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 341 bytes in 0.032 second response time [03:44:32] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 341 bytes in 0.027 second response time [03:44:32] PROBLEM - Verify internal DNS from within Tools on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/labs-dns/private - 341 bytes in 0.045 second response time [03:44:32] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 341 bytes in 0.032 second response time [03:44:36] PROBLEM - NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 341 bytes in 0.048 second response time [03:45:16] PROBLEM - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/self - 341 bytes in 0.033 second response time [03:45:20] scap on beta is failing with: [03:45:21] 03:36:29 sudo: ldap_start_tls_s(): Can't contact LDAP server [03:45:21] 03:36:29 sudo: a password is required [03:46:33] PROBLEM - LibreNMS HTTPS on netmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:46:43] chasemp yeah, that might be it - I can't ssh in [03:47:43] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 341 bytes in 0.018 second response time [03:47:59] can ssh to serpens [03:48:10] !log gnt-instance reboot seaborgium.wikimedia.org 
[03:48:26] ^ that is taking a long time w/ no status [03:49:10] failover is going to be useless since all of labs ssh is down anyway [03:49:18] or at least tools-checker-01 is [03:49:39] nope, I can ssh to tools-login [03:49:44] just both checker nodes are unsshable [03:49:50] yeah so [03:49:51] RECOVERY - Labs LDAP on seaborgium is OK: LDAP OK - 0.028 seconds response time [03:49:55] I rebooted seaborgium [03:49:59] and I can ssh in now [03:50:00] so [03:50:03] RECOVERY - SSH on seaborgium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0) [03:50:10] serpens is not much of an effective backup it seems? and I don't get why [03:50:23] RECOVERY - Disk space on seaborgium is OK: DISK OK [03:50:23] RECOVERY - salt-minion processes on seaborgium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:50:23] RECOVERY - LibreNMS HTTPS on netmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8684 bytes in 0.178 second response time [03:50:24] although I'm not positive there aren't other problems [03:50:32] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [03:50:42] RECOVERY - configured eth on seaborgium is OK: OK - interfaces up [03:50:51] RECOVERY - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 56.432 second response time [03:50:51] RECOVERY - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 57.345 second response time [03:50:52] RECOVERY - showmount succeeds on a labs instance on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 31.245 second response time [03:50:56] RECOVERY - Test LDAP for query on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 31.587 second response time [03:50:56] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 16.269 second response time [03:50:56] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 17.039 second response time [03:50:56] RECOVERY - Verify internal DNS from within Tools on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 17.016 second response time [03:50:56] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 17.238 second response time [03:51:00] RECOVERY - NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 17.084 second response time [03:51:12] RECOVERY - dhclient process on seaborgium is OK: PROCS OK: 0 processes with command name dhclient [03:51:12] RECOVERY - DPKG on seaborgium is OK: All packages OK [03:51:13] RECOVERY - Check size of conntrack table on seaborgium is OK: OK: nf_conntrack is 0 % full [03:51:21] RECOVERY - Tool Labs instance distribution on labcontrol1001 is OK: OK: All critical toollabs instances are spread out enough [03:51:26] RECOVERY - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.047 second response time [03:51:26] RECOVERY - puppetmaster https on labcontrol1001 is OK: HTTP OK: Status line output matched 400 - 330 bytes in 0.058 second response time [03:51:27] RECOVERY - Puppet catalogue fetch on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 2.708 second response time 
[03:51:27] RECOVERY - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.743 second response time [03:51:27] RECOVERY - Redis set/get on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 2.693 second response time [03:51:31] RECOVERY - Tool Labs instance distribution on labcontrol1002 is OK: OK: All critical toollabs instances are spread out enough [03:51:51] 06Operations, 06Labs, 10Labs-Infrastructure: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#2489274 (10chasemp) I had to reboot seaborgium today as it froze up and took out ldap with it. > !log gnt-instance reboot seaborgium.wikimedia.org I would say...definitely something is sti... [03:52:01] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.126 second response time [03:53:01] well, it seems somewhat clear that was the deal [03:53:13] I left a note in https://phabricator.wikimedia.org/T130593 [03:53:17] (03PS1) 10Yuvipanda: tools: Run toolschecker in debug mode [puppet] - 10https://gerrit.wikimedia.org/r/300696 [03:53:41] chasemp ^ alll the toolschecker logs only said '503' and didn't actually mention what or why... [03:53:45] this should help if it happens again [03:53:49] ok [03:54:00] (03PS2) 10Yuvipanda: tools: Run toolschecker in debug mode [puppet] - 10https://gerrit.wikimedia.org/r/300696 [03:54:09] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Run toolschecker in debug mode [puppet] - 10https://gerrit.wikimedia.org/r/300696 (owner: 10Yuvipanda) [03:55:21] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [03:58:25] 06Operations, 13Patch-For-Review: Staging area for the next version of the transparency report - https://phabricator.wikimedia.org/T138197#2392917 (10Peachey88) @ori Are there any tasks remaining to keep this task open? [04:00:03] PROBLEM - NTP on seaborgium is CRITICAL: NTP CRITICAL: Offset unknown [04:02:14] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [04:08:12] RECOVERY - NTP on seaborgium is OK: NTP OK: Offset -0.0009208917618 secs [04:24:12] PROBLEM - Unmerged changes on repository puppet on labcontrol1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:24:41] PROBLEM - Unmerged changes on repository puppet on labcontrol1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:25:13] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:25:21] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:25:21] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:25:21] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:25:23] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:29:05] YuviPanda, chasemp: I think gerrit is down? [04:32:17] gerrit is down [04:32:36] legoktm: yup, it's down. I asked several friends [04:32:42] hey btw. 
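On the toolschecker debug-mode patch (300696) above: the problem being addressed is that the checker's logs only recorded bare 503s, with nothing about which check actually blew up. A rough sketch of the idea, assuming a Flask-style WSGI checker app (endpoint names and the failing check are illustrative, not toolschecker's actual code):

```python
# Illustrative sketch: log the traceback behind a failing check instead of
# returning an unexplained 503. Not the real toolschecker implementation.
import logging
from flask import Flask

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)

def ldap_check():
    # Placeholder for a real check (LDAP bind, grid job submission, NFS write, ...).
    raise RuntimeError("LDAP server did not respond within 10s")

@app.route('/self')
def self_check():
    return "OK", 200

@app.route('/ldap')
def ldap_endpoint():
    try:
        ldap_check()
    except Exception:
        # This is what a bare 503 hides: record the traceback for the next outage.
        app.logger.exception("ldap check failed")
        return "NOT OK", 503
    return "OK", 200

if __name__ == '__main__':
    # Running with debug=True (what the patch enables) also surfaces tracebacks;
    # explicit logging keeps them even behind a production WSGI server.
    app.run(debug=True)
```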
[04:48:46] !log Users report Gerrit is down; on ytterbium java is occupying two cores at 100% [04:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:51:29] I'm going to dump core and restart it [04:51:35] the service, that is; not the host [04:52:52] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: puppet fail [04:56:44] !log Restarting Gerrit on ytterbium [04:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:57:20] it's back up [04:58:17] !log Gerrit is back up after service restart; was unavailable between ~ 04:29 - 04:57 UTC [04:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:58:22] PROBLEM - Unmerged changes on repository puppet on labcontrol1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:58:52] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:58:52] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [04:59:01] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [04:59:01] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge. [04:59:02] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [04:59:39] legoktm: in the future please page ops for something like this [04:59:51] RECOVERY - Unmerged changes on repository puppet on labcontrol1001 is OK: No changes to merge. [05:00:02] Gerrit is not a high-traffic production wiki, but it is important infrastructure, and it can't just sit dead in the water [05:00:21] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: puppet fail [05:00:21] RECOVERY - Unmerged changes on repository puppet on labcontrol1002 is OK: No changes to merge. [05:00:22] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 3 failures [05:00:51] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [05:01:33] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 1 failures [05:01:41] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures [05:02:02] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 1 failures [05:02:02] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: Puppet has 1 failures [05:02:43] ori: ack, will do next time [05:04:01] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [05:08:16] legoktm: first user reports of gerrit being down are from 04:29, but I see suspicious log messages starting from 04:08. is it possible it became unresponsive around then, or were you working on gerrit right before 04:29? [05:08:50] ori: no, that's just when I tried opening a patch [05:09:12] [21:27:34] Wikipedia-Android-App-Backlog, Easy, Mobile-App-Android-Sprint-87-Francium, Patch-For-Review, Unplanned-Sprint-Work: Remove support for custom application ID - https://phabricator.wikimedia.org/T103903#2489281 (Niedzielski) I have a patch for this but Gerrit is down. Will post REAL soon. [05:09:23] [21:23:12] and is Gerrit down for everyone? 
(in #mediawiki) [05:10:10] niedzielski-afk: it's back up [05:14:21] PROBLEM - Disk space on lithium is CRITICAL: DISK CRITICAL - free space: /srv/syslog 14894 MB (3% inode=99%) [05:18:32] PROBLEM - puppet last run on maps2002 is CRITICAL: CRITICAL: puppet fail [05:19:42] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [05:20:03] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:22:32] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [05:25:11] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:27:12] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:31:29] 06Operations, 10MediaWiki-extensions-UrlShortener, 10Traffic: Strip query parameters from w.wiki domain - https://phabricator.wikimedia.org/T141170#2489295 (10Legoktm) [05:32:02] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:46:03] RECOVERY - puppet last run on maps2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:05:01] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.24 [06:06:51] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [06:29:12] PROBLEM - Disk space on lithium is CRITICAL: DISK CRITICAL - free space: /srv/syslog 14780 MB (3% inode=99%) [06:30:51] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: puppet fail [06:31:22] PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:22] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:23] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:32] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:32] PROBLEM - puppet last run on mw2228 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:33] PROBLEM - puppet last run on es2013 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:33] PROBLEM - puppet last run on ms-be2026 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:23] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 2 failures [06:39:12] RECOVERY - Disk space on lithium is OK: DISK OK [06:56:32] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [06:57:02] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:57:02] RECOVERY - puppet last run on wtp2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:03] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [06:57:11] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:57:12] RECOVERY - puppet last run on mw2228 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:57:12] RECOVERY - puppet last run on es2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:12] RECOVERY - puppet last run on ms-be2026 is OK: 
OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:02] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:38:03] PROBLEM - SSH on db1066 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:38:21] PROBLEM - MegaRAID on db1066 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:39:52] RECOVERY - SSH on db1066 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0) [07:40:03] RECOVERY - MegaRAID on db1066 is OK: OK: optimal, 1 logical, 2 physical [08:41:22] PROBLEM - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 185 bytes in 1.498 second response time [11:06:49] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2489499 (10Jonas) https://wikitech.wikimedia.org/wiki/User:Jonas_Kress_(WMDE) Thanks! [11:22:52] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [11:24:16] That can't be good [11:26:51] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [11:42:11] 06Operations, 06Discovery, 10netops, 03Discovery-Search-Sprint: deploy elasticsearch/plugins to relforge1001-1002 servers - https://phabricator.wikimedia.org/T141085#2489518 (10Aklapper) [12:40:22] PROBLEM - puppet last run on ms-be2018 is CRITICAL: CRITICAL: puppet fail [12:51:38] travelling back to home base, back online in about 4 hours I guess [13:06:03] RECOVERY - puppet last run on ms-be2018 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [13:46:31] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage, and 3 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2214216 (10Josve05a) I can't delete https://commons.wikimedia.org/wiki/File:%D8%B3%... 
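As an aside on the "MediaWiki exceptions and fatals per minute" alerts that keep flapping above: the check pulls a Graphite series and alerts when more than a set fraction of recent datapoints sits above a threshold (hence the "20.00% of data above the critical threshold [50.0]" wording). A rough sketch of that logic, assuming Graphite's JSON render API; the URL, metric name and limits here are illustrative, not the production check's actual parameters:

```python
# Rough sketch of a "percentage of datapoints above threshold" Graphite check.
# URL, target and limits are illustrative assumptions, not the real check config.
import requests

GRAPHITE = "https://graphite.example.org/render"
TARGET = "MediaWiki.errors.fatals_per_minute"   # hypothetical metric name
WARN, CRIT = 25.0, 50.0                          # thresholds quoted in the alerts
MAX_FRACTION = 0.01                              # alert if >1% of points exceed them

def check(window="-10min"):
    resp = requests.get(GRAPHITE, timeout=10, params={
        "target": TARGET, "from": window, "format": "json"})
    resp.raise_for_status()
    points = [v for series in resp.json()
              for v, _ts in series["datapoints"] if v is not None]
    if not points:
        return "UNKNOWN: no datapoints"
    above_crit = sum(v > CRIT for v in points) / float(len(points))
    above_warn = sum(v > WARN for v in points) / float(len(points))
    if above_crit > MAX_FRACTION:
        return "CRITICAL: %.2f%% of data above the critical threshold [%s]" % (
            100 * above_crit, CRIT)
    if above_warn > MAX_FRACTION:
        return "WARNING: %.2f%% of data above the threshold [%s]" % (
            100 * above_warn, WARN)
    return "OK: Less than %.2f%% above the threshold [%s]" % (
        100 * MAX_FRACTION, WARN)

if __name__ == "__main__":
    print(check())
```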
[14:08:42] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [14:10:33] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5082716 keys - replication_delay is 0 [15:24:31] PROBLEM - Disk space on lithium is CRITICAL: DISK CRITICAL - free space: /srv/syslog 14923 MB (3% inode=99%) [15:37:12] !log lithium sudo lvextend --size +10G -r /dev/mapper/lithium--vg-syslog [15:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:38:43] !log stop swift in esams test cluster, lots of logging from there [15:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:57:00] 06Operations, 06Discovery, 06Maps, 10Monitoring: Map caches metrics look broken - https://phabricator.wikimedia.org/T141186#2489781 (10MaxSem) [17:09:32] PROBLEM - puppet last run on mw2078 is CRITICAL: CRITICAL: puppet fail [17:38:31] RECOVERY - puppet last run on mw2078 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:46:42] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [18:52:42] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [19:13:23] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Brentjoseph (bcohn) - https://phabricator.wikimedia.org/T140449#2490000 (10Brentjoseph) @Dzahn done! sorry for the delay: wikitech username: Brentjoseph shell account name: bc... 
[19:38:42] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [19:44:42] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [20:15:32] (03PS1) 10Yuvipanda: Add a logster for parsing out top level urls [debs/logster] - 10https://gerrit.wikimedia.org/r/300717 [20:18:31] (03PS2) 10Yuvipanda: Add a logster for parsing out top level urls [debs/logster] - 10https://gerrit.wikimedia.org/r/300717 [20:22:12] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Puppet has 1 failures [20:25:21] (03PS3) 10Yuvipanda: Add a logster for parsing out top level urls [debs/logster] - 10https://gerrit.wikimedia.org/r/300717 [20:25:53] (03PS4) 10Yuvipanda: Add a logster for parsing out top level urls [debs/logster] - 10https://gerrit.wikimedia.org/r/300717 [20:27:02] (03PS5) 10Yuvipanda: Add a logster for parsing out top level urls [debs/logster] - 10https://gerrit.wikimedia.org/r/300717 [20:28:18] (03PS6) 10Yuvipanda: Add a logster for parsing out top level urls [debs/logster] - 10https://gerrit.wikimedia.org/r/300717 [20:29:29] (03PS7) 10Yuvipanda: Add a logster for parsing out top level urls [debs/logster] - 10https://gerrit.wikimedia.org/r/300717 [20:33:01] PROBLEM - puppet last run on mw1177 is CRITICAL: CRITICAL: Puppet has 1 failures [20:33:32] (03PS8) 10Yuvipanda: Add a logster for parsing out top level urls [debs/logster] - 10https://gerrit.wikimedia.org/r/300717 [20:48:22] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [20:54:42] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [20:54:46] (03PS1) 10Yuvipanda: tools: Track per tool request stats in graphite [puppet] - 10https://gerrit.wikimedia.org/r/300737 (https://phabricator.wikimedia.org/T69880) [20:55:13] (03PS2) 10Yuvipanda: tools: Track per tool request stats in graphite [puppet] - 10https://gerrit.wikimedia.org/r/300737 (https://phabricator.wikimedia.org/T69880) [20:58:11] (03PS1) 10Hashar: contint: role for Android testing [puppet] - 10https://gerrit.wikimedia.org/r/300738 (https://bugzilla.wikimedia.org/139137) [20:58:52] RECOVERY - puppet last run on mw1177 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:00:51] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [21:01:09] (03PS3) 10Yuvipanda: tools: Track per tool request stats in graphite [puppet] - 10https://gerrit.wikimedia.org/r/300737 (https://phabricator.wikimedia.org/T69880) [21:02:01] (03CR) 10Paladox: "You misspelt the bug." 
[puppet] - 10https://gerrit.wikimedia.org/r/300738 (https://bugzilla.wikimedia.org/139137) (owner: 10Hashar) [21:04:22] PROBLEM - Host eeden is DOWN: PING CRITICAL - Packet loss = 100% [21:04:22] PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100% [21:04:31] (03PS2) 10Luke081515: contint: role for Android testing [puppet] - 10https://gerrit.wikimedia.org/r/300738 (https://phabricator.wikimedia.org/T139137) (owner: 10Hashar) [21:05:03] now it makes sense^^ [21:06:06] Thanks [21:18:12] PROBLEM - puppet last run on db2059 is CRITICAL: CRITICAL: puppet fail [21:30:43] RECOVERY - Host eeden is UP: PING OK - Packet loss = 0%, RTA = 84.27 ms [21:31:21] RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 83.28 ms [21:31:31] (03CR) 10Hashar: "Thank you Paladox and Luke081515 :-]" [puppet] - 10https://gerrit.wikimedia.org/r/300738 (https://phabricator.wikimedia.org/T139137) (owner: 10Hashar) [21:31:57] (03CR) 10Paladox: "You're welcome :)." [puppet] - 10https://gerrit.wikimedia.org/r/300738 (https://phabricator.wikimedia.org/T139137) (owner: 10Hashar) [21:36:51] PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: Puppet has 3 failures [21:44:23] RECOVERY - puppet last run on db2059 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [21:45:12] RECOVERY - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.164 second response time [21:52:43] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [21:58:52] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [22:00:52] RECOVERY - puppet last run on eeden is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [23:32:41] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [23:38:29] (03PS1) 10Yuvipanda: tools: Add timeout to ldap connection for maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/300741 (https://phabricator.wikimedia.org/T141203) [23:38:41] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [23:39:17] (03CR) 10Yuvipanda: [C: 032] tools: Track per tool request stats in graphite [puppet] - 10https://gerrit.wikimedia.org/r/300737 (https://phabricator.wikimedia.org/T69880) (owner: 10Yuvipanda) [23:39:54] (03Abandoned) 10Yuvipanda: Add a logster for parsing out top level urls [debs/logster] - 10https://gerrit.wikimedia.org/r/300717 (owner: 10Yuvipanda) [23:42:20] (03CR) 10Alex Monk: "Isn't this supposed to be given to the ldap3.Server constructor?" [puppet] - 10https://gerrit.wikimedia.org/r/300741 (https://phabricator.wikimedia.org/T141203) (owner: 10Yuvipanda)
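On the closing review question ("Isn't this supposed to be given to the ldap3.Server constructor?"): ldap3 does take the network-level timeout on the Server object (connect_timeout), with a separate receive_timeout available on the Connection. A minimal sketch of the pattern under discussion; the hosts, DNs and 10-second values are illustrative, not maintain-kubeusers' actual settings:

```python
# Minimal sketch of the timeout pattern discussed in the review above.
# Hosts, DNs and the 10-second values are illustrative placeholders.
import ldap3

servers = [
    # connect_timeout bounds the TCP connect, so a hung LDAP server
    # makes the script fail fast instead of blocking indefinitely.
    ldap3.Server(host, connect_timeout=10)
    for host in ("seaborgium.wikimedia.org", "serpens.wikimedia.org")
]
pool = ldap3.ServerPool(servers, ldap3.ROUND_ROBIN, active=True, exhaust=True)

conn = ldap3.Connection(
    pool,
    user="uid=example,ou=people,dc=wikimedia,dc=org",  # placeholder bind DN
    password="not-the-real-password",
    receive_timeout=10,   # also bound waiting for responses, not just the connect
    auto_bind=True,
)
conn.search("dc=wikimedia,dc=org", "(objectClass=posixAccount)",
            attributes=["uid"], size_limit=5)
print(conn.entries)
```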