[01:17:17] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:22:58] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1461 bytes in 0.144 second response time [01:34:49] PROBLEM - Mobile HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 21.43% of data above the critical threshold [150.0] [01:34:57] PROBLEM - HTTP 5xx reqs/min -https://grafana.wikimedia.org/dashboard/db/varnish-http-errors- on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [01:34:58] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1018 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [10.0] [01:35:12] PROBLEM - Kafka Broker Server on kafka1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties [01:35:13] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 21.43% of data above the critical threshold [150.0] [01:35:27] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [10.0] [01:38:08] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 21.43% of data above the critical threshold [150.0] [01:38:08] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1012 is CRITICAL: CRITICAL: 85.71% of data above the critical threshold [10.0] [01:38:08] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 21.43% of data above the critical threshold [150.0] [01:38:08] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 87.50% of data above the critical threshold [10.0] [01:38:48] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 23.08% of data above the critical threshold [150.0] [01:39:38] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 21.43% of data above the critical threshold [150.0] [01:42:38] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] [01:42:47] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] [01:45:48] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1012 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] [01:47:18] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [01:48:00] PROBLEM - Disk space on antimony is CRITICAL: DISK CRITICAL - free space: / 3672 MB (3% inode=69%) [01:49:38] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [01:49:38] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [01:50:18] RECOVERY - Mobile HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [01:50:19] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [01:50:19] RECOVERY - HTTP 5xx reqs/min -https://grafana.wikimedia.org/dashboard/db/varnish-http-errors- on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [01:50:34] RECOVERY - Kafka Broker Server on kafka1013 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties [01:50:47] RECOVERY - 
Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [01:51:38] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1012 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [10.0] [01:51:47] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [10.0] [01:52:28] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [10.0] [01:56:17] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [10.0] [01:59:28] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1012 is OK: OK: Less than 1.00% above the threshold [1.0] [02:00:12] !log l10nupdate@tin LocalisationUpdate failed: git pull of core failed [02:00:17] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1014 is OK: OK: Less than 1.00% above the threshold [1.0] [02:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:01:38] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1022 is OK: OK: Less than 1.00% above the threshold [1.0] [02:04:38] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1020 is OK: OK: Less than 1.00% above the threshold [1.0] [02:04:41] Starting l10nupdate at Fri Nov 27 02:00:01 UTC 2015. [02:04:41] Updating git clone ... [02:04:41] fatal: You don't exist. Go away! [02:04:42] Unexpected end of command stream [02:04:42] Updating core FAILED. [02:08:08] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1018 is OK: OK: Less than 1.00% above the threshold [1.0] [02:11:24] 6operations, 10Wikimedia-General-or-Unknown: localisationupdate broken on wmf wikis - https://phabricator.wikimedia.org/T119746#1834606 (10Reedy) 3NEW [02:14:16] !log killed pre 1.27 l10n cache files from tin:/var/lib/l10nupdate/caches [02:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:21:18] Reedy, yeah, that's about as far as I got too :( [02:21:32] it's running fine manually [02:21:36] yup [02:22:09] mutante probably broke it (unintentionally) [02:22:24] oh, actually, there was one thing [02:22:32] at some point it failed for a different reason on me [02:22:35] without logging an error [02:22:50] since https://gerrit.wikimedia.org/r/255321 it should log failure to SAL now [02:24:25] # modified: portals (new commits) [02:25:20] Rebuilding localization cache at 2015-11-27 02:18:42+00:00 [02:45:26] !log reedy@tin Synchronized php-1.27.0-wmf.7/cache/l10n: l10nupdate for 1.27.0-wmf.7 (duration: 09m 10s) [02:45:27] !log l10nupdate@tin LocalisationUpdate failed: Failed to sync-dir 'php-1.27.0-wmf.7/cache/l10n' [02:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:45:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:45:41] Well isn't that stupid [02:45:46] It worked fine, master co-sync didn't [02:47:47] !log manually running dsh -g mediawiki-installation -M -F 40 -- "sudo -u mwdeploy /srv/deployment/scap/scap/bin/scap-rebuild-cdbs" [02:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:51:19] PROBLEM - HHVM rendering on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:52:48] PROBLEM - Apache HTTP on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:08] PROBLEM - SSH on mw1135 
is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:08] PROBLEM - Check size of conntrack table on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:53:38] PROBLEM - DPKG on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:53:46] https://github.com/wikimedia/operations-puppet/blob/812f280d16acfe3083259e8dfa7ce12ebf71da87/modules/scap/files/l10nupdate-1#L109 [02:53:52] Good to see NOLOGMSG=1 working [02:54:08] PROBLEM - puppet last run on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:54:17] PROBLEM - RAID on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:54:27] PROBLEM - configured eth on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:54:47] PROBLEM - dhclient process on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:54:49] PROBLEM - nutcracker port on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:55:18] PROBLEM - nutcracker process on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:55:48] PROBLEM - HHVM processes on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:55:49] PROBLEM - salt-minion processes on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:56:58] PROBLEM - Disk space on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:02:38] RECOVERY - dhclient process on mw1135 is OK: PROCS OK: 0 processes with command name dhclient [03:02:39] RECOVERY - nutcracker port on mw1135 is OK: TCP OK - 0.000 second response time on port 11212 [03:02:48] RECOVERY - Disk space on mw1135 is OK: DISK OK [03:02:58] RECOVERY - SSH on mw1135 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [03:02:58] RECOVERY - Check size of conntrack table on mw1135 is OK: OK: nf_conntrack is 0 % full [03:03:08] RECOVERY - nutcracker process on mw1135 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [03:03:28] RECOVERY - DPKG on mw1135 is OK: All packages OK [03:03:37] RECOVERY - HHVM processes on mw1135 is OK: PROCS OK: 6 processes with command name hhvm [03:03:38] RECOVERY - salt-minion processes on mw1135 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:03:57] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 36 minutes ago with 0 failures [03:03:58] RECOVERY - RAID on mw1135 is OK: OK: no RAID installed [03:04:08] RECOVERY - configured eth on mw1135 is OK: OK - interfaces up [04:42:27] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:42:47] PROBLEM - puppet last run on mw2060 is CRITICAL: CRITICAL: Puppet has 1 failures [04:48:08] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1464 bytes in 1.411 second response time [05:07:58] RECOVERY - puppet last run on mw2060 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:59:02] 6operations: Investigate redis connections errors on rdb100[13] - https://phabricator.wikimedia.org/T119739#1834711 (10ori) How can you tell which redis backends are involved? They are not mentioned in the exception message, so you must be doing some additional work to correlate log entries with backends. Probab... 
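ori's question just above, and Joe's follow-up comments later in this log (RDB snapshots being written every 60 seconds, plus the upstart daemonize fix in Gerrit 255664), both come down to inspecting the live redis instances. A minimal, hypothetical redis-cli sketch of that kind of inspection, assuming an instance on port 6380 as in the rdb1003 excerpt quoted further down; this is not the procedure actually used:

```
# Hypothetical inspection of one redis instance (port 6380 as in the rdb1003
# log excerpt); everything except the last command is read-only.
redis-cli -p 6380 CONFIG GET save          # snapshot schedule, e.g. "900 1 300 10 60 10000"
redis-cli -p 6380 CONFIG GET daemonize     # must agree with what the init system expects
redis-cli -p 6380 INFO persistence | grep -E 'rdb_bgsave_in_progress|rdb_last_bgsave_status'
redis-cli -p 6380 INFO clients             # connected_clients spikes correlate with the errors
# Relaxing the snapshot schedule at runtime (not persisted to redis.conf):
redis-cli -p 6380 CONFIG SET save "900 1 300 10"
```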
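On the 02:00 l10nupdate failure earlier in the log: "fatal: You don't exist. Go away!" is git's standard complaint when it cannot look up the calling uid in the passwd database while building a committer identity (needed even for a pull, e.g. for the reflog entry), which is consistent with the same command "running fine manually" from a normal login. A hypothetical check for that condition from the account the cron job uses, not the actual debugging steps taken here:

```
# Hypothetical checks from the account the l10nupdate job runs as; an empty
# getent result is the condition that produces "You don't exist. Go away!".
id -u                        # uid the job actually runs under
getent passwd "$(id -u)"     # does NSS (files/LDAP) resolve that uid right now?
git config --get user.name   # explicit identity, if one is configured
git config --get user.email
```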
[06:18:15] !log coal metrics flat since 01:25:00 AM UTC [06:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:18:57] PROBLEM - puppet last run on restbase-test2002 is CRITICAL: CRITICAL: puppet fail [06:30:57] PROBLEM - puppet last run on db2060 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:58] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:59] PROBLEM - puppet last run on db2058 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:17] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:28] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:38] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:39] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 3 failures [06:32:49] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:58] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:59] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:59] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures [06:44:17] RECOVERY - puppet last run on restbase-test2002 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:56:08] RECOVERY - puppet last run on db2060 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:17] RECOVERY - puppet last run on db2058 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:48] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:57:18] !log ran eventloggingctl stop / eventloggingctl start on eventlog1001 [06:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:58:08] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:27] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:23:38] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 35.71% of data above the critical threshold [150.0] [07:24:37] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 35.71% of data above the critical threshold [150.0] [07:25:27] PROBLEM - HTTP 5xx reqs/min -https://grafana.wikimedia.org/dashboard/db/varnish-http-errors- on graphite1001 is CRITICAL: CRITICAL: 35.71% of data above the critical threshold [500.0] [07:25:59] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [07:27:19] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [07:27:37] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [07:27:37] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:27:38] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:27:39] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [07:33:11] 6operations: Investigate redis connections errors on rdb100[13] - 
https://phabricator.wikimedia.org/T119739#1834733 (10Joe) From a first inspection, I see we have rdb enabled again. and it's been saved every 60 seconds; this might be a source of problems; however without some more logging in mediawiki I am unabl...
[07:37:41] 6operations: Investigate redis connections errors on rdb100[13] - https://phabricator.wikimedia.org/T119739#1834734 (10Joe) p:5Triage>3High
[07:39:27] RECOVERY - HTTP 5xx reqs/min -https://grafana.wikimedia.org/dashboard/db/varnish-http-errors- on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:39:27] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0]
[07:50:05] 6operations: Investigate redis connections errors on rdb100[13] - https://phabricator.wikimedia.org/T119739#1834735 (10Joe) Correlating a shower of failures from rdb1003 I found the following in the logs: ``` root@rdb1003:/srv/redis# grep 07:45:40 /var/log/redis/tcp_6380.log [1184] 27 Nov 07:45:40.665 * 10000 c...
[07:50:15] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0]
[08:12:41] (03PS1) 10Giuseppe Lavagetto: redis: the service is not daemonized by default on ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/255664
[08:14:49] (03CR) 10Giuseppe Lavagetto: [C: 032] redis: the service is not daemonized by default on ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/255664 (owner: 10Giuseppe Lavagetto)
[08:31:09] <_joe_> !log rolling restart of all redis instances on rdb1001 and 1003 (later 1002/4) to fix the upstart bug
[08:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:36:23] PROBLEM - puppet last run on rdb1001 is CRITICAL: CRITICAL: puppet fail
[08:38:11] RECOVERY - puppet last run on rdb1001 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[08:49:29] PROBLEM - puppet last run on antimony is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[10:13:39] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [150.0]
[10:16:13] RECOVERY - puppet last run on rdb1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:16:58] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [150.0]
[10:18:08] (03CR) 10Alexandros Kosiaris: [C: 031] Add extra_schemas to openldap labs role [puppet] - 10https://gerrit.wikimedia.org/r/255666 (owner: 10Muehlenhoff)
[10:18:38] RECOVERY - puppet last run on antimony is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:18:47] PROBLEM - HTTP 5xx reqs/min -https://grafana.wikimedia.org/dashboard/db/varnish-http-errors- on graphite1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[10:19:38] RECOVERY - puppet last run on ms-be1021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:19:58] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[10:19:58] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:13:12] tto: it's been kicked repeatedly, no joy
[12:13:19] there's an open ticket in phab
[12:13:20] stupid thing
[12:13:42] I tried Phabricator Diffusion, but it doesn't seem to let you search repositories by name!
[12:14:07] github as a temp, workaround?
[12:14:25] ah yes, there's that too [12:14:28] or diffusion, everything's in there isn't it? [12:15:23] https://github.com/wikimedia/integration-uprightdiff doesn't appear to be in diffusion..? [12:15:24] As I said, it's very difficult to find anything in Diffusion [12:23:12] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [150.0] [12:23:27] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [150.0] [12:23:53] <_joe_> uhm issues? [12:24:43] <_joe_> yay graphite down [12:25:51] https://logstash.wikimedia.org/#/dashboard/elasticsearch/redis is looking happier [12:26:10] <_joe_> AaronSchulz: I know [12:26:58] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [12:27:23] :) [12:28:02] looking at graphite [12:29:22] <_joe_> btw these specific alarms are helpful after all :) [12:33:01] <_joe_> AaronSchulz: we also had the upstart config that had an error, fixing which probably caused a few errors itself [12:33:08] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 7.02 ms [12:33:21] <_joe_> (the ones before of the sudden stop [12:56:49] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [13:08:09] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:09:18] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 11.43 ms [13:09:58] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 48383 bytes in 0.627 second response time [13:13:28] 6operations, 10netops, 7Monitoring: Juniper monitoring - https://phabricator.wikimedia.org/T83992#1835065 (10faidon) [13:14:57] 6operations, 10netops, 7Monitoring: Juniper monitoring - https://phabricator.wikimedia.org/T83992#921206 (10faidon) [13:16:24] (03PS1) 10Filippo Giunchedi: graphite: add http referer ban capability [puppet] - 10https://gerrit.wikimedia.org/r/255695 (https://phabricator.wikimedia.org/T119718) [13:17:05] 6operations, 7Graphite, 5Patch-For-Review: Make it easier to ban misbehaving dashboards from graphite - https://phabricator.wikimedia.org/T119718#1835072 (10fgiunchedi) moreover, it should be possible to entirely ban grafana (i.e. 
`POST /render`) so that for example `check_graphite` isn't affected [13:18:08] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:19:39] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [150.0] [13:21:37] PROBLEM - HTTP 5xx reqs/min -https://grafana.wikimedia.org/dashboard/db/varnish-http-errors- on graphite1001 is CRITICAL: CRITICAL: 21.43% of data above the critical threshold [500.0] [13:22:37] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [150.0] [13:27:38] RECOVERY - Labs LDAP on serpens is OK: LDAP OK - 0.107 seconds response time [13:31:00] 6operations, 10netops: Investigate why disabling an uplink port did not deprioritize VRRP on cr2-eqiad - https://phabricator.wikimedia.org/T119759#1835092 (10mark) [13:33:57] RECOVERY - HTTP 5xx reqs/min -https://grafana.wikimedia.org/dashboard/db/varnish-http-errors- on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:33:59] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [13:34:57] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [13:43:58] 6operations, 10Gitblit: git.wikimedia.org down: 504 Gateway Time-out - https://phabricator.wikimedia.org/T119701#1835097 (10Aklapper) >>! In T119701#1835031, @planetenxin wrote: > Is it possible to still download a specific commit as tar.gz when using Phabricator Diffusion? See T111887 [13:53:56] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, 10Traffic: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1835115 (10BurritoBazooka) It seems this image is also affected: https://commons.wikimedia.org/wiki/File:Glasl's_Model_of_Confl... 
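godog's T119718 comment above is about being able to ban a single misbehaving dashboard (or grafana as a whole) without affecting check_graphite. A hypothetical Apache sketch of that idea, assuming graphite-web sits behind Apache with mod_rewrite enabled and using a placeholder dashboard name; the real change is the puppet patch in Gerrit 255695, which this does not reproduce:

```
# Hypothetical: reject /render requests whose Referer matches one banned
# grafana dashboard; check_graphite sends no such Referer, so it is unaffected.
RewriteEngine On
RewriteCond %{HTTP_REFERER} grafana\.wikimedia\.org/dashboard/db/misbehaving-dashboard [NC]
RewriteRule ^/render - [F,L]

# The comment above also suggests banning grafana entirely by method, on the
# assumption that grafana issues POST /render while check_graphite does not:
# RewriteCond %{REQUEST_METHOD} =POST
# RewriteRule ^/render - [F,L]
```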
[14:16:07] !log fundraising database maintenance [14:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:20:08] PROBLEM - check_mysql on fdb2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [14:25:08] RECOVERY - check_mysql on fdb2001 is OK: Uptime: 1547281 Threads: 1 Questions: 28524504 Slow queries: 6900 Opens: 34122 Flush tables: 2 Open tables: 63 Queries per second avg: 18.435 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [14:26:54] (03PS3) 10coren: Labs: switch PAM handling to use pam-auth-update [puppet] - 10https://gerrit.wikimedia.org/r/255555 (https://phabricator.wikimedia.org/T85910) [14:29:48] (03PS2) 10Filippo Giunchedi: diamond: send log to stdout at level INFO [puppet] - 10https://gerrit.wikimedia.org/r/255528 [14:30:55] (03CR) 10Filippo Giunchedi: "I've disabled --log-stdout from diamond in diamond 3.5-5 (to be deployed next week)" [puppet] - 10https://gerrit.wikimedia.org/r/255528 (owner: 10Filippo Giunchedi) [14:32:48] RECOVERY - puppet last run on rdb1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:33:58] PROBLEM - puppet last run on mw2191 is CRITICAL: CRITICAL: puppet fail [14:42:36] (03PS3) 10JanZerebecki: Fix wikidata redirect that come in via https to target https [puppet] - 10https://gerrit.wikimedia.org/r/255149 (https://phabricator.wikimedia.org/T119532) [14:43:19] (03CR) 10JanZerebecki: Fix wikidata redirect that come in via https to target https (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/255149 (https://phabricator.wikimedia.org/T119532) (owner: 10JanZerebecki) [14:54:18] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [14:56:28] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 48387 bytes in 2.034 second response time [15:01:58] RECOVERY - puppet last run on mw2191 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:02:48] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:09] !log kill stray cpu hog xelatex on ocg1003, orphan and started on oct22 [15:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:10:35] parsoid's p99 today's really high - https://grafana.wikimedia.org/dashboard/db/restbase?panelId=9&fullscreen [15:12:28] mobrovac: zooming out 30d looks normal to me [15:13:50] yeah godog, but the mean went up to 2.5s today which isn't normal [15:14:02] it's down now, though [15:14:32] hm, no, no, disregard [15:14:52] a mean of 2.5s for parsoid isn't that bad actually when you think about it [15:58:48] 6operations, 7Graphite: 500 errors from graphite shouldn't be retried by varnish - https://phabricator.wikimedia.org/T119721#1835202 (10fgiunchedi) retry sample from latest outage logs ```lines=4 graphite.wikimedia.org:80 10.64.0.107 - - [25/Nov/2015:14:49:00 +0000] "GET /render?format=json&from=-1hour&target... 
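The T119721 sample above shows a graphite /render 500 being re-issued by varnish; the fix discussed just below (Gerrit 255706, "Switch misc cluster to text-like retry5(03|xx) behaviors") is to stop retrying generic 5xx responses. A rough, hypothetical Varnish 3 VCL sketch of that direction, not the actual operations/puppet VCL:

```
# Hypothetical sketch: only a 503 from the backend triggers a single restart;
# an application-level 500 (e.g. graphite's) is delivered to the client
# instead of being re-requested by every cache layer.
sub vcl_fetch {
    if (beresp.status == 503 && req.restarts == 0 && req.request == "GET") {
        return (restart);
    }
}
```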
[15:58:58] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.009 second response time
[15:59:48] bblack: speaking of which ^ :|
[16:02:12] godog: plus I added layers and tiers to misc-cluster the other day, which will make retry5xx's bad behavior multiply worse now
[16:03:14] (03PS1) 10BBlack: Switch misc cluster to text-like retry5(03|xx) behaviors [puppet] - 10https://gerrit.wikimedia.org/r/255706
[16:04:08] godog: there's a thread with wdqs complaints about traffic today too, which is also misc cluster and may have been affected like graphite.
[16:04:21] https://gerrit.wikimedia.org/r/#/c/255706/ is probably the right thing to do here and should be a pretty safe change to merge
[16:04:29] but one of you guys will need to do it, as I really need to pack up and go now
[16:04:58] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1572 bytes in 0.016 second response time
[16:05:00] bblack: kk, thanks for your help!
[16:06:18] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [150.0]
[16:06:57] (03CR) 10Filippo Giunchedi: [C: 031] Switch misc cluster to text-like retry5(03|xx) behaviors [puppet] - 10https://gerrit.wikimedia.org/r/255706 (owner: 10BBlack)
[16:09:44] paravoid: thoughts on https://gerrit.wikimedia.org/r/255706 ? I can merge/babysit, graphite is suffering ATM due to 5xx retry
[16:11:08] busy
[16:16:27] ack, I'll merge and keep an eye on it
[16:16:59] (03PS1) 10Andrew Bogott: Create temporary labs-ns1placeholder entry. [dns] - 10https://gerrit.wikimedia.org/r/255708 (https://phabricator.wikimedia.org/T119762)
[16:17:01] (03PS1) 10Andrew Bogott: Put an ldap-based dns server on labcontrol1002. [puppet] - 10https://gerrit.wikimedia.org/r/255709 (https://phabricator.wikimedia.org/T119762)
[16:17:32] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Switch misc cluster to text-like retry5(03|xx) behaviors [puppet] - 10https://gerrit.wikimedia.org/r/255706 (owner: 10BBlack)
[16:17:58] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0]
[16:18:16] (03PS2) 10Andrew Bogott: Create temporary labs-ns1placeholder entry. [dns] - 10https://gerrit.wikimedia.org/r/255708 (https://phabricator.wikimedia.org/T119762)
[16:20:22] 6operations, 7Graphite: 500 errors from graphite shouldn't be retried by varnish - https://phabricator.wikimedia.org/T119721#1835226 (10fgiunchedi) see also https://gerrit.wikimedia.org/r/#/c/255706/
[16:20:38] (03CR) 10coren: [C: 031] "That's a good name to remind everyone it needs to go away. :-)" [dns] - 10https://gerrit.wikimedia.org/r/255708 (https://phabricator.wikimedia.org/T119762) (owner: 10Andrew Bogott)
[16:21:06] (03PS2) 10Andrew Bogott: Put an ldap-based dns server on labcontrol1002. [puppet] - 10https://gerrit.wikimedia.org/r/255709 (https://phabricator.wikimedia.org/T119762)
[16:22:28] (03CR) 10coren: [C: 031] "Appears to do what it intends." [puppet] - 10https://gerrit.wikimedia.org/r/255709 (https://phabricator.wikimedia.org/T119762) (owner: 10Andrew Bogott)
[16:23:29] (03PS3) 10Andrew Bogott: Put an ldap-based dns server on labcontrol1002. [puppet] - 10https://gerrit.wikimedia.org/r/255709 (https://phabricator.wikimedia.org/T119762)
[16:24:01] (03CR) 10Andrew Bogott: [C: 032] Create temporary labs-ns1placeholder entry. 
[dns] - 10https://gerrit.wikimedia.org/r/255708 (https://phabricator.wikimedia.org/T119762) (owner: 10Andrew Bogott) [16:29:40] (03CR) 10Andrew Bogott: [C: 032] Put an ldap-based dns server on labcontrol1002. [puppet] - 10https://gerrit.wikimedia.org/r/255709 (https://phabricator.wikimedia.org/T119762) (owner: 10Andrew Bogott) [16:33:05] (03PS1) 10Andrew Bogott: Labcontrol1002: consolidate role lines [puppet] - 10https://gerrit.wikimedia.org/r/255710 (https://phabricator.wikimedia.org/T119762) [16:33:49] (03CR) 10coren: [C: 031] "Cosmetic/lint" [puppet] - 10https://gerrit.wikimedia.org/r/255710 (https://phabricator.wikimedia.org/T119762) (owner: 10Andrew Bogott) [16:34:52] (03CR) 10Andrew Bogott: [C: 032] Labcontrol1002: consolidate role lines [puppet] - 10https://gerrit.wikimedia.org/r/255710 (https://phabricator.wikimedia.org/T119762) (owner: 10Andrew Bogott) [16:35:26] yurik: deploying graphoid on a holiday on friday? [16:35:29] not wise, if you ask me [16:35:38] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: puppet fail [16:35:44] mobrovac, only on beta ;) [16:36:01] kk [16:37:37] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:57:57] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: puppet fail [16:59:40] akosiaris, could you do your magic again? graphoid is not syncing to beta cluster [17:02:05] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 48387 bytes in 2.008 second response time [17:07:05] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:08:54] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 48387 bytes in 0.363 second response time [17:26:44] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [17:30:37] (03PS1) 10Andrew Bogott: Make sure dns on labcontrol2001.yaml keeps listening on 208.80.153.15 [puppet] - 10https://gerrit.wikimedia.org/r/255713 (https://phabricator.wikimedia.org/T119762) [17:30:40] godog: I'm guessing http://graphite.wikimedia.org/admin/ cant just be accessed with my ldap details? [17:31:08] (03PS2) 10Andrew Bogott: Make sure dns on labcontrol2001 keeps listening on 208.80.153.15 [puppet] - 10https://gerrit.wikimedia.org/r/255713 (https://phabricator.wikimedia.org/T119762) [17:31:55] addshore: nope, we're not using that [17:34:05] okay, I was looking at storing events, do you have any idea how to /just store them in elastic/? I may be able to come up with a far better solution though (reading from wikipages) [17:34:55] (03CR) 10coren: [C: 031] "Appears to be sane." [puppet] - 10https://gerrit.wikimedia.org/r/255713 (https://phabricator.wikimedia.org/T119762) (owner: 10Andrew Bogott) [17:38:04] addshore: deployment events you see in grafana ATM are in graphite, implemented by writing "1" to a metric when the event occurs, no idea about elasticsearch tho [17:38:15] ahh okay [17:38:38] I might look at creating an annotations plugin to read events from a wikipage, I feel that would more than likely be useful for all [17:39:52] also, I just managed to get a small graphite feature sortByName with a natural parameter backported to 0.9.x, once 0.9.15 is released I'll probably make a ticket asking for an upgrade :D [17:40:35] addshore: plugin for grafana annotations? 
yeah that sounds useful to other people too [17:40:51] also, I made this in light of what happened today https://github.com/grafana/grafana/issues/3356 [17:40:52] sweet, yeah there should be debian packages too after the release [17:41:22] addshore: ooh thanks! I'll watch that issue, yeah ATM that's not possible in grafana it seems [17:41:42] It should be a fairly easy one :) [17:42:14] but the rate of issue creation / feature requests for grafana is quite high :P [17:43:12] hehehe indeed, we'll see! [17:43:21] but yeh, grafana is awesome ;) [17:44:52] 6operations, 7Graphite: Enforce a minimum refresh period for grafana dashboards hitting graphite - https://phabricator.wikimedia.org/T119719#1835307 (10Addshore) See https://github.com/grafana/grafana/issues/3356 [17:45:40] (03CR) 10Andrew Bogott: [C: 032] "This is for sure a no-op until DNS changes." [puppet] - 10https://gerrit.wikimedia.org/r/255713 (https://phabricator.wikimedia.org/T119762) (owner: 10Andrew Bogott) [17:46:57] !log re-enabling cr2-eqiad:xe-5/2/2 (link to cr1-ulsfo) and xe-5/2/3 (link to cr2-codfw) [17:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:47:32] PROBLEM - Host misc-web-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::3:d [17:47:50] ignore the page, ulsfo is depooled [17:48:51] RECOVERY - Host misc-web-lb.ulsfo.wikimedia.org_ipv6 is UP: PING WARNING - Packet loss = 50%, RTA = 78.88 ms [17:50:02] (03PS1) 10Andrew Bogott: Move labs-ns1 to 208.80.154.102 [dns] - 10https://gerrit.wikimedia.org/r/255714 (https://phabricator.wikimedia.org/T119762) [17:50:54] PROBLEM - HTTPS on cp4005 is CRITICAL: Return code of 255 is out of bounds [17:50:54] PROBLEM - salt-minion processes on bast4001 is CRITICAL: Timeout while attempting connection [17:50:55] PROBLEM - Confd template for /etc/varnish/directors.frontend.vcl on cp4005 is CRITICAL: Timeout while attempting connection [17:50:55] PROBLEM - Varnishkafka log producer on cp4013 is CRITICAL: Timeout while attempting connection [17:50:55] PROBLEM - Varnish traffic logger - multicast_relay on cp4020 is CRITICAL: Timeout while attempting connection [17:50:55] PROBLEM - salt-minion processes on cp4018 is CRITICAL: Timeout while attempting connection [17:50:55] PROBLEM - dhclient process on cp4005 is CRITICAL: Timeout while attempting connection [17:50:55] PROBLEM - salt-minion processes on cp4019 is CRITICAL: Timeout while attempting connection [17:50:56] PROBLEM - Confd vcl based reload on cp4018 is CRITICAL: Timeout while attempting connection [17:50:56] PROBLEM - Host lvs4003 is DOWN: PING CRITICAL - Packet loss = 100% [17:50:57] PROBLEM - Host cp4016 is DOWN: PING CRITICAL - Packet loss = 100% [17:50:57] PROBLEM - Host cp4001 is DOWN: PING CRITICAL - Packet loss = 100% [17:50:58] PROBLEM - Host cp4003 is DOWN: PING CRITICAL - Packet loss = 100% [17:50:58] PROBLEM - Host cr1-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [17:51:05] RECOVERY - Host lvs4003 is UP: PING OK - Packet loss = 0%, RTA = 78.67 ms [17:51:24] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 [17:51:24] PROBLEM - IPsec on cp4008 is CRITICAL: Timeout while attempting connection [17:51:24] PROBLEM - IPsec on cp4013 is CRITICAL: Timeout while attempting connection [17:51:24] PROBLEM - Disk space on cp4020 is CRITICAL: Timeout while attempting connection [17:51:24] PROBLEM - puppet last run on cp4005 is CRITICAL: Timeout while attempting connection [17:51:31] PROBLEM - LVS HTTP 
IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: Connection timed out [17:51:31] PROBLEM - Confd template for /etc/varnish/directors.backend.vcl on cp4017 is CRITICAL: Timeout while attempting connection [17:51:32] PROBLEM - Host cp4006 is DOWN: PING CRITICAL - Packet loss = 100% [17:51:32] PROBLEM - Host cp4015 is DOWN: PING CRITICAL - Packet loss = 100% [17:51:34] PROBLEM - Host lvs4001 is DOWN: PING CRITICAL - Packet loss = 100% [17:51:34] PROBLEM - IPsec on cp4020 is CRITICAL: Timeout while attempting connection [17:51:34] PROBLEM - IPsec on cp4014 is CRITICAL: Timeout while attempting connection [17:51:53] (ignore) [17:52:21] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: Connection timed out [17:52:21] PROBLEM - Host cp4007 is DOWN: PING CRITICAL - Packet loss = 100% [17:52:21] PROBLEM - Host cp4004 is DOWN: PING CRITICAL - Packet loss = 100% [17:52:21] PROBLEM - Host cp4008 is DOWN: PING CRITICAL - Packet loss = 100% [17:52:21] PROBLEM - Host cp4019 is DOWN: PING CRITICAL - Packet loss = 100% [17:52:35] RECOVERY - Host lvs4001 is UP: PING OK - Packet loss = 0%, RTA = 79.15 ms [17:52:35] RECOVERY - Host cp4007 is UP: PING OK - Packet loss = 0%, RTA = 78.90 ms [17:52:35] RECOVERY - Host cp4016 is UP: PING OK - Packet loss = 0%, RTA = 79.38 ms [17:52:35] RECOVERY - Host cp4003 is UP: PING OK - Packet loss = 0%, RTA = 79.74 ms [17:52:35] RECOVERY - Host cp4001 is UP: PING OK - Packet loss = 0%, RTA = 79.30 ms [17:52:43] (03CR) 10coren: [C: 031] Move labs-ns1 to 208.80.154.102 [dns] - 10https://gerrit.wikimedia.org/r/255714 (https://phabricator.wikimedia.org/T119762) (owner: 10Andrew Bogott) [17:52:44] RECOVERY - Host cp4006 is UP: PING OK - Packet loss = 0%, RTA = 78.75 ms [17:52:44] RECOVERY - Host cp4004 is UP: PING OK - Packet loss = 0%, RTA = 79.37 ms [17:52:44] RECOVERY - Host cp4008 is UP: PING OK - Packet loss = 0%, RTA = 79.74 ms [17:52:44] RECOVERY - Host cp4015 is UP: PING OK - Packet loss = 0%, RTA = 80.22 ms [17:52:44] RECOVERY - Host cp4019 is UP: PING OK - Packet loss = 0%, RTA = 79.09 ms [17:53:04] PROBLEM - SSH on cp4017 is CRITICAL: Connection timed out [17:53:05] PROBLEM - Host lvs4003 is DOWN: PING CRITICAL - Packet loss = 100% [17:53:32] RECOVERY - LVS HTTP IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 560 bytes in 0.158 second response time [17:53:32] RECOVERY - Disk space on cp4020 is OK: DISK OK [17:53:32] RECOVERY - Confd template for /etc/varnish/directors.backend.vcl on cp4017 is OK: No errors detected [17:53:32] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 34 minutes ago with 0 failures [17:53:40] PROBLEM - Host misc-web-lb.ulsfo.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [17:53:40] RECOVERY - Host lvs4003 is UP: PING OK - Packet loss = 0%, RTA = 78.66 ms [17:53:40] RECOVERY - BGP status on cr2-ulsfo is OK: OK: host 198.35.26.193, sessions up: 45, down: 0, shutdown: 0 [17:54:09] also godog I may look at https://phabricator.wikimedia.org/T116031 for you ;) [17:54:11] PROBLEM - Host cp4018 is DOWN: PING CRITICAL - Packet loss = 100% [17:54:11] PROBLEM - Host bast4001 is DOWN: PING CRITICAL - Packet loss = 100% [17:54:11] PROBLEM - Host cp4007 is DOWN: PING CRITICAL - Packet loss = 100% [17:54:37] PROBLEM - Host mobile-lb.ulsfo.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [17:54:40] PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.194 for 1.3.6.1.2.1.2.2.1.8 with snmp 
version 2 [17:54:58] PROBLEM - Host misc-web-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [17:55:31] RECOVERY - Confd template for /etc/varnish/directors.frontend.vcl on cp4005 is OK: No errors detected [17:55:31] RECOVERY - Varnish traffic logger - multicast_relay on cp4020 is OK: PROCS OK: 1 process with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog) [17:55:31] RECOVERY - Varnishkafka log producer on cp4013 is OK: PROCS OK: 1 process with command name varnishkafka [17:55:31] RECOVERY - salt-minion processes on cp4018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:55:31] RECOVERY - dhclient process on cp4005 is OK: PROCS OK: 0 processes with command name dhclient [17:55:31] RECOVERY - salt-minion processes on cp4019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:55:32] RECOVERY - Confd vcl based reload on cp4018 is OK: reload-vcl successfully ran 99h, 0 minutes ago. [17:55:32] RECOVERY - salt-minion processes on bast4001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:55:40] PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4005_v6, cp4006_v6, cp4007_v6, cp4013_v6, cp4014_v6, cp4015_v6 [17:55:51] PROBLEM - Host cp4011 is DOWN: PING CRITICAL - Packet loss = 100% [17:55:51] PROBLEM - Host cp4002 is DOWN: PING CRITICAL - Packet loss = 100% [17:55:51] PROBLEM - Host cp4016 is DOWN: PING CRITICAL - Packet loss = 100% [17:55:51] PROBLEM - Host cp4015 is DOWN: PING CRITICAL - Packet loss = 100% [17:55:51] PROBLEM - Host cp4005 is DOWN: PING CRITICAL - Packet loss = 100% [17:55:51] PROBLEM - Host cp4003 is DOWN: PING CRITICAL - Packet loss = 100% [17:55:51] PROBLEM - Host cp4010 is DOWN: PING CRITICAL - Packet loss = 100% [17:55:52] PROBLEM - Host cp4006 is DOWN: PING CRITICAL - Packet loss = 100% [17:55:52] PROBLEM - Host cp4001 is DOWN: PING CRITICAL - Packet loss = 100% [17:55:53] PROBLEM - Host cp4020 is DOWN: PING CRITICAL - Packet loss = 100% [17:55:53] PROBLEM - Host cp4014 is DOWN: PING CRITICAL - Packet loss = 100% [17:55:54] PROBLEM - Host cp4009 is DOWN: PING CRITICAL - Packet loss = 100% [17:55:54] PROBLEM - Host cp4019 is DOWN: PING CRITICAL - Packet loss = 100% [17:55:55] PROBLEM - Host lvs4003 is DOWN: PING CRITICAL - Packet loss = 100% [17:56:07] addshore: oh! that'd be real nice, thanks! 
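godog's earlier remark (17:38) is that the deployment events shown in grafana are just graphite metrics with a "1" written at the moment of the event. A minimal, hypothetical sketch of that pattern using carbon's plaintext protocol on port 2003; the hostname and metric path are placeholders, not the production names:

```
# Hypothetical: record an event as a single "1" sample; a grafana annotation
# query over this series then marks the event on dashboards.
ts=$(date +%s)
printf 'events.deploy.example 1 %s\n' "$ts" | nc -w 1 graphite.example.org 2003
```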
I'm looking at the diamond side and possibly done by tues I hope [17:56:11] RECOVERY - Host cp4014 is UP: PING WARNING - Packet loss = 80%, RTA = 81.87 ms [17:56:11] RECOVERY - Host cp4016 is UP: PING WARNING - Packet loss = 80%, RTA = 83.18 ms [17:56:19] (03PS2) 10Andrew Bogott: Move labs-ns1 to 208.80.154.102 [dns] - 10https://gerrit.wikimedia.org/r/255714 (https://phabricator.wikimedia.org/T119762) [17:56:20] addshore: https://phabricator.wikimedia.org/T116033 that is [17:56:20] PROBLEM - configured eth on lvs4002 is CRITICAL: Timeout while attempting connection [17:56:20] PROBLEM - Varnish traffic logger - multicast_relay on cp4013 is CRITICAL: Timeout while attempting connection [17:56:20] PROBLEM - Varnish HTCP daemon on cp4014 is CRITICAL: Timeout while attempting connection [17:56:20] PROBLEM - DPKG on cp4013 is CRITICAL: Timeout while attempting connection [17:56:20] PROBLEM - Varnish traffic logger - multicast_relay on cp4014 is CRITICAL: Timeout while attempting connection [17:56:20] PROBLEM - DPKG on cp4014 is CRITICAL: Timeout while attempting connection [17:56:21] PROBLEM - DPKG on cp4017 is CRITICAL: Timeout while attempting connection [17:56:21] PROBLEM - dhclient process on cp4013 is CRITICAL: Timeout while attempting connection [17:57:36] godog: oooh :) I just had a quick skim down the graphite list :) Also graphite & grafana is turning out to work brilliantly for my use case, and I am very glad it wasnt me causing the mass 500s ;) [18:00:24] also having some thoughts on caching of graphite data ;) would save graphite some load in some cases, particularly for some cases (such as my daily data) [18:05:17] PROBLEM - Host upload-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::2:b [18:05:17] PROBLEM - Host lvs4001 is DOWN: PING CRITICAL - Packet loss = 100% [18:19:41] PROBLEM - Host mr1-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [18:19:41] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 54 connecting: (unnamed) not-conn: cp4005_v4, cp4006_v4, cp4007_v4, cp4013_v4, cp4014_v4, cp4015_v4 [18:19:41] PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 20 connecting: (unnamed) not-conn: cp4011_v4, cp4012_v4, cp4019_v4, cp4020_v4 [18:19:41] RECOVERY - Host cp4011 is UP: PING WARNING - Packet loss = 64%, RTA = 79.14 ms [18:19:41] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 60 ESP OK [18:19:41] RECOVERY - IPsec on cp1060 is OK: Strongswan OK - 24 ESP OK [18:19:41] RECOVERY - IPsec on cp1057 is OK: Strongswan OK - 24 ESP OK [18:19:41] RECOVERY - IPsec on cp1055 is OK: Strongswan OK - 58 ESP OK [18:19:41] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 58 ESP OK [18:19:41] RECOVERY - Host cp4003 is UP: PING WARNING - Packet loss = 80%, RTA = 78.69 ms [18:19:41] RECOVERY - Host cp4015 is UP: PING WARNING - Packet loss = 80%, RTA = 78.69 ms [18:19:41] RECOVERY - Host cp4006 is UP: PING WARNING - Packet loss = 80%, RTA = 78.71 ms [18:19:41] RECOVERY - Host bast4001 is UP: PING WARNING - Packet loss = 80%, RTA = 80.23 ms [18:19:41] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 [18:19:41] PROBLEM - IPsec on cp1046 is CRITICAL: Strongswan CRITICAL - ok: 20 not-conn: cp4011_v6, cp4012_v6, cp4019_v6, cp4020_v6 [18:19:41] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4005_v6, cp4006_v6, cp4007_v6, cp4013_v6, cp4014_v6, cp4015_v6 [18:19:41] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4005_v6, 
cp4006_v6, cp4007_v6, cp4013_v6, cp4014_v6, cp4015_v6 [18:19:41] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4005_v6, cp4006_v6, cp4007_v6, cp4013_v6, cp4014_v6, cp4015_v6 [18:19:41] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4005_v6, cp4006_v6, cp4007_v6, cp4013_v6, cp4014_v6, cp4015_v6 [18:19:41] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4005_v6, cp4006_v6, cp4007_v6, cp4013_v6, cp4014_v6, cp4015_v6 [18:19:41] PROBLEM - IPsec on cp1056 is CRITICAL: Strongswan CRITICAL - ok: 20 not-conn: cp4001_v6, cp4002_v6, cp4003_v6, cp4004_v6 [18:19:41] PROBLEM - IPsec on cp1052 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp4008_v6, cp4009_v6, cp4010_v6, cp4016_v6, cp4017_v6, cp4018_v6 [18:19:41] PROBLEM - Host cp4009 is DOWN: PING CRITICAL - Packet loss = 100% [18:19:41] PROBLEM - Host cp4012 is DOWN: PING CRITICAL - Packet loss = 100% [18:19:41] PROBLEM - Host upload-lb.ulsfo.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [18:19:42] PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp4008_v6, cp4009_v6, cp4010_v6, cp4016_v6, cp4017_v6, cp4018_v6 [18:19:42] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4005_v6, cp4006_v6, cp4007_v6, cp4013_v6, cp4014_v6, cp4015_v6 [18:19:42] RECOVERY - Host cp4007 is UP: PING OK - Packet loss = 0%, RTA = 79.09 ms [18:19:42] RECOVERY - Host lvs4004 is UP: PING OK - Packet loss = 0%, RTA = 78.82 ms [18:19:42] RECOVERY - Host cp4017 is UP: PING OK - Packet loss = 0%, RTA = 79.31 ms [18:20:14] (03CR) 10Andrew Bogott: [C: 032] Move labs-ns1 to 208.80.154.102 [dns] - 10https://gerrit.wikimedia.org/r/255714 (https://phabricator.wikimedia.org/T119762) (owner: 10Andrew Bogott) [18:20:39] PROBLEM - Varnishkafka log producer on cp4010 is CRITICAL: Timeout while attempting connection [18:20:39] PROBLEM - salt-minion processes on cp4018 is CRITICAL: Timeout while attempting connection [18:20:39] PROBLEM - Confd template for /etc/varnish/directors.frontend.vcl on cp4010 is CRITICAL: Timeout while attempting connection [18:20:39] PROBLEM - Confd vcl based reload on cp4018 is CRITICAL: Timeout while attempting connection [18:20:48] PROBLEM - Host cp4007 is DOWN: PING CRITICAL - Packet loss = 100% [18:20:48] PROBLEM - Host cp4004 is DOWN: PING CRITICAL - Packet loss = 100% [18:20:58] PROBLEM - puppet last run on cp4010 is CRITICAL: Timeout while attempting connection [18:20:58] RECOVERY - Host cp4020 is UP: PING WARNING - Packet loss = 80%, RTA = 79.09 ms [18:20:58] RECOVERY - Host cp4003 is UP: PING WARNING - Packet loss = 61%, RTA = 79.32 ms [18:20:59] RECOVERY - SSH on lvs4002 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [18:20:59] RECOVERY - Host lvs4003 is UP: PING WARNING - Packet loss = 54%, RTA = 78.74 ms [18:20:59] RECOVERY - Host cp4016 is UP: PING WARNING - Packet loss = 54%, RTA = 79.36 ms [18:20:59] RECOVERY - Host cp4005 is UP: PING WARNING - Packet loss = 54%, RTA = 79.41 ms [18:21:06] RECOVERY - Host text-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 79.23 ms [18:21:06] RECOVERY - Host cp4011 is UP: PING OK - Packet loss = 0%, RTA = 79.61 ms [18:21:20] PROBLEM - dhclient process on cp4010 is CRITICAL: Timeout while attempting connection [18:21:20] PROBLEM - Varnish traffic logger - multicast_relay on cp4006 is CRITICAL: Timeout while attempting connection [18:21:20] PROBLEM - Varnishkafka log producer on cp4018 is CRITICAL: Timeout while attempting 
connection [18:21:28] PROBLEM - Host cp4012 is DOWN: PING CRITICAL - Packet loss = 100% [18:21:28] PROBLEM - Host cp4001 is DOWN: PING CRITICAL - Packet loss = 100% [18:21:28] PROBLEM - Host cp4019 is DOWN: PING CRITICAL - Packet loss = 100% [18:21:28] PROBLEM - Host lvs4004 is DOWN: PING CRITICAL - Packet loss = 100% [18:21:28] PROBLEM - Host bast4001 is DOWN: PING CRITICAL - Packet loss = 100% [18:21:38] PROBLEM - Host lvs4001 is DOWN: PING CRITICAL - Packet loss = 100% [18:21:47] RECOVERY - LVS HTTP IPv4 on mobile-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 498 bytes in 0.159 second response time [18:21:47] RECOVERY - Varnish HTTP misc-frontend - port 80 on cp4004 is OK: HTTP OK: HTTP/1.1 200 OK - 320 bytes in 3.169 second response time [18:21:47] RECOVERY - DPKG on cp4007 is OK: All packages OK [18:21:47] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp4013 is OK: HTTP OK: HTTP/1.1 200 OK - 490 bytes in 3.160 second response time [18:21:47] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp4006 is OK: HTTP OK: HTTP/1.1 200 OK - 489 bytes in 3.161 second response time [18:21:47] RECOVERY - RAID on cp4004 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [18:21:48] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 48 minutes ago with 0 failures [18:21:48] RECOVERY - RAID on cp4007 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [18:21:49] RECOVERY - Host cp4007 is UP: PING WARNING - Packet loss = 93%, RTA = 80.77 ms [18:21:49] RECOVERY - Host cp4004 is UP: PING WARNING - Packet loss = 93%, RTA = 81.10 ms [18:21:49] RECOVERY - Host cp4008 is UP: PING WARNING - Packet loss = 86%, RTA = 84.65 ms [18:21:50] RECOVERY - Host cp4019 is UP: PING WARNING - Packet loss = 93%, RTA = 79.67 ms [18:22:08] PROBLEM - Host cp4014 is DOWN: PING CRITICAL - Packet loss = 100% [18:22:29] PROBLEM - Host cp4018 is DOWN: PING CRITICAL - Packet loss = 100% [18:22:29] PROBLEM - Host cp4010 is DOWN: PING CRITICAL - Packet loss = 100% [18:22:39] PROBLEM - SSH on cp4017 is CRITICAL: Connection timed out [18:22:39] PROBLEM - HTTPS on cp4006 is CRITICAL: Return code of 255 is out of bounds [18:22:49] PROBLEM - Varnishkafka log producer on cp4006 is CRITICAL: Timeout while attempting connection [18:22:59] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp4006 is CRITICAL: Connection timed out [18:22:59] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 48 not-conn: cp4005_v4, cp4005_v6, cp4006_v4, cp4006_v6, cp4007_v4, cp4007_v6, cp4013_v4, cp4013_v6, cp4014_v4, cp4014_v6, cp4015_v4, cp4015_v6 [18:22:59] PROBLEM - IPsec on cp1060 is CRITICAL: Strongswan CRITICAL - ok: 16 not-conn: cp4011_v4, cp4011_v6, cp4012_v4, cp4012_v6, cp4019_v4, cp4019_v6, cp4020_v4, cp4020_v6 [18:23:08] PROBLEM - Varnish HTCP daemon on cp4017 is CRITICAL: Timeout while attempting connection [18:23:08] PROBLEM - salt-minion processes on cp4006 is CRITICAL: Timeout while attempting connection [18:23:08] PROBLEM - IPsec on cp1057 is CRITICAL: Strongswan CRITICAL - ok: 16 not-conn: cp4001_v4, cp4001_v6, cp4002_v4, cp4002_v6, cp4003_v4, cp4003_v6, cp4004_v4, cp4004_v6 [18:23:08] PROBLEM - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp4008_v4, cp4008_v6, cp4009_v4, cp4009_v6, cp4010_v4, cp4010_v6, cp4016_v4, cp4016_v6, cp4017_v4, cp4017_v6, cp4018_v4, cp4018_v6 [18:23:08] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp4008_v4, cp4008_v6, cp4009_v4, cp4009_v6, cp4010_v4, cp4010_v6, cp4016_v4, 
cp4016_v6, cp4017_v4, cp4017_v6, cp4018_v4, cp4018_v6
[18:23:09] PROBLEM - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp4008_v4, cp4008_v6, cp4009_v4, cp4009_v6, cp4010_v4, cp4010_v6, cp4016_v4, cp4016_v6, cp4017_v4, cp4017_v6, cp4018_v4, cp4018_v6
[18:23:20] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 48 not-conn: cp4005_v4, cp4005_v6, cp4006_v4, cp4006_v6, cp4007_v4, cp4007_v6, cp4013_v4, cp4013_v6, cp4014_v4, cp4014_v6, cp4015_v4, cp4015_v6
[18:23:28] PROBLEM - Varnishkafka log producer on cp4017 is CRITICAL: Timeout while attempting connection
[18:23:28] PROBLEM - Varnish traffic logger - multicast_relay on cp4017 is CRITICAL: Timeout while attempting connection
[18:23:28] PROBLEM - Confd vcl based reload on cp4002 is CRITICAL: Timeout while attempting connection
[18:23:28] PROBLEM - Freshness of OCSP Stapling files on cp4002 is CRITICAL: Timeout while attempting connection
[18:23:28] PROBLEM - IPsec on cp4017 is CRITICAL: Timeout while attempting connection
[18:23:28] PROBLEM - Freshness of OCSP Stapling files on cp4017 is CRITICAL: Timeout while attempting connection
[18:23:28] PROBLEM - DPKG on cp4017 is CRITICAL: Timeout while attempting connection
[18:23:29] PROBLEM - RAID on cp4002 is CRITICAL: Timeout while attempting connection
[18:23:29] PROBLEM - RAID on cp4017 is CRITICAL: Timeout while attempting connection
[18:23:30] PROBLEM - IPsec on cp1069 is CRITICAL: Strongswan CRITICAL - ok: 16 not-conn: cp4001_v4, cp4001_v6, cp4002_v4, cp4002_v6, cp4003_v4, cp4003_v6, cp4004_v4, cp4004_v6
[18:23:30] PROBLEM - confd service on cp4017 is CRITICAL: Timeout while attempting connection
[18:23:38] PROBLEM - Confd template for /etc/varnish/directors.frontend.vcl on cp4017 is CRITICAL: Timeout while attempting connection
[18:23:38] PROBLEM - Varnishkafka log producer on cp4002 is CRITICAL: Timeout while attempting connection
[18:23:38] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: puppet fail
[18:23:45] RECOVERY - LVS HTTPS IPv4 on mobile-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 10192 bytes in 4.834 second response time
[18:23:46] RECOVERY - Host cp4015 is UP: PING OK - Packet loss = 16%, RTA = 80.97 ms
[18:23:47] RECOVERY - Host cp4014 is UP: PING OK - Packet loss = 0%, RTA = 79.30 ms
[18:23:48] RECOVERY - HTTPS on cp4018 is OK: SSLXNN OK - 36 OK
[18:23:48] RECOVERY - Host bast4001 is UP: PING OK - Packet loss = 0%, RTA = 78.70 ms
[18:24:06] RECOVERY - LVS HTTP IPv4 on upload-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 577 bytes in 0.160 second response time
[18:24:06] RECOVERY - Host lvs4004 is UP: PING OK - Packet loss = 0%, RTA = 79.00 ms
[18:24:06] RECOVERY - Host cp4001 is UP: PING OK - Packet loss = 0%, RTA = 79.35 ms
[18:24:06] RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 26 ESP OK
[18:24:06] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 43 minutes ago with 0 failures
[18:24:06] RECOVERY - Host cp4018 is UP: PING OK - Packet loss = 0%, RTA = 80.13 ms
[18:24:06] RECOVERY - Host lvs4001 is UP: PING OK - Packet loss = 0%, RTA = 78.88 ms
[18:24:07] RECOVERY - IPsec on cp1056 is OK: Strongswan OK - 24 ESP OK
[18:24:14] RECOVERY - Host upload-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 79.84 ms
[18:24:14] RECOVERY - IPsec on cp1052 is OK: Strongswan OK - 58 ESP OK
[18:24:21] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15077 bytes in 8.633 second response time
[18:24:22] RECOVERY - Host cp4009 is UP: PING WARNING - Packet loss = 73%, RTA = 78.90 ms
[18:24:22] RECOVERY - Host cp4012 is UP: PING WARNING - Packet loss = 73%, RTA = 79.54 ms
[18:24:22] RECOVERY - Host cp4010 is UP: PING WARNING - Packet loss = 80%, RTA = 78.77 ms
[18:24:29] RECOVERY - Host mobile-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 78.71 ms
[18:24:38] RECOVERY - Host text-lb.ulsfo.wikimedia.org_ipv6 is UP: PING WARNING - Packet loss = 1%, RTA = 78.74 ms
[18:25:57] PROBLEM - puppet last run on cp4005 is CRITICAL: Timeout while attempting connection
[18:25:58] PROBLEM - Host cp4020 is DOWN: PING CRITICAL - Packet loss = 100%
[18:26:07] PROBLEM - salt-minion processes on cp4016 is CRITICAL: Timeout while attempting connection
[18:26:07] PROBLEM - Varnish HTCP daemon on cp4016 is CRITICAL: Timeout while attempting connection
[18:26:07] PROBLEM - Confd template for /etc/varnish/directors.frontend.vcl on cp4016 is CRITICAL: Timeout while attempting connection
[18:26:07] PROBLEM - dhclient process on cp4011 is CRITICAL: Timeout while attempting connection
[18:26:07] PROBLEM - IPsec on cp4011 is CRITICAL: Timeout while attempting connection
[18:26:07] PROBLEM - Varnishkafka log producer on cp4016 is CRITICAL: Timeout while attempting connection
[18:26:08] PROBLEM - Varnish HTCP daemon on cp4011 is CRITICAL: Timeout while attempting connection
[18:26:08] PROBLEM - dhclient process on lvs4002 is CRITICAL: Timeout while attempting connection
[18:26:09] PROBLEM - configured eth on lvs4002 is CRITICAL: Timeout while attempting connection
[18:26:09] PROBLEM - dhclient process on lvs4003 is CRITICAL: Timeout while attempting connection
[18:26:10] PROBLEM - salt-minion processes on lvs4003 is CRITICAL: Timeout while attempting connection
[18:26:10] PROBLEM - Varnish traffic logger - multicast_relay on cp4016 is CRITICAL: Timeout while attempting connection
[18:26:11] PROBLEM - configured eth on lvs4003 is CRITICAL: Timeout while attempting connection
[18:26:11] PROBLEM - Confd template for /etc/varnish/directors.backend.vcl on cp4011 is CRITICAL: Timeout while attempting connection
[18:26:28] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4005_v4, cp4006_v4, cp4007_v4, cp4013_v4, cp4014_v4, cp4015_v4
[18:26:29] PROBLEM - Host cp4015 is DOWN: PING CRITICAL - Packet loss = 100%
[18:26:37] PROBLEM - Host text-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[18:26:38] PROBLEM - dhclient process on cp4017 is CRITICAL: Timeout while attempting connection
[18:26:48] RECOVERY - Host cr1-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 79.16 ms
[18:26:48] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0
[18:26:58] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 58 ESP OK
[18:26:58] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: puppet fail
[18:26:58] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 60 ESP OK
[18:26:58] RECOVERY - Host mr1-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 79.43 ms
[18:27:07] RECOVERY - Host cp4020 is UP: PING OK - Packet loss = 0%, RTA = 79.23 ms
[18:27:08] RECOVERY - Host cp4015 is UP: PING OK - Packet loss = 0%, RTA = 79.02 ms
[18:27:08] RECOVERY - IPsec on cp1059 is OK: Strongswan OK - 24 ESP OK
[18:27:08] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 60 ESP OK
[18:27:08] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK
[18:27:17] RECOVERY - Confd template for /etc/varnish/directors.backend.vcl on cp4018 is OK: No errors detected
[18:27:18] RECOVERY - SSH on cp4017 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[18:27:28] RECOVERY - IPsec on cp1070 is OK: Strongswan OK - 24 ESP OK
[18:27:28] RECOVERY - HTTPS on cp4010 is OK: SSLXNN OK - 36 OK
[18:27:28] RECOVERY - HTTPS on cp4006 is OK: SSLXNN OK - 36 OK
[18:27:28] RECOVERY - IPsec on cp1066 is OK: Strongswan OK - 58 ESP OK
[18:27:28] RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 60 ESP OK
[18:27:28] RECOVERY - Varnishkafka log producer on cp4006 is OK: PROCS OK: 1 process with command name varnishkafka
[18:27:29] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 60 ESP OK
[18:27:29] RECOVERY - Confd template for /etc/varnish/directors.frontend.vcl on cp4010 is OK: No errors detected
[18:27:30] RECOVERY - salt-minion processes on cp4018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[18:27:30] RECOVERY - Varnishkafka log producer on cp4010 is OK: PROCS OK: 3 processes with command name varnishkafka
[18:27:30] RECOVERY - Confd vcl based reload on cp4018 is OK: reload-vcl successfully ran 99h, 32 minutes ago.
[18:27:37] RECOVERY - IPsec on cp1047 is OK: Strongswan OK - 24 ESP OK
[18:27:38] RECOVERY - IPsec on cp1053 is OK: Strongswan OK - 58 ESP OK
[18:27:38] RECOVERY - Varnish HTTP upload-backend - port 3128 on cp4006 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.158 second response time
[18:27:47] RECOVERY - Varnish HTCP daemon on cp4017 is OK: PROCS OK: 1 process with UID = 114 (vhtcpd), args vhtcpd
[18:27:47] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 60 ESP OK
[18:27:47] RECOVERY - salt-minion processes on cp4006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[18:27:47] RECOVERY - IPsec on cp1060 is OK: Strongswan OK - 24 ESP OK
[18:27:48] RECOVERY - IPsec on cp1057 is OK: Strongswan OK - 24 ESP OK
[18:27:49] RECOVERY - IPsec on cp1055 is OK: Strongswan OK - 58 ESP OK
[18:27:49] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 58 ESP OK
[18:27:49] RECOVERY - IPsec on cp4008 is OK: Strongswan OK - 16 ESP OK
[18:27:49] RECOVERY - IPsec on cp4013 is OK: Strongswan OK - 26 ESP OK
[18:27:58] RECOVERY - IPsec on cp1054 is OK: Strongswan OK - 58 ESP OK
[18:28:06] RECOVERY - Host text-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 78.97 ms
[18:28:07] RECOVERY - BGP status on cr2-ulsfo is OK: OK: host 198.35.26.193, sessions up: 45, down: 0, shutdown: 0
[18:28:17] RECOVERY - salt-minion processes on cp4016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[18:28:17] RECOVERY - Varnish HTCP daemon on cp4016 is OK: PROCS OK: 1 process with UID = 114 (vhtcpd), args vhtcpd
[18:28:17] RECOVERY - Varnish traffic logger - multicast_relay on cp4017 is OK: PROCS OK: 1 process with args varnishncsa-multicast_relay.pid, UID = 111 (varnishlog)
[18:28:17] RECOVERY - Confd template for /etc/varnish/directors.frontend.vcl on cp4016 is OK: No errors detected
[18:28:17] RECOVERY - IPsec on cp4019 is OK: Strongswan OK - 8 ESP OK
[18:28:17] RECOVERY - IPsec on cp4011 is OK: Strongswan OK - 8 ESP OK
[18:28:17] RECOVERY - Varnishkafka log producer on cp4016 is OK: PROCS OK: 3 processes with command name varnishkafka
[18:28:33] o.O
[18:28:50] lot's of solved problems ;)
[18:29:36] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: puppet fail
[18:30:15] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: puppet fail
[18:30:24] RECOVERY - Host backup4001 is UP: PING OK - Packet loss = 0%, RTA = 79.50 ms
[18:30:44] not actual problems, don't worry
[18:30:54] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: puppet fail
[18:30:55] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: puppet fail
[18:30:55] PROBLEM - puppet last run on lvs4001 is CRITICAL: CRITICAL: Puppet has 1 failures
[18:31:54] RECOVERY - Host ripe-atlas-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 79.72 ms
[18:33:05] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: puppet fail
[18:33:05] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: puppet fail
[18:33:14] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: puppet fail
[18:33:34] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: puppet fail
[18:34:56] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: puppet fail
[18:35:14] RECOVERY - puppet last run on lvs4004 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[18:35:25] RECOVERY - Mobile HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0]
[18:37:22] RECOVERY - Host misc-web-lb.ulsfo.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 80.53 ms
[18:37:53] RECOVERY - Host misc-web-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 78.75 ms
[18:39:15] RECOVERY - HTTP 5xx reqs/min -https://grafana.wikimedia.org/dashboard/db/varnish-http-errors- on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:39:16] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: puppet fail
[18:42:25] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0]
[18:43:05] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0]
[18:46:45] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[18:46:46] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:48:45] RECOVERY - puppet last run on lvs4001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[18:49:04] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:49:04] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:49:09] I’m going to upgrade to El Capitan. If I don’t return, tell my story.
[18:49:25] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:50:24] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:52:35] RECOVERY - puppet last run on lvs4002 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[18:52:45] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:55:26] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[18:57:55] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:58:35] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[18:58:44] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:58:44] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[19:02:44] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[19:03:04] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:04:35] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[19:04:56] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[19:06:35] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[19:06:35] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:08:15] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:08:26] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:09:35] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:16:10] godog: putting a patch up for the mediawiki thing now ;)
[19:22:02] https://gerrit.wikimedia.org/r/#/c/255720/
[19:52:57] 6operations, 10MediaWiki-General-or-Unknown, 7Graphite, 5Patch-For-Review: mediawiki should send statsd metrics in batches - https://phabricator.wikimedia.org/T116031#1835437 (10Legoktm)
[20:30:34] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:31:35] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[20:50:12] 6operations, 6Commons: image magick striping colour profile of PNG files [probably regression] - https://phabricator.wikimedia.org/T113123#1835500 (10Krenair)
[21:07:21] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, 10Traffic: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1835539 (10RP88) Bug is still being reported. Latest incident: https://commons.wikimedia.org/wiki/Commons:Help_desk#Deletion_of...
[21:23:07] !log krenair@tin Synchronized wmf-config/throttle.php: https://gerrit.wikimedia.org/r/#/c/255726/1 (duration: 00m 31s)
[21:23:10] gerrit bot having issues?
[21:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:25:32] Krenair: thanks
[21:25:39] np
[21:27:23] 6operations, 6WMDE-Analytics-Engineering, 10Wikidata, 5Patch-For-Review: Push dumps.wm.o logs files to stat1002 - https://phabricator.wikimedia.org/T118739#1835575 (10ArielGlenn) since no one from analytics noticed (silence = consent) I'll go ahead and do this the way described above.
[21:29:14] 6operations, 10netops, 7Monitoring: Juniper monitoring - https://phabricator.wikimedia.org/T83992#1835586 (10faidon)
[21:29:15] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: puppet fail
[21:56:15] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:27:51] yuvipanda, aware of the gerrit bot issues?
[22:28:05] vaguely, Krenair
[22:28:44] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: puppet fail
[22:57:44] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures