[00:04:35] RECOVERY - Check systemd state on kubernetes2004 is OK: OK - running: The system is fully operational [00:07:35] PROBLEM - Check systemd state on kubernetes2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [01:39:35] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 27 probes of 283 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [01:44:35] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 11 probes of 283 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [02:22:08] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.1) (duration: 07m 53s) [02:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:28:08] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri May 19 02:28:08 UTC 2017 (duration 6m 0s) [02:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:09:05] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1119.10 Read Requests/Sec=6287.00 Write Requests/Sec=0.70 KBytes Read/Sec=29477.20 KBytes_Written/Sec=14.40 [04:18:05] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=2.40 Read Requests/Sec=222.80 Write Requests/Sec=114.50 KBytes Read/Sec=3190.00 KBytes_Written/Sec=795.20 [05:04:25] PROBLEM - SSH on ms-be1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:04:35] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:04:35] PROBLEM - MD RAID on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:04:45] PROBLEM - configured eth on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:04:45] PROBLEM - puppet last run on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[05:05:15] RECOVERY - SSH on ms-be1019 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [05:05:25] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1019 is OK: OK ferm input default policy is set [05:05:26] RECOVERY - MD RAID on ms-be1019 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [05:05:35] RECOVERY - configured eth on ms-be1019 is OK: OK - interfaces up [05:05:35] RECOVERY - puppet last run on ms-be1019 is OK: OK: Puppet is currently enabled, last run 6 minutes ago with 0 failures [05:13:05] PROBLEM - Apache HTTP on mw1195 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time [05:14:05] RECOVERY - Apache HTTP on mw1195 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.178 second response time [05:56:34] !log shutting down db2049 and preparing it for reimage [05:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:40] (03PS1) 10Marostegui: db-eqiad.php: Repool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354381 (https://phabricator.wikimedia.org/T162611) [06:05:29] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354381 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [06:06:37] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354381 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [06:06:46] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354381 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [06:07:39] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1060 - T162611 (duration: 00m 40s) [06:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:47] T162611: Unify revision table on s2 - 
https://phabricator.wikimedia.org/T162611 [06:08:32] (03PS1) 10Jcrespo: mariadb: allow reimage of db1049 for jessie upgrade [puppet] - 10https://gerrit.wikimedia.org/r/354382 [06:08:59] (03CR) 10Marostegui: "db1049 or db2049? :-)" [puppet] - 10https://gerrit.wikimedia.org/r/354382 (owner: 10Jcrespo) [06:09:44] !log Deploy alter table s2.revision table - db1018 - https://phabricator.wikimedia.org/T162611 [06:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:36] (03PS2) 10Jcrespo: mariadb: allow reimage of db2049 for jessie upgrade [puppet] - 10https://gerrit.wikimedia.org/r/354382 [06:10:52] (03CR) 10Marostegui: [C: 031] mariadb: allow reimage of db2049 for jessie upgrade [puppet] - 10https://gerrit.wikimedia.org/r/354382 (owner: 10Jcrespo) [06:11:22] (03CR) 10Jcrespo: "This was totally a test to see if you were paying attention- because of course I never make mistakes." [puppet] - 10https://gerrit.wikimedia.org/r/354382 (owner: 10Jcrespo) [06:11:33] (03CR) 10Jcrespo: ":-)" [puppet] - 10https://gerrit.wikimedia.org/r/354382 (owner: 10Jcrespo) [06:11:57] (03CR) 10Marostegui: [C: 031] "hahahaha" [puppet] - 10https://gerrit.wikimedia.org/r/354382 (owner: 10Jcrespo) [06:12:00] (03CR) 10Jcrespo: [C: 032] mariadb: allow reimage of db2049 for jessie upgrade [puppet] - 10https://gerrit.wikimedia.org/r/354382 (owner: 10Jcrespo) [06:14:01] (03PS1) 10Marostegui: db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354383 (https://phabricator.wikimedia.org/T159753) [06:15:42] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354383 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui) [06:16:42] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1051 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354383 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui) [06:16:50] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1051 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/354383 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui) [06:23:33] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1051 - T159753 T164530 (duration: 00m 39s) [06:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:43] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753 [06:23:43] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530 [06:29:15] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [06:31:15] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1051" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354384 [06:32:15] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:32:50] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1051" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354384 (owner: 10Marostegui) [06:33:50] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1051" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354384 (owner: 10Marostegui) [06:34:35] RECOVERY - Check systemd state on kubernetes2004 is OK: OK - running: The system is fully operational [06:34:42] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1051 - T159753 T164530 (duration: 00m 38s) [06:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:51] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753 [06:34:51] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530 [06:35:43] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1051" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354384 (owner: 
10Marostegui) [06:37:35] PROBLEM - Check systemd state on kubernetes2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:51:29] !log installing openjdk-7/trusty regression update [06:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:05] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:57:55] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1020 is OK: OK ferm input default policy is set [06:59:45] PROBLEM - Check Varnish expiry mailbox lag on cp4015 is CRITICAL: CRITICAL: expiry mailbox lag is 2008139 [07:16:45] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:21:15] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [07:21:15] RECOVERY - Check systemd state on kubernetes2001 is OK: OK - running: The system is fully operational [07:24:15] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:24:15] PROBLEM - Check systemd state on kubernetes2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[07:29:15] RECOVERY - Check systemd state on kubernetes2001 is OK: OK - running: The system is fully operational [07:30:35] RECOVERY - puppet last run on kubernetes2001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [07:35:15] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [07:35:35] RECOVERY - Check systemd state on kubernetes2004 is OK: OK - running: The system is fully operational [07:35:45] RECOVERY - Check systemd state on kubernetes2002 is OK: OK - running: The system is fully operational [07:36:08] !log reboot kubernetes2001 for tests [07:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:35] PROBLEM - salt-minion processes on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:36:36] PROBLEM - swift-account-server on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:36:36] PROBLEM - swift-container-server on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:36:36] RECOVERY - puppet last run on kubernetes2004 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [07:36:36] RECOVERY - puppet last run on kubernetes2003 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [07:36:36] RECOVERY - puppet last run on kubernetes2002 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [07:37:25] RECOVERY - swift-container-server on ms-be1020 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [07:37:25] RECOVERY - salt-minion processes on ms-be1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [07:37:26] RECOVERY - swift-account-server on ms-be1020 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [07:37:35] PROBLEM - Host kubernetes2001 is DOWN: PING CRITICAL - Packet loss = 100% [07:37:55] RECOVERY - Host kubernetes2001 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [07:43:59] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: kubernetes200[1-4] racking and onsite setup task - https://phabricator.wikimedia.org/T164851#3274706 (10akosiaris) And hosts are now fully up and running, will resolve this. 
Thanks @Papaul [07:44:14] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: kubernetes200[1-4] racking and onsite setup task - https://phabricator.wikimedia.org/T164851#3274707 (10akosiaris) 05Open>03Resolved [07:45:45] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [07:49:45] RECOVERY - Check Varnish expiry mailbox lag on cp4015 is OK: OK: expiry mailbox lag is 2 [08:06:23] (03PS1) 10Jcrespo: mariadb: Depool db2048 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354386 [08:07:31] (03CR) 10Marostegui: [C: 031] mariadb: Depool db2048 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354386 (owner: 10Jcrespo) [08:08:40] 06Operations, 07HHVM: Nutcracker doesn't start at boot - https://phabricator.wikimedia.org/T163795#3274759 (10MoritzMuehlenhoff) I dug a little deeper and this turned out to be a subtle packaging bug / debhelper oddity: The current debian/rules file uses: ``` dh $@ --with autoreconf --with-systemd ``` Looks... [08:10:23] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2048 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354386 (owner: 10Jcrespo) [08:11:32] (03Merged) 10jenkins-bot: mariadb: Depool db2048 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354386 (owner: 10Jcrespo) [08:11:43] (03CR) 10jenkins-bot: mariadb: Depool db2048 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354386 (owner: 10Jcrespo) [08:14:31] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2048 for reimage (duration: 00m 39s) [08:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:05] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:38:25] PROBLEM - SSH on ms-be1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:38:55] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1020 is OK: OK ferm input default policy is set [08:39:04] (03PS1) 10Jcrespo: mariadb: allow reimage of db2048 for upgrade to jessie [puppet] - 10https://gerrit.wikimedia.org/r/354388 [08:39:15] RECOVERY - SSH on ms-be1020 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [08:47:23] (03PS1) 10Jforrester: Enable TimedMediaHandler's new video player Beta Feature in Labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354389 (https://phabricator.wikimedia.org/T148103) [08:47:25] (03PS1) 10Jforrester: Enable TimedMediaHandler's new video player Beta Feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354390 (https://phabricator.wikimedia.org/T148103) [08:58:57] (03CR) 10Jcrespo: [C: 04-1] "Not yet- issues on db2049 reimage :-/" [puppet] - 10https://gerrit.wikimedia.org/r/354388 (owner: 10Jcrespo) [09:01:23] !log reedy@tin Synchronized php-1.30.0-wmf.1/extensions/WikimediaMaintenance/makeSizeDBLists.php: Catch a silly error (duration: 00m 39s) [09:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:15] (03PS1) 10Reedy: Update dblists! [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354435 [09:11:25] Amazing [09:11:36] 1 wiki moved to large [09:11:49] One moved medium to small [09:11:55] Or 2 even? [09:11:56] Some missing? [09:12:18] 30+ moving to medium from small [09:12:51] ITS GROWING!!! [09:13:03] what is? [09:13:12] Wikis [09:13:20] (03CR) 10Reedy: [C: 032] Update dblists! [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354435 (owner: 10Reedy) [09:13:45] did I miss something while I was disconnected? [09:13:47] oh! [09:13:53] heh [09:14:36] what are your large medium and small lists used for? [09:14:50] (03Merged) 10jenkins-bot: Update dblists! 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/354435 (owner: 10Reedy) [09:14:54] We use them for some cronjob stuffs [09:15:48] !log reedy@tin Synchronized dblists/: Update size dblists (duration: 00m 39s) [09:15:53] (03CR) 10jenkins-bot: Update dblists! [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354435 (owner: 10Reedy) [09:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:07] Time to turn off miser mode *runs* >.> <.< [09:24:03] Nemo_bis: hello. Why are you reverting all l10n updates??? [09:24:40] Nemo_bis: that overloads CI and the few reverts I have seen revert bunch of other strings [09:25:03] Nemo_bis: probably nicer to just fix l10n-bot and wait for tonight for it to magically update the repos [09:38:43] (03PS2) 10Jcrespo: mariadb: allow reimage of db2048 for upgrade to jessie [puppet] - 10https://gerrit.wikimedia.org/r/354388 [09:38:45] (03PS1) 10Jcrespo: mariadb: Test trusty install on db2049 to confirm hw issues [puppet] - 10https://gerrit.wikimedia.org/r/354438 [09:39:49] Reedy, where is cebwiki? 
[09:40:17] because for some table, it could be the largest wiki of all (templatelinks) [09:41:32] jynus, I'm not reedy, but cebwiki is in large.dblist [09:41:56] good [09:42:05] :) [09:43:00] it is nice for once to have lots of people in my timezone :-) [09:43:42] (03CR) 10Jcrespo: [C: 032] mariadb: Test trusty install on db2049 to confirm hw issues [puppet] - 10https://gerrit.wikimedia.org/r/354438 (owner: 10Jcrespo) [09:44:05] (03PS2) 10Jcrespo: mariadb: Test trusty install on db2049 to confirm hw issues [puppet] - 10https://gerrit.wikimedia.org/r/354438 [09:49:06] (03PS3) 10Thcipriani: Scap3: deploy jobrunner with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/354186 (https://phabricator.wikimedia.org/T129148) [09:51:05] (03PS4) 10Thcipriani: Scap3: deploy jobrunner with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/354186 (https://phabricator.wikimedia.org/T129148) [09:51:43] jynus: don't get used to it ;) [09:53:39] (03CR) 10Thcipriani: [C: 031] "Deployed in beta, puppet compiler happy: https://puppet-compiler.wmflabs.org/6488/ should be ready to whenever" [puppet] - 10https://gerrit.wikimedia.org/r/354186 (https://phabricator.wikimedia.org/T129148) (owner: 10Thcipriani) [10:05:49] (03PS1) 10Volans: CLI: add -i/--interactive option [software/cumin] - 10https://gerrit.wikimedia.org/r/354442 [10:07:08] !log rebooting mw2220/mw2221 for update to Linux 4.9 / HHVM 3.18 / nutcracker tests [10:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:28] <_joe_> !log moved stale repos to /srv/deployment/STALE on tin, T129290 [10:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:36] T129290: [keyresult] Migrate remaining trebuchet deployed services - https://phabricator.wikimedia.org/T129290 [10:14:55] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold 
[140.0] [10:16:36] 06Operations, 05Goal, 07kubernetes: Assigning IP space for kubernetes pod IPs - https://phabricator.wikimedia.org/T165732#3275340 (10akosiaris) [10:17:15] Zuul Gearman alert is legit. There is a large amount of mediawiki extensions changes in the pipes right now [10:18:27] 06Operations, 07Technical-Debt: Supersede RT tickets references - https://phabricator.wikimedia.org/T165733#3275388 (10Dereckson) [10:18:35] ACKNOWLEDGEMENT - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [140.0] amusso Lot of mw extensions l10n reverts going on. [10:19:49] !log powercycling mw2221, stuck in reboot and serial console unresponsive [10:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:10] (03CR) 10Jcrespo: [C: 032] mariadb: allow reimage of db2048 for upgrade to jessie [puppet] - 10https://gerrit.wikimedia.org/r/354388 (owner: 10Jcrespo) [10:20:33] 06Operations, 05Goal, 07kubernetes: Assigning IP space for kubernetes pod IPs - https://phabricator.wikimedia.org/T165732#3275406 (10akosiaris) [10:20:56] arg, I merged the wrong patch [10:20:56] hashar: from extremely vague memory the changes need to be reverted for it to work compared to just doing the fixes on translatewiki [10:21:05] (03PS1) 10Jcrespo: Revert "mariadb: allow reimage of db2048 for upgrade to jessie" [puppet] - 10https://gerrit.wikimedia.org/r/354445 [10:21:16] p858snake: so I guess there is a good reason for the revert :D [10:21:21] (03PS3) 10Jcrespo: mariadb: Test trusty install on db2049 to confirm hw issues [puppet] - 10https://gerrit.wikimedia.org/r/354438 [10:22:36] (03Abandoned) 10Dereckson: toollabs: use UNIX agnostic shebang [puppet] - 10https://gerrit.wikimedia.org/r/327709 (owner: 10Dereckson) [10:23:33] 06Operations, 10ops-codfw: mw2221 stuck after reboot - https://phabricator.wikimedia.org/T165734#3275446 
(10MoritzMuehlenhoff) [10:23:39] 06Operations, 10ops-codfw: mw2221 stuck after reboot - https://phabricator.wikimedia.org/T165734#3275460 (10MoritzMuehlenhoff) p:05Triage>03Normal [10:24:39] ACKNOWLEDGEMENT - Host mw2221 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T165734 [10:30:56] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#3275482 (10Ottomata) I have a question about the new profile guidelines: > Profile classes should only have parameters that default to an explicit hier... [10:31:35] !log restarting elasticsearch on relforge1001 to pull in remote reindex [10:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:33] 06Operations, 10Analytics, 10Traffic, 15User-Elukey: Update Varnishkafka to support TLS encryption/authentication - https://phabricator.wikimedia.org/T165736#3275487 (10elukey) [10:33:54] 06Operations, 10Analytics, 10Traffic, 15User-Elukey: Update Varnishkafka to support TLS encryption/authentication - https://phabricator.wikimedia.org/T165736#3275506 (10elukey) [10:34:02] (03CR) 10Jcrespo: [V: 032 C: 032] mariadb: Test trusty install on db2049 to confirm hw issues [puppet] - 10https://gerrit.wikimedia.org/r/354438 (owner: 10Jcrespo) [10:34:26] (03CR) 10Jcrespo: [V: 032 C: 032] Revert "mariadb: allow reimage of db2048 for upgrade to jessie" [puppet] - 10https://gerrit.wikimedia.org/r/354445 (owner: 10Jcrespo) [10:34:30] (03PS2) 10Jcrespo: Revert "mariadb: allow reimage of db2048 for upgrade to jessie" [puppet] - 10https://gerrit.wikimedia.org/r/354445 [10:34:34] (03CR) 10Jcrespo: [V: 032 C: 032] Revert "mariadb: allow reimage of db2048 for upgrade to jessie" [puppet] - 10https://gerrit.wikimedia.org/r/354445 (owner: 10Jcrespo) [10:41:35] (03PS1) 10Jcrespo: Revert "Revert "mariadb: allow reimage of db2048 for upgrade to jessie"" [puppet] - 10https://gerrit.wikimedia.org/r/354448 [10:41:52] 
(03CR) 10Jcrespo: [C: 04-2] "Not ready yet." [puppet] - 10https://gerrit.wikimedia.org/r/354448 (owner: 10Jcrespo) [10:46:43] 06Operations, 10Analytics, 10Analytics-Cluster, 10Traffic, 15User-Elukey: Encrypt Kafka traffic, and restrict access via ACLs - https://phabricator.wikimedia.org/T121561#3275543 (10elukey) [10:49:53] (03PS1) 10Elukey: [WIP] Refactor zookeeper roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/354449 [10:54:39] 06Operations, 07HHVM: Nutcracker doesn't start at boot - https://phabricator.wikimedia.org/T163795#3275550 (10MoritzMuehlenhoff) The new package fixes that (tested by rebooting two servers with and without the new package:) root@mw2222:~# dpkg -l nutcracker ii nutcracker 0.4.1-1+wm3... [11:05:09] !log uploaded nutcracker 0.4.1-1+wm3~jessie1 to apt.wikimedia.org (T163795) [11:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:19] T163795: Nutcracker doesn't start at boot - https://phabricator.wikimedia.org/T163795 [11:06:29] (03PS3) 10Alexandros Kosiaris: Assign the kubernetes pod IPs in DNS [dns] - 10https://gerrit.wikimedia.org/r/341794 (https://phabricator.wikimedia.org/T165732) [11:11:31] (03CR) 10Alexandros Kosiaris: [C: 031] "And with some more change now we have:" [dns] - 10https://gerrit.wikimedia.org/r/341794 (https://phabricator.wikimedia.org/T165732) (owner: 10Alexandros Kosiaris) [11:11:35] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:12:25] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1019 is OK: OK ferm input default policy is set [11:12:56] 06Operations, 05Goal, 13Patch-For-Review, 07kubernetes: Assigning IP space for kubernetes pod IPs - https://phabricator.wikimedia.org/T165732#3275571 (10akosiaris) With the patch above we have: * production clusters (codfw + eqiad) IPv4, IPv6 pod IPs assigned * staging cluster (eqiad) IPv4, IPv6 pod IPs a... [11:16:31] (03PS1) 10Elukey: Remove any reference of mc1001->mc1018 for decom [puppet] - 10https://gerrit.wikimedia.org/r/354453 (https://phabricator.wikimedia.org/T164341) [11:17:07] 06Operations, 10ops-codfw: db2049 cannot install jessie - let's try upgrading the firmware first - https://phabricator.wikimedia.org/T165739#3275580 (10jcrespo) [11:18:06] 06Operations, 10ops-codfw: db2049 cannot install jessie - let's try upgrading the firmware first - https://phabricator.wikimedia.org/T165739#3275594 (10jcrespo) We tested installing trusty and it worked ok. > it might a case of the newer kernel having more stringent checks, which expose some hardwar... [11:31:03] 06Operations, 10ops-codfw: db2049 cannot install jessie - let's try upgrading the firmware first - https://phabricator.wikimedia.org/T165739#3275616 (10jcrespo) No relevant recent logs: ``` 7 Caution POST Message 11/16/2016 17:07 11/16/2016 17:07 1 POST Error: 1792-Slot X Drive Array - Valid Data Found in C... 
[11:32:55] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] [11:39:03] !log Deploy alter table s2.revision table on labsdb1003 - T162611 [11:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:12] T162611: Unify revision table on s2 - https://phabricator.wikimedia.org/T162611 [11:56:32] 07Puppet, 10Beta-Cluster-Infrastructure, 06Labs, 06Release-Engineering-Team (Next), 15User-Joe: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675#3275656 (10greg) [12:02:54] (03PS2) 10Giuseppe Lavagetto: Add netlink-based Ipvsmanager implementation [WiP] [debs/pybal] - 10https://gerrit.wikimedia.org/r/302882 [12:10:58] (03PS5) 10Thcipriani: Scap3: deploy jobrunner with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/354186 (https://phabricator.wikimedia.org/T129148) [12:15:57] (03PS6) 10Thcipriani: Scap3: deploy jobrunner with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/354186 (https://phabricator.wikimedia.org/T129148) [12:16:55] (03PS2) 10Filippo Giunchedi: prometheus: report puppet agent stats [puppet] - 10https://gerrit.wikimedia.org/r/354007 [12:16:58] (03PS1) 10Filippo Giunchedi: base: report prometheus agent stats [puppet] - 10https://gerrit.wikimedia.org/r/354457 [12:35:31] (03PS1) 10Filippo Giunchedi: prometheus: add alertmanager_url to prometheus server [puppet] - 10https://gerrit.wikimedia.org/r/354459 [12:35:33] (03PS1) 10Filippo Giunchedi: role: use alertmanager in beta prometheus [puppet] - 10https://gerrit.wikimedia.org/r/354460 [12:36:07] (03PS1) 10Marostegui: db-codfw.php: Depool db2068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354462 (https://phabricator.wikimedia.org/T165743) [12:37:46] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2068 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354462 
(https://phabricator.wikimedia.org/T165743) (owner: 10Marostegui) [12:40:08] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2068 - T165743 (duration: 00m 40s) [12:40:10] !log Deploy alter table s7.frwiktionary on db2068 - T165743 [12:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:17] T165743: frwiktionary on s7 still needs fixing on the revision table - https://phabricator.wikimedia.org/T165743 [12:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:01] (03PS1) 10Giuseppe Lavagetto: Add notes on branching/releasing [debs/pybal] - 10https://gerrit.wikimedia.org/r/354464 [12:42:57] <_joe_> ema: ^^ [12:50:31] (03PS1) 10Thcipriani: Deployment via scap3 [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/354466 (https://phabricator.wikimedia.org/T165748) [12:51:13] jouncebot: refresh [12:51:15] I refreshed my knowledge about deployments. [12:53:42] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2068" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354467 [12:56:44] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2068" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354467 (owner: 10Marostegui) [12:57:49] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2068" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354467 (owner: 10Marostegui) [12:57:59] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2068" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354467 (owner: 10Marostegui) [12:58:39] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2068 - T165743 (duration: 00m 39s) [12:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:47] T165743: frwiktionary on s7 still needs fixing on the revision table - https://phabricator.wikimedia.org/T165743 [13:08:33] (03PS1) 10Thcipriani: Scap3: deploy logstash/plugins with scap3 [puppet] - 
10https://gerrit.wikimedia.org/r/354472 (https://phabricator.wikimedia.org/T165748) [13:09:19] !log downgraded mw1161 to HHVM 3.12 (crashes often compared to app servers, downgrade over the weekend) [13:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:35] PROBLEM - puppet last run on ms-be1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:17:07] (03PS1) 10DCausse: [wikitech] Increase weight on Tool and Nova Resource ns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354474 (https://phabricator.wikimedia.org/T165725) [13:26:01] (03PS1) 10Marostegui: db-codfw.php: Depool db2047 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354476 (https://phabricator.wikimedia.org/T165743) [13:27:15] (03CR) 10Ema: [C: 031] "Looks good!" [debs/pybal] - 10https://gerrit.wikimedia.org/r/354464 (owner: 10Giuseppe Lavagetto) [13:28:53] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2047 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354476 (https://phabricator.wikimedia.org/T165743) (owner: 10Marostegui) [13:30:41] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2047 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354476 (https://phabricator.wikimedia.org/T165743) (owner: 10Marostegui) [13:30:52] (03CR) 10jenkins-bot: db-codfw.php: Depool db2047 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354476 (https://phabricator.wikimedia.org/T165743) (owner: 10Marostegui) [13:31:32] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2047 - T165743 (duration: 00m 39s) [13:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:40] T165743: frwiktionary on s7 still needs fixing on the revision table - https://phabricator.wikimedia.org/T165743 [13:34:50] (03PS2) 10Giuseppe Lavagetto: Add notes on branching/releasing [debs/pybal] - 10https://gerrit.wikimedia.org/r/354464 [13:39:35] RECOVERY - puppet last run on ms-be1030 
is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [13:41:05] (03PS1) 10Giuseppe Lavagetto: Change the default LVS BGP behavior per service [debs/pybal] - 10https://gerrit.wikimedia.org/r/354480 [13:41:07] (03PS1) 10Giuseppe Lavagetto: Add unit tests for DNSQueryMonitoringProtocol [debs/pybal] - 10https://gerrit.wikimedia.org/r/354481 [13:47:14] !log Deploy alter table s7.frwiktionary db1033 - T165743 [13:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:22] T165743: frwiktionary on s7 still needs fixing on the revision table - https://phabricator.wikimedia.org/T165743 [13:47:56] (03PS1) 10Chad: gerrit (2.13.8+git1-wmf.1) jessie-wikimedia; urgency=medium [debs/gerrit] - 10https://gerrit.wikimedia.org/r/354485 [13:54:43] (03PS1) 10Marostegui: db-codfw.php: Repool db2047 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354487 (https://phabricator.wikimedia.org/T165743) [13:56:03] (03CR) 10Marostegui: [C: 032] db-codfw.php: Repool db2047 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354487 (https://phabricator.wikimedia.org/T165743) (owner: 10Marostegui) [13:57:00] (03Merged) 10jenkins-bot: db-codfw.php: Repool db2047 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354487 (https://phabricator.wikimedia.org/T165743) (owner: 10Marostegui) [13:57:09] (03CR) 10jenkins-bot: db-codfw.php: Repool db2047 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354487 (https://phabricator.wikimedia.org/T165743) (owner: 10Marostegui) [13:58:03] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2047 - T165743 (duration: 00m 38s) [13:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:13] T165743: frwiktionary on s7 still needs fixing on the revision table - https://phabricator.wikimedia.org/T165743 [13:58:21] (03PS2) 10Chad: gerrit (2.13.8+git1-wmf.1) jessie-wikimedia; urgency=medium [debs/gerrit] - 10https://gerrit.wikimedia.org/r/354485 
(https://phabricator.wikimedia.org/T158946) [14:01:10] (03PS1) 10Hoo man: Use kill -- -$$ to kill a process group in dumpwikidata scripts [puppet] - 10https://gerrit.wikimedia.org/r/354489 [14:04:08] 06Operations, 10Pybal, 10Traffic: Fully-redundant LVS clusters using Pybal per-service MED feature - https://phabricator.wikimedia.org/T165764#3276237 (10BBlack) [14:09:56] (03PS1) 10Hoo man: dumpWikidata: Make the minimum shard size depend on the number of shards [puppet] - 10https://gerrit.wikimedia.org/r/354494 [14:10:36] PROBLEM - puppet last run on ms-be1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:11:45] PROBLEM - puppet last run on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:12:35] RECOVERY - puppet last run on ms-be1019 is OK: OK: Puppet is currently enabled, last run 13 minutes ago with 0 failures [14:20:26] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.15 seconds [14:20:52] 06Operations, 10Traffic: Refactor pybal/LVS config for shared failover - https://phabricator.wikimedia.org/T165765#3276305 (10BBlack) [14:21:39] 06Operations, 10Traffic: Refactor pybal/LVS config for shared failover - https://phabricator.wikimedia.org/T165765#3276336 (10BBlack) [14:22:15] PROBLEM - cassandra-a CQL 10.64.0.36:9042 on restbase-dev1001 is CRITICAL: connect to address 10.64.0.36 and port 9042: Connection refused [14:22:25] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 349.14 seconds [14:22:35] PROBLEM - cassandra-a SSL 10.64.0.36:7001 on restbase-dev1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [14:22:45] got that ^^ [14:22:45] PROBLEM - Check systemd state on restbase-dev1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[14:23:45] RECOVERY - Check systemd state on restbase-dev1001 is OK: OK - running: The system is fully operational [14:24:15] RECOVERY - cassandra-a CQL 10.64.0.36:9042 on restbase-dev1001 is OK: TCP OK - 0.036 second response time on 10.64.0.36 port 9042 [14:24:36] RECOVERY - cassandra-a SSL 10.64.0.36:7001 on restbase-dev1001 is OK: SSL OK - Certificate restbase-dev1001-a valid until 2018-01-05 22:53:02 +0000 (expires in 231 days) [14:30:49] (03CR) 10Paladox: [C: 031] gerrit (2.13.8+git1-wmf.1) jessie-wikimedia; urgency=medium [debs/gerrit] - 10https://gerrit.wikimedia.org/r/354485 (https://phabricator.wikimedia.org/T158946) (owner: 10Chad) [14:34:48] (03PS3) 10Giuseppe Lavagetto: Add notes on branching/releasing [debs/pybal] - 10https://gerrit.wikimedia.org/r/354464 [14:34:50] (03PS2) 10Giuseppe Lavagetto: Change the default LVS BGP behavior per service [debs/pybal] - 10https://gerrit.wikimedia.org/r/354480 [14:34:52] (03PS2) 10Giuseppe Lavagetto: Add unit tests for DNSQueryMonitoringProtocol [debs/pybal] - 10https://gerrit.wikimedia.org/r/354481 [14:38:25] RECOVERY - MariaDB Slave Lag: s4 on db2037 is OK: OK slave_sql_lag Replication lag: 46.00 seconds [14:38:25] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 324.24 seconds [14:39:45] RECOVERY - puppet last run on ms-be1028 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [14:42:42] (03CR) 10ArielGlenn: [C: 032] Use kill -- -$$ to kill a process group in dumpwikidata scripts [puppet] - 10https://gerrit.wikimedia.org/r/354489 (owner: 10Hoo man) [14:43:25] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 316.24 seconds [14:43:28] (03CR) 10ArielGlenn: [C: 032] dumpWikidata: Make the minimum shard size depend on the number of shards [puppet] - 10https://gerrit.wikimedia.org/r/354494 (owner: 10Hoo man) [14:43:38] (03CR) 10Giuseppe Lavagetto: [C: 032] Add notes on branching/releasing 
[debs/pybal] - 10https://gerrit.wikimedia.org/r/354464 (owner: 10Giuseppe Lavagetto) [14:44:08] (03CR) 10Giuseppe Lavagetto: [C: 032] Change the default LVS BGP behavior per service [debs/pybal] - 10https://gerrit.wikimedia.org/r/354480 (owner: 10Giuseppe Lavagetto) [14:45:25] RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 0.19 seconds [14:45:54] (03CR) 10Giuseppe Lavagetto: [C: 032] Add unit tests for DNSQueryMonitoringProtocol [debs/pybal] - 10https://gerrit.wikimedia.org/r/354481 (owner: 10Giuseppe Lavagetto) [14:50:44] 06Operations, 10DBA, 06Performance-Team, 10Traffic: Cache invalidations coming from the JobQueue are causing slowdown on masters and lag on several wikis, and impact on varnish - https://phabricator.wikimedia.org/T164173#3276493 (10jcrespo) 05declined>03Open This just happened again on s4. [14:52:37] 06Operations, 10DBA, 06Performance-Team, 10Traffic: Cache invalidations coming from the JobQueue are causing slowdown on masters and lag on several wikis, and impact on varnish - https://phabricator.wikimedia.org/T164173#3224448 (10Marostegui) Some graphs that were shown while troublshooting https://grafa... [14:53:35] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2102826 [14:54:28] 06Operations, 10DBA, 06Performance-Team, 10Traffic: Cache invalidations coming from the JobQueue are causing slowdown on masters and lag on several wikis, and impact on varnish - https://phabricator.wikimedia.org/T164173#3276528 (10jcrespo) p:05Low>03Triage This is probably not user-requested invalidat... 
[15:05:41] (03PS1) 10Ema: Bump version number in setup.py [debs/pybal] - 10https://gerrit.wikimedia.org/r/354502 [15:07:01] (03CR) 10Giuseppe Lavagetto: [C: 032] Bump version number in setup.py [debs/pybal] - 10https://gerrit.wikimedia.org/r/354502 (owner: 10Ema) [15:08:27] (03Abandoned) 10Ema: Bump version number in setup.py [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/354180 (owner: 10Ema) [15:08:38] (03PS1) 10Giuseppe Lavagetto: Split IPVS Manager into the interface and manager implementation [debs/pybal] (2.0-dev) - 10https://gerrit.wikimedia.org/r/354506 [15:08:40] (03PS1) 10Giuseppe Lavagetto: Add IPVSError as a generic IPVS-related exception [debs/pybal] (2.0-dev) - 10https://gerrit.wikimedia.org/r/354507 [15:08:41] (03PS1) 10Giuseppe Lavagetto: Add generic Finite States Machine [debs/pybal] (2.0-dev) - 10https://gerrit.wikimedia.org/r/354508 [15:08:44] (03PS1) 10Giuseppe Lavagetto: Add netlink-based Ipvsmanager implementation [WiP] [debs/pybal] (2.0-dev) - 10https://gerrit.wikimedia.org/r/354509 [15:08:48] (03Merged) 10jenkins-bot: Bump version number in setup.py [debs/pybal] - 10https://gerrit.wikimedia.org/r/354502 (owner: 10Ema) [15:09:18] (03Abandoned) 10Giuseppe Lavagetto: Split IPVS Manager into the interface and manager implementation [debs/pybal] - 10https://gerrit.wikimedia.org/r/302434 (owner: 10Giuseppe Lavagetto) [15:16:28] 06Operations, 10DBA, 06Performance-Team, 10Traffic: Cache invalidations coming from the JobQueue are causing slowdown on masters and lag on several wikis, and impact on varnish - https://phabricator.wikimedia.org/T164173#3276584 (10jcrespo) Without entering on heavy rearchitectures, we should, an probably... [15:23:40] (03PS1) 10Andrew Bogott: openstackclients: add an optional project arg to allinstances() [puppet] - 10https://gerrit.wikimedia.org/r/354515 [15:23:42] (03PS1) 10Andrew Bogott: novastats: Update some reports to use more up-to-date code. 
[puppet] - 10https://gerrit.wikimedia.org/r/354516 [15:24:09] (03PS2) 10Ema: Split IPVS Manager into the interface and manager implementation [debs/pybal] (2.0-dev) - 10https://gerrit.wikimedia.org/r/354506 (owner: 10Giuseppe Lavagetto) [15:26:37] 06Operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 06Performance-Team, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#3276610 (10BBlack) 05Resolved>03Open Not resolved, as the purge graphs can attest! [15:26:42] 06Operations, 10Traffic: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#3276612 (10BBlack) [15:26:46] (03CR) 10jerkins-bot: [V: 04-1] novastats: Update some reports to use more up-to-date code. [puppet] - 10https://gerrit.wikimedia.org/r/354516 (owner: 10Andrew Bogott) [15:36:10] (03PS2) 10Dzahn: Planet: Delete sr.planet [puppet] - 10https://gerrit.wikimedia.org/r/350242 (owner: 10Chad) [15:36:45] 06Operations, 10ops-eqiad, 06Labs: rack/setup/install labnet100[34] - https://phabricator.wikimedia.org/T165779#3276633 (10RobH) [15:37:15] 06Operations, 10DBA, 06Performance-Team, 10Traffic: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3276650 (10jcrespo) [15:38:29] (03CR) 10Dzahn: [C: 032] Planet: Delete sr.planet [puppet] - 10https://gerrit.wikimedia.org/r/350242 (owner: 10Chad) [15:39:06] 06Operations, 06Labs, 10hardware-requests: Eqiad: (2) hardware access request for labnet1003/1004 - https://phabricator.wikimedia.org/T158204#3276658 (10RobH) 05Open>03Resolved These systems have been ordered on T163822 and installation will progress on T165779. Resolving this request, as its being gran... 
[15:40:58] !log planet10001 - manually deleting cron job for deleted sr.planet (should puppetize the "absence" too) [15:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:02] (03PS2) 10Dzahn: Drop sr.planet from dns, it's moribund [dns] - 10https://gerrit.wikimedia.org/r/350243 (owner: 10Chad) [15:45:01] (03CR) 10Dzahn: [C: 032] Drop sr.planet from dns, it's moribund [dns] - 10https://gerrit.wikimedia.org/r/350243 (owner: 10Chad) [15:52:21] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure: rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#3276700 (10RobH) [15:52:55] 06Operations, 06Labs, 10hardware-requests: Eqiad: (2) hardware access request for labcontrol1003/1004 - https://phabricator.wikimedia.org/T158207#3276717 (10RobH) 05Open>03Resolved These systems have been ordered on T163031 and will be setup on T165781. [15:53:15] mutante: RainbowSprinkles: you also dropped it from puppet repo? [15:53:29] (03PS2) 10Ema: Add IPVSError as a generic IPVS-related exception [debs/pybal] (2.0-dev) - 10https://gerrit.wikimedia.org/r/354507 (owner: 10Giuseppe Lavagetto) [15:55:19] (03PS2) 10Ema: Add generic Finite States Machine [debs/pybal] (2.0-dev) - 10https://gerrit.wikimedia.org/r/354508 (owner: 10Giuseppe Lavagetto) [15:55:28] (03PS2) 10Ema: Add netlink-based Ipvsmanager implementation [WiP] [debs/pybal] (2.0-dev) - 10https://gerrit.wikimedia.org/r/354509 (owner: 10Giuseppe Lavagetto) [15:56:51] Dereckson: yes, just merged that up there a few minutes earlier [15:57:28] ok [15:58:11] (03CR) 10Ema: [C: 031] Split IPVS Manager into the interface and manager implementation [debs/pybal] (2.0-dev) - 10https://gerrit.wikimedia.org/r/354506 (owner: 10Giuseppe Lavagetto) [15:58:21] (03CR) 10Ema: [C: 031] Add IPVSError as a generic IPVS-related exception [debs/pybal] (2.0-dev) - 10https://gerrit.wikimedia.org/r/354507 (owner: 10Giuseppe Lavagetto) [15:58:26] 06Operations, 10ops-eqiad, 06Labs, 
10procurement: rack/setup/install labmon1003 - https://phabricator.wikimedia.org/T165784#3276770 (10RobH) [15:59:41] 06Operations, 10ops-eqiad, 06Labs, 10procurement: rack/setup/install labmon1003 - https://phabricator.wikimedia.org/T165784#3276770 (10RobH) Please note that once the initial onsite-specific steps are done (steps up do the network port setup), I can handle the operations/puppet repo updates and install the... [15:59:55] 06Operations, 10ops-eqiad, 06Labs: rack/setup/install labmon1003 - https://phabricator.wikimedia.org/T165784#3276806 (10RobH) [16:01:56] 06Operations, 10DBA, 06Performance-Team, 10Traffic: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3276828 (10jcrespo) Lots of category pages invalidations happening at that time: ``` UPDATE /* Title::invalidateCache */ `page` SE... [16:02:44] 06Operations, 06Labs, 10hardware-requests: eqiad: (1) hardware access request for dedicated labmon1002 - https://phabricator.wikimedia.org/T161750#3276850 (10RobH) 05Open>03Resolved This has been ordered on T163808 and its setup will be tracked on T165784. [16:04:52] 06Operations, 10DBA, 06Performance-Team, 10Traffic: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3276894 (10jcrespo) For the long term: how useful is this field, and could it be separated from the rest of the table if it happens t... [16:07:20] Request from 217.196.74.137 via cp3041 cp3041, Varnish XID 186936691 [16:07:23] Error: 503, Service Unavailable at Fri, 19 May 2017 16:06:48 GMT [16:08:22] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2058 - https://phabricator.wikimedia.org/T165629#3276925 (10Papaul) a:05Papaul>03Marostegui Disk replacement complete. Return information for bad disk attached. {F8123582} [16:23:09] _joe_: where can i find you? 
[16:23:13] 06Operations, 10ops-codfw: mw2221 stuck after reboot - https://phabricator.wikimedia.org/T165734#3277017 (10Papaul) a:05Papaul>03MoritzMuehlenhoff - Removed both PSU for about 5 minutes - Update the IDRAC firmware from 2.30 to 2.41 - update BIOS from 1.6.1 to 2.4.2 System is back up again. [16:25:55] PROBLEM - Check whether ferm is active by checking the default input chain on mw2221 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [16:26:15] PROBLEM - Check systemd state on mw2221 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:28:39] i'm constantly getting 500 when trying to load my watchlist. for about 30 mins [16:30:07] Danny_B some of the ops are currently at the hackathon. [16:30:42] Danny_B where are you experiencing this? [16:30:46] Works for me on https://en.wikipedia.org/wiki/Special:Watchlist [16:31:15] paladox: i know, but i can't find them [16:31:22] Oh [16:31:33] hence why i'm trying to poke it here [16:32:23] ok [16:33:47] (03PS2) 10Andrew Bogott: openstackclients: add an optional project arg to allinstances() [puppet] - 10https://gerrit.wikimedia.org/r/354515 [16:33:49] (03PS2) 10Andrew Bogott: novastats: Update some reports to use more up-to-date code. [puppet] - 10https://gerrit.wikimedia.org/r/354516 [16:34:07] Danny_B is it intermittent? Could you create a task in phab please? [16:35:50] (03PS1) 10Reedy: Throttle rule for Wikimedia Hackathon Vienna... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354532 [16:36:39] Danny_B: They're hiding somewhere apparently [16:37:25] (03CR) 10jerkins-bot: [V: 04-1] Throttle rule for Wikimedia Hackathon Vienna... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354532 (owner: 10Reedy) [16:37:51] (03PS2) 10Reedy: Throttle rule for Wikimedia Hackathon Vienna... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354532 [16:38:55] Danny_B: Big watchlist? [16:40:19] Reedy: wouldn't say so...
i don't remember particular number, but i guess less than 300 [16:40:33] it was working properly 5 days ago [16:41:18] (03CR) 10Reedy: [C: 032] Throttle rule for Wikimedia Hackathon Vienna... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354532 (owner: 10Reedy) [16:42:34] https://logstash.wikimedia.org/goto/a4f68641242e49dcadf151aa316f609e has some information related to Danny_B's issue [16:42:53] (03Merged) 10jenkins-bot: Throttle rule for Wikimedia Hackathon Vienna... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354532 (owner: 10Reedy) [16:43:06] (03CR) 10jenkins-bot: Throttle rule for Wikimedia Hackathon Vienna... [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354532 (owner: 10Reedy) [16:44:32] https://gerrit.wikimedia.org/r/#/c/350914/ [16:44:39] possible fix for the issue ^^ [16:44:49] !log reedy@tin Synchronized wmf-config/throttle.php: Wikimedia Vienna Hackathon (duration: 00m 39s) [16:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:11] https://phabricator.wikimedia.org/T164059 [16:46:09] gwicke for the issue Danny_B is having? [16:46:20] yes [16:49:16] the issue is obviously with enhanced recent changes / watchlist [16:51:49] Wait is that issue still unresolved?
[16:52:13] obviously yes [16:52:25] because normal watchlist works for me now [16:52:31] enhanced not [17:16:45] PROBLEM - mediawiki-installation DSH group on mw2221 is CRITICAL: Host mw2221 is not in mediawiki-installation dsh group [17:22:15] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [17:24:48] I'm just kind of surprised its not fixed, seemed kind of trivial [17:33:15] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [17:35:23] (03PS1) 10Ema: Set empty PYTHONPATH in tox.ini [debs/pybal] - 10https://gerrit.wikimedia.org/r/354547 [17:38:04] (03PS1) 10Dzahn: wikistats: make db_pass a parameter, use fqdn_rand_string [puppet] - 10https://gerrit.wikimedia.org/r/354548 [17:39:32] (03CR) 10jerkins-bot: [V: 04-1] wikistats: make db_pass a parameter, use fqdn_rand_string [puppet] - 10https://gerrit.wikimedia.org/r/354548 (owner: 10Dzahn) [17:40:21] (03PS2) 10Dzahn: wikistats: make db_pass a parameter, use fqdn_rand_string [puppet] - 10https://gerrit.wikimedia.org/r/354548 [17:41:10] (03PS3) 10Dzahn: wikistats: make db_pass a parameter, use fqdn_rand_string [puppet] - 10https://gerrit.wikimedia.org/r/354548 [17:41:15] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:43:35] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 233882 [17:44:31] bawolff_: you are more than welcome to take it over... 
[17:52:13] (03PS1) 10Dereckson: Always show latest revision even if not reviewed on hu.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354549 (https://phabricator.wikimedia.org/T121995) [18:14:25] PROBLEM - Check HHVM threads for leakage on mw1295 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [18:15:25] RECOVERY - Check HHVM threads for leakage on mw1295 is OK: OK [18:21:55] PROBLEM - MD RAID on mw1294 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:22:45] RECOVERY - MD RAID on mw1294 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [18:28:45] PROBLEM - HHVM rendering on mw1295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:29:35] RECOVERY - HHVM rendering on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 80292 bytes in 1.303 second response time [18:31:55] PROBLEM - Check systemd state on mw1294 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:32:45] RECOVERY - Check systemd state on mw1294 is OK: OK - running: The system is fully operational [19:10:17] 06Operations, 10Traffic, 10fundraising-tech-ops: Fix nits in HTTPS/HSTS configs in wikimedia.org domain - https://phabricator.wikimedia.org/T137161#3277249 (10Jgreen) [19:10:43] 06Operations, 10Traffic, 10fundraising-tech-ops: Fix nits in HTTPS/HSTS configs in wikimedia.org domain - https://phabricator.wikimedia.org/T137161#2359459 (10Jgreen) [19:11:22] 06Operations, 10Traffic, 10fundraising-tech-ops: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#2359459 (10Jgreen) [19:13:11] 06Operations, 10Traffic, 10fundraising-tech-ops: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3277260 (10Jgreen) a:05Jgreen>03None This isn't something fr-tech-ops can fix, it's an external site. 
[19:13:36] PROBLEM - nova-compute process on labvirt1009 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [19:14:35] RECOVERY - nova-compute process on labvirt1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [19:19:43] 06Operations, 10Traffic, 10fundraising-tech-ops: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3277275 (10Krinkle) [19:24:19] (03PS4) 10Dzahn: wikistats: make db_pass a parameter, use fqdn_rand_string [puppet] - 10https://gerrit.wikimedia.org/r/354548 [19:28:42] (03PS5) 10Dzahn: wikistats: make db_pass a parameter, use fqdn_rand_string [puppet] - 10https://gerrit.wikimedia.org/r/354548 [19:30:40] (03PS6) 10Dzahn: wikistats: make db_pass a parameter, use fqdn_rand_string [puppet] - 10https://gerrit.wikimedia.org/r/354548 [19:34:59] (03PS7) 10Dzahn: wikistats: make db_pass a parameter, use fqdn_rand_string [puppet] - 10https://gerrit.wikimedia.org/r/354548 [19:42:27] (03CR) 10Dzahn: [C: 032] wikistats: make db_pass a parameter, use fqdn_rand_string [puppet] - 10https://gerrit.wikimedia.org/r/354548 (owner: 10Dzahn) [19:56:05] (03PS1) 10Dzahn: wikistats: add missing .erb file extension to grants.sql [puppet] - 10https://gerrit.wikimedia.org/r/354567 [19:56:10] 06Operations, 10DBA, 06Performance-Team, 10Traffic: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3277390 (10aaron) The query does not come from HTMLCacheUpdateJob (which calls HTMLCacheUpdateJob::invalidateTitles) or seemingly any... 
[19:56:37] (03CR) 10Dzahn: [V: 032 C: 032] wikistats: add missing .erb file extension to grants.sql [puppet] - 10https://gerrit.wikimedia.org/r/354567 (owner: 10Dzahn) [20:02:46] 06Operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 06Performance-Team, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#3277398 (10aaron) a:05aaron>03None [20:03:54] 06Operations, 10DBA, 06Performance-Team, 10Traffic, 10Wikidata: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3277400 (10aaron) [20:07:34] (03PS1) 10Dzahn: wikistats: fix typo in db.pp "requires" -> "require" [puppet] - 10https://gerrit.wikimedia.org/r/354568 [20:12:58] (03CR) 10Dzahn: [C: 032] wikistats: fix typo in db.pp "requires" -> "require" [puppet] - 10https://gerrit.wikimedia.org/r/354568 (owner: 10Dzahn) [20:43:52] (03PS1) 10Dereckson: Fix hy.wikipedia high resolution logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354586 (https://phabricator.wikimedia.org/T165811) [20:46:05] (03PS1) 10Dzahn: wikistats: use systemuser for git cloning [puppet] - 10https://gerrit.wikimedia.org/r/354588 [21:00:29] (03CR) 10Dzahn: [C: 032] wikistats: use systemuser for git cloning [puppet] - 10https://gerrit.wikimedia.org/r/354588 (owner: 10Dzahn) [21:05:47] (03PS1) 10Dzahn: wikistats: 'user' -> 'owner' parameter for /srv/wikistats [puppet] - 10https://gerrit.wikimedia.org/r/354591 [21:07:01] (03CR) 10Paladox: [C: 031] wikistats: 'user' -> 'owner' parameter for /srv/wikistats [puppet] - 10https://gerrit.wikimedia.org/r/354591 (owner: 10Dzahn) [21:07:16] (03CR) 10Dzahn: [C: 032] wikistats: 'user' -> 'owner' parameter for /srv/wikistats [puppet] - 10https://gerrit.wikimedia.org/r/354591 (owner: 10Dzahn) [21:17:35] PROBLEM - puppet last run on ms-be2029 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[parted-/dev/sdf] [21:24:23] (03PS1) 10Reedy: Remove wikimedia-periodic-update.sh [puppet] - 10https://gerrit.wikimedia.org/r/354596 [21:46:45] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [21:46:46] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [21:46:48] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [21:46:48] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [21:46:48] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [21:46:48] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [21:47:55] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received [21:48:45] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy [21:48:45] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy [21:48:45] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy [21:48:45] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [21:48:45] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [21:48:45] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [21:48:45] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy [21:56:51] (03PS1) 10Nemo bis: Remove $wgEnableValidationStatisticsUpdates from FlaggedRevs config [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/354600 [22:26:20] (03PS1) 10Dzahn: wikistats: puppetize missing dir in /usr/lib/ [puppet] - 10https://gerrit.wikimedia.org/r/354603 [22:27:17] (03CR) 10jerkins-bot: [V: 04-1] wikistats: puppetize missing dir in /usr/lib/ [puppet] - 10https://gerrit.wikimedia.org/r/354603 (owner: 10Dzahn) [22:27:26] (03CR) 10Paladox: wikistats: puppetize missing dir in /usr/lib/ (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/354603 (owner: 10Dzahn) [22:28:08] (03PS2) 10Dzahn: wikistats: puppetize missing dir in /usr/lib/ [puppet] - 10https://gerrit.wikimedia.org/r/354603 [22:29:29] (03CR) 10Paladox: [C: 031] "lgtm :)" [puppet] - 10https://gerrit.wikimedia.org/r/354603 (owner: 10Dzahn) [22:33:24] (03CR) 10Dzahn: "actually.. no.. that should be the home dir of the system user so it should be created when the user gets created." [puppet] - 10https://gerrit.wikimedia.org/r/354603 (owner: 10Dzahn) [22:40:43] (03PS3) 10Dzahn: wikistats: ensure systemuser exists before backup dir [puppet] - 10https://gerrit.wikimedia.org/r/354603 [22:41:08] (03CR) 10Paladox: [C: 031] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/354603 (owner: 10Dzahn) [22:42:28] (03CR) 10Dzahn: [C: 032] wikistats: ensure systemuser exists before backup dir [puppet] - 10https://gerrit.wikimedia.org/r/354603 (owner: 10Dzahn) [22:58:13] (03PS2) 10Dereckson: Fix hy.wikipedia high resolution logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354586 (https://phabricator.wikimedia.org/T165811) [23:07:58] (03PS1) 10Nemo bis: Restore default $wgFlaggedRevsStatsAge (2 hours) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354608 (https://phabricator.wikimedia.org/T163107) [23:09:40] (03CR) 10Nemo bis: "Aaron, do you know what this was meant to do? 
See also f3ac9d067da1b8c27f94050cc9bc0251210d8415" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354608 (https://phabricator.wikimedia.org/T163107) (owner: 10Nemo bis) [23:09:46] (03Draft1) 10Paladox: Wikistats: Require /usr/lib/wikistats/schema.sql before executing mysql command [puppet] - 10https://gerrit.wikimedia.org/r/354609 [23:09:48] (03PS2) 10Paladox: Wikistats: Require /usr/lib/wikistats/schema.sql before executing mysql command [puppet] - 10https://gerrit.wikimedia.org/r/354609 [23:11:46] (03PS3) 10Paladox: wikistats: Require /usr/lib/wikistats/schema.sql before executing mysql command [puppet] - 10https://gerrit.wikimedia.org/r/354609 [23:13:26] (03CR) 10Dzahn: [C: 032] "yep, we needed that. thx" [puppet] - 10https://gerrit.wikimedia.org/r/354609 (owner: 10Paladox) [23:46:12] (03PS1) 10BryanDavis: Add Code of Conduct footer links to wikitech and mw.o [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354612 [23:54:55] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received [23:55:45] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [23:55:55] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [23:55:55] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [23:55:55] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [23:56:55] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [23:56:55] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [23:57:45] RECOVERY 
- citoid endpoints health on scb2002 is OK: All endpoints are healthy [23:57:45] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [23:57:45] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy [23:57:45] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [23:57:45] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy [23:57:45] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy [23:57:45] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy