[00:04:35] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2004 is OK: OK - running: The system is fully operational
[00:07:35] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:39:35] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 27 probes of 283 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[01:44:35] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 11 probes of 283 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[02:22:08] <logmsgbot>	 !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.1) (duration: 07m 53s)
[02:22:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:28:08] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Fri May 19 02:28:08 UTC 2017 (duration 6m 0s)
[02:28:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:09:05] <icinga-wm>	 PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1119.10 Read Requests/Sec=6287.00 Write Requests/Sec=0.70 KBytes Read/Sec=29477.20 KBytes_Written/Sec=14.40
[04:18:05] <icinga-wm>	 RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=2.40 Read Requests/Sec=222.80 Write Requests/Sec=114.50 KBytes Read/Sec=3190.00 KBytes_Written/Sec=795.20
[05:04:25] <icinga-wm>	 PROBLEM - SSH on ms-be1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:04:35] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:04:35] <icinga-wm>	 PROBLEM - MD RAID on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:04:45] <icinga-wm>	 PROBLEM - configured eth on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:04:45] <icinga-wm>	 PROBLEM - puppet last run on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:05:15] <icinga-wm>	 RECOVERY - SSH on ms-be1019 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0)
[05:05:25] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1019 is OK: OK ferm input default policy is set
[05:05:26] <icinga-wm>	 RECOVERY - MD RAID on ms-be1019 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[05:05:35] <icinga-wm>	 RECOVERY - configured eth on ms-be1019 is OK: OK - interfaces up
[05:05:35] <icinga-wm>	 RECOVERY - puppet last run on ms-be1019 is OK: OK: Puppet is currently enabled, last run 6 minutes ago with 0 failures
[05:13:05] <icinga-wm>	 PROBLEM - Apache HTTP on mw1195 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time
[05:14:05] <icinga-wm>	 RECOVERY - Apache HTTP on mw1195 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.178 second response time
[05:56:34] <jynus>	 !log shutting down db2049 and preparing it for reimage
[05:56:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:40] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Repool db1060 [mediawiki-config] - https://gerrit.wikimedia.org/r/354381 (https://phabricator.wikimedia.org/T162611)
[06:05:29] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1060 [mediawiki-config] - https://gerrit.wikimedia.org/r/354381 (https://phabricator.wikimedia.org/T162611) (owner: Marostegui)
[06:06:37] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Repool db1060 [mediawiki-config] - https://gerrit.wikimedia.org/r/354381 (https://phabricator.wikimedia.org/T162611) (owner: Marostegui)
[06:06:46] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Repool db1060 [mediawiki-config] - https://gerrit.wikimedia.org/r/354381 (https://phabricator.wikimedia.org/T162611) (owner: Marostegui)
[06:07:39] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1060 - T162611 (duration: 00m 40s)
[06:07:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:47] <stashbot>	 T162611: Unify revision table on s2 - https://phabricator.wikimedia.org/T162611
[06:08:32] <wikibugs>	 (PS1) Jcrespo: mariadb: allow reimage of db1049 for jessie upgrade [puppet] - https://gerrit.wikimedia.org/r/354382
[06:08:59] <wikibugs>	 (CR) Marostegui: "db1049 or db2049? :-)" [puppet] - https://gerrit.wikimedia.org/r/354382 (owner: Jcrespo)
[06:09:44] <marostegui>	 !log Deploy alter table s2.revision table - db1018 - https://phabricator.wikimedia.org/T162611
[06:09:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:36] <wikibugs>	 (PS2) Jcrespo: mariadb: allow reimage of db2049 for jessie upgrade [puppet] - https://gerrit.wikimedia.org/r/354382
[06:10:52] <wikibugs>	 (CR) Marostegui: [C: 1] mariadb: allow reimage of db2049 for jessie upgrade [puppet] - https://gerrit.wikimedia.org/r/354382 (owner: Jcrespo)
[06:11:22] <wikibugs>	 (CR) Jcrespo: "This was totally a test to see if you were paying attention- because of course I never make mistakes." [puppet] - https://gerrit.wikimedia.org/r/354382 (owner: Jcrespo)
[06:11:33] <wikibugs>	 (CR) Jcrespo: ":-)" [puppet] - https://gerrit.wikimedia.org/r/354382 (owner: Jcrespo)
[06:11:57] <wikibugs>	 (CR) Marostegui: [C: 1] "hahahaha" [puppet] - https://gerrit.wikimedia.org/r/354382 (owner: Jcrespo)
[06:12:00] <wikibugs>	 (CR) Jcrespo: [C: 2] mariadb: allow reimage of db2049 for jessie upgrade [puppet] - https://gerrit.wikimedia.org/r/354382 (owner: Jcrespo)
[06:14:01] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1051 [mediawiki-config] - https://gerrit.wikimedia.org/r/354383 (https://phabricator.wikimedia.org/T159753)
[06:15:42] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1051 [mediawiki-config] - https://gerrit.wikimedia.org/r/354383 (https://phabricator.wikimedia.org/T159753) (owner: Marostegui)
[06:16:42] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1051 [mediawiki-config] - https://gerrit.wikimedia.org/r/354383 (https://phabricator.wikimedia.org/T159753) (owner: Marostegui)
[06:16:50] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1051 [mediawiki-config] - https://gerrit.wikimedia.org/r/354383 (https://phabricator.wikimedia.org/T159753) (owner: Marostegui)
[06:23:33] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1051 - T159753 T164530 (duration: 00m 39s)
[06:23:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:23:43] <stashbot>	 T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[06:23:43] <stashbot>	 T164530: Deploy uniqueness constraints on ores_classification table  - https://phabricator.wikimedia.org/T164530
[06:29:15] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational
[06:31:15] <wikibugs>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1051" [mediawiki-config] - https://gerrit.wikimedia.org/r/354384
[06:32:15] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:32:50] <wikibugs>	 (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1051" [mediawiki-config] - https://gerrit.wikimedia.org/r/354384 (owner: Marostegui)
[06:33:50] <wikibugs>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1051" [mediawiki-config] - https://gerrit.wikimedia.org/r/354384 (owner: Marostegui)
[06:34:35] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2004 is OK: OK - running: The system is fully operational
[06:34:42] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1051 - T159753 T164530 (duration: 00m 38s)
[06:34:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:51] <stashbot>	 T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[06:34:51] <stashbot>	 T164530: Deploy uniqueness constraints on ores_classification table  - https://phabricator.wikimedia.org/T164530
[06:35:43] <wikibugs>	 (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1051" [mediawiki-config] - https://gerrit.wikimedia.org/r/354384 (owner: Marostegui)
[06:37:35] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:51:29] <moritzm>	 !log installing openjdk-7/trusty regression update
[06:51:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:05] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:57:55] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1020 is OK: OK ferm input default policy is set
[06:59:45] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp4015 is CRITICAL: CRITICAL: expiry mailbox lag is 2008139
[07:16:45] <icinga-wm>	 PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:21:15] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational
[07:21:15] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2001 is OK: OK - running: The system is fully operational
[07:24:15] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:24:15] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:29:15] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2001 is OK: OK - running: The system is fully operational
[07:30:35] <icinga-wm>	 RECOVERY - puppet last run on kubernetes2001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[07:35:15] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational
[07:35:35] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2004 is OK: OK - running: The system is fully operational
[07:35:45] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2002 is OK: OK - running: The system is fully operational
[07:36:08] <akosiaris>	 !log reboot kubernetes2001 for tests
[07:36:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:35] <icinga-wm>	 PROBLEM - salt-minion processes on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:36:36] <icinga-wm>	 PROBLEM - swift-account-server on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:36:36] <icinga-wm>	 PROBLEM - swift-container-server on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:36:36] <icinga-wm>	 RECOVERY - puppet last run on kubernetes2004 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[07:36:36] <icinga-wm>	 RECOVERY - puppet last run on kubernetes2003 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[07:36:36] <icinga-wm>	 RECOVERY - puppet last run on kubernetes2002 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[07:37:25] <icinga-wm>	 RECOVERY - swift-container-server on ms-be1020 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[07:37:25] <icinga-wm>	 RECOVERY - salt-minion processes on ms-be1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:37:26] <icinga-wm>	 RECOVERY - swift-account-server on ms-be1020 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[07:37:35] <icinga-wm>	 PROBLEM - Host kubernetes2001 is DOWN: PING CRITICAL - Packet loss = 100%
[07:37:55] <icinga-wm>	 RECOVERY - Host kubernetes2001 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms
[07:43:59] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: codfw: kubernetes200[1-4] racking and onsite setup task - https://phabricator.wikimedia.org/T164851#3274706 (akosiaris) And hosts are now fully up and running, will resolve this. Thanks @Papaul
[07:44:14] <wikibugs>	 Operations, ops-codfw, Patch-For-Review: codfw: kubernetes200[1-4] racking and onsite setup task - https://phabricator.wikimedia.org/T164851#3274707 (akosiaris) Open>Resolved
[07:45:45] <icinga-wm>	 RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[07:49:45] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp4015 is OK: OK: expiry mailbox lag is 2
[08:06:23] <wikibugs>	 (PS1) Jcrespo: mariadb: Depool db2048 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/354386
[08:07:31] <wikibugs>	 (CR) Marostegui: [C: 1] mariadb: Depool db2048 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/354386 (owner: Jcrespo)
[08:08:40] <wikibugs>	 Operations, HHVM: Nutcracker doesn't start at boot - https://phabricator.wikimedia.org/T163795#3274759 (MoritzMuehlenhoff) I dug a little deeper and this turned out to be a subtle packaging bug / debhelper oddity:  The current debian/rules file uses: ``` dh $@  --with autoreconf --with-systemd  ``` Looks...
[08:10:23] <wikibugs>	 (CR) Jcrespo: [C: 2] mariadb: Depool db2048 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/354386 (owner: Jcrespo)
[08:11:32] <wikibugs>	 (Merged) jenkins-bot: mariadb: Depool db2048 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/354386 (owner: Jcrespo)
[08:11:43] <wikibugs>	 (CR) jenkins-bot: mariadb: Depool db2048 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/354386 (owner: Jcrespo)
[08:14:31] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2048 for reimage (duration: 00m 39s)
[08:14:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:05] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:38:25] <icinga-wm>	 PROBLEM - SSH on ms-be1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:38:55] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1020 is OK: OK ferm input default policy is set
[08:39:04] <wikibugs>	 (PS1) Jcrespo: mariadb: allow reimage of db2048 for upgrade to jessie [puppet] - https://gerrit.wikimedia.org/r/354388
[08:39:15] <icinga-wm>	 RECOVERY - SSH on ms-be1020 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0)
[08:47:23] <wikibugs>	 (PS1) Jforrester: Enable TimedMediaHandler's new video player Beta Feature in Labs [mediawiki-config] - https://gerrit.wikimedia.org/r/354389 (https://phabricator.wikimedia.org/T148103)
[08:47:25] <wikibugs>	 (PS1) Jforrester: Enable TimedMediaHandler's new video player Beta Feature [mediawiki-config] - https://gerrit.wikimedia.org/r/354390 (https://phabricator.wikimedia.org/T148103)
[08:58:57] <wikibugs>	 (CR) Jcrespo: [C: -1] "Not yet- issues on db2049 reimage :-/" [puppet] - https://gerrit.wikimedia.org/r/354388 (owner: Jcrespo)
[09:01:23] <logmsgbot>	 !log reedy@tin Synchronized php-1.30.0-wmf.1/extensions/WikimediaMaintenance/makeSizeDBLists.php: Catch a silly error (duration: 00m 39s)
[09:01:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:15] <wikibugs>	 (PS1) Reedy: Update dblists! [mediawiki-config] - https://gerrit.wikimedia.org/r/354435
[09:11:25] <Reedy>	 Amazing
[09:11:36] <Reedy>	 1 wiki moved to large
[09:11:49] <Reedy>	 One moved medium to small
[09:11:55] <Reedy>	 Or 2 even?
[09:11:56] <Reedy>	 Some missing?
[09:12:18] <Reedy>	 30+ moving to medium from small
[09:12:51] <bawolff_>	 ITS GROWING!!!
[09:13:03] <apergos>	 what is?
[09:13:12] <bawolff_>	 Wikis
[09:13:20] <wikibugs>	 (CR) Reedy: [C: 2] Update dblists! [mediawiki-config] - https://gerrit.wikimedia.org/r/354435 (owner: Reedy)
[09:13:45] <apergos>	 did I miss something while I was disconnected?
[09:13:47] <apergos>	 oh!
[09:13:53] <Reedy>	 heh
[09:14:36] <apergos>	 what are your large medium and small lists used for?
[09:14:50] <wikibugs>	 (Merged) jenkins-bot: Update dblists! [mediawiki-config] - https://gerrit.wikimedia.org/r/354435 (owner: Reedy)
[09:14:54] <Reedy>	 We use them for some cronjob stuffs
[09:15:48] <logmsgbot>	 !log reedy@tin Synchronized dblists/: Update size dblists (duration: 00m 39s)
[09:15:53] <wikibugs>	 (CR) jenkins-bot: Update dblists! [mediawiki-config] - https://gerrit.wikimedia.org/r/354435 (owner: Reedy)
[09:15:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:07] <p858snake>	 Time to turn off miser mode *runs* >.> <.<
[09:24:03] <hashar>	 Nemo_bis: hello. Why are you reverting all l10n updates???
[09:24:40] <hashar>	 Nemo_bis: that overloads CI and the few reverts I have seen revert bunch of other strings
[09:25:03] <hashar>	 Nemo_bis: probably nicer to just fix l10n-bot and wait for tonight for it to magically update the repos
[09:38:43] <wikibugs>	 (PS2) Jcrespo: mariadb: allow reimage of db2048 for upgrade to jessie [puppet] - https://gerrit.wikimedia.org/r/354388
[09:38:45] <wikibugs>	 (PS1) Jcrespo: mariadb: Test trusty install on db2049 to confirm hw issues [puppet] - https://gerrit.wikimedia.org/r/354438
[09:39:49] <jynus>	 Reedy, where is cebwiki?
[09:40:17] <jynus>	 because for some table, it could be the largest wiki of all (templatelinks)
[09:41:32] <tto>	 jynus, I'm not reedy, but cebwiki is in large.dblist
[09:41:56] <jynus>	 good
[09:42:05] <tto>	 :)
[09:43:00] <jynus>	 it is nice for once to have lots of people in my timezone :-)
[09:43:42] <wikibugs>	 (CR) Jcrespo: [C: 2] mariadb: Test trusty install on db2049 to confirm hw issues [puppet] - https://gerrit.wikimedia.org/r/354438 (owner: Jcrespo)
[09:44:05] <wikibugs>	 (PS2) Jcrespo: mariadb: Test trusty install on db2049 to confirm hw issues [puppet] - https://gerrit.wikimedia.org/r/354438
[09:49:06] <wikibugs>	 (PS3) Thcipriani: Scap3: deploy jobrunner with scap3 [puppet] - https://gerrit.wikimedia.org/r/354186 (https://phabricator.wikimedia.org/T129148)
[09:51:05] <wikibugs>	 (PS4) Thcipriani: Scap3: deploy jobrunner with scap3 [puppet] - https://gerrit.wikimedia.org/r/354186 (https://phabricator.wikimedia.org/T129148)
[09:51:43] <greg-g>	 jynus: don't get used to it ;)
[09:53:39] <wikibugs>	 (CR) Thcipriani: [C: 1] "Deployed in beta, puppet compiler happy: https://puppet-compiler.wmflabs.org/6488/ should be ready to whenever" [puppet] - https://gerrit.wikimedia.org/r/354186 (https://phabricator.wikimedia.org/T129148) (owner: Thcipriani)
[10:05:49] <wikibugs>	 (PS1) Volans: CLI: add -i/--interactive option [software/cumin] - https://gerrit.wikimedia.org/r/354442
[10:07:08] <moritzm>	 !log rebooting mw2220/mw2221 for update to Linux 4.9 / HHVM 3.18 / nutcracker tests
[10:07:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:28] <_joe_>	 !log moved stale repos to /srv/deployment/STALE on tin, T129290
[10:08:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:36] <stashbot>	 T129290: [keyresult] Migrate remaining trebuchet deployed services - https://phabricator.wikimedia.org/T129290
[10:14:55] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [140.0]
[10:16:36] <wikibugs>	 Operations, Goal, kubernetes: Assigning IP space for kubernetes pod IPs - https://phabricator.wikimedia.org/T165732#3275340 (akosiaris)
[10:17:15] <hashar>	 Zuul Gearman alert is legit.  There is a large amount of mediawiki extensions changes in the pipes right now
[10:18:27] <wikibugs>	 Operations, Technical-Debt: Supersede RT tickets references - https://phabricator.wikimedia.org/T165733#3275388 (Dereckson)
[10:18:35] <icinga-wm>	 ACKNOWLEDGEMENT - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [140.0] amusso Lot of mw extensions l10n reverts going on.
[10:19:49] <moritzm>	 !log powercycling mw2221, stuck in reboot and serial console unresponsive
[10:19:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:10] <wikibugs>	 (CR) Jcrespo: [C: 2] mariadb: allow reimage of db2048 for upgrade to jessie [puppet] - https://gerrit.wikimedia.org/r/354388 (owner: Jcrespo)
[10:20:33] <wikibugs>	 Operations, Goal, kubernetes: Assigning IP space for kubernetes pod IPs - https://phabricator.wikimedia.org/T165732#3275406 (akosiaris)
[10:20:56] <jynus>	 arg, I merged the wrong patch
[10:20:56] <p858snake>	 hashar: from extremely vague memory the changes need to be reverted for it to work compared to just doing the fixes on translatewiki
[10:21:05] <wikibugs>	 (PS1) Jcrespo: Revert "mariadb: allow reimage of db2048 for upgrade to jessie" [puppet] - https://gerrit.wikimedia.org/r/354445
[10:21:16] <hashar>	 p858snake: so I guess there is a good reason for the revert :D
[10:21:21] <wikibugs>	 (PS3) Jcrespo: mariadb: Test trusty install on db2049 to confirm hw issues [puppet] - https://gerrit.wikimedia.org/r/354438
[10:22:36] <wikibugs>	 (Abandoned) Dereckson: toollabs: use UNIX agnostic shebang [puppet] - https://gerrit.wikimedia.org/r/327709 (owner: Dereckson)
[10:23:33] <wikibugs>	 Operations, ops-codfw: mw2221 stuck after reboot - https://phabricator.wikimedia.org/T165734#3275446 (MoritzMuehlenhoff)
[10:23:39] <wikibugs>	 Operations, ops-codfw: mw2221 stuck after reboot - https://phabricator.wikimedia.org/T165734#3275460 (MoritzMuehlenhoff) p:Triage>Normal
[10:24:39] <icinga-wm>	 ACKNOWLEDGEMENT - Host mw2221 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T165734
[10:30:56] <wikibugs>	 Operations, Puppet, Patch-For-Review, RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#3275482 (Ottomata) I have a question about the new profile guidelines:  > Profile classes should only have parameters that default to an explicit hier...
[10:31:35] <ebernhardson>	 !log restarting elasticsearch on relforge1001 to pull in remote reindex
[10:31:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:33] <wikibugs>	 Operations, Analytics, Traffic, User-Elukey: Update Varnishkafka to support TLS encryption/authentication - https://phabricator.wikimedia.org/T165736#3275487 (elukey)
[10:33:54] <wikibugs>	 Operations, Analytics, Traffic, User-Elukey: Update Varnishkafka to support TLS encryption/authentication - https://phabricator.wikimedia.org/T165736#3275506 (elukey)
[10:34:02] <wikibugs>	 (CR) Jcrespo: [V: 2 C: 2] mariadb: Test trusty install on db2049 to confirm hw issues [puppet] - https://gerrit.wikimedia.org/r/354438 (owner: Jcrespo)
[10:34:26] <wikibugs>	 (CR) Jcrespo: [V: 2 C: 2] Revert "mariadb: allow reimage of db2048 for upgrade to jessie" [puppet] - https://gerrit.wikimedia.org/r/354445 (owner: Jcrespo)
[10:34:30] <wikibugs>	 (PS2) Jcrespo: Revert "mariadb: allow reimage of db2048 for upgrade to jessie" [puppet] - https://gerrit.wikimedia.org/r/354445
[10:34:34] <wikibugs>	 (CR) Jcrespo: [V: 2 C: 2] Revert "mariadb: allow reimage of db2048 for upgrade to jessie" [puppet] - https://gerrit.wikimedia.org/r/354445 (owner: Jcrespo)
[10:41:35] <wikibugs>	 (PS1) Jcrespo: Revert "Revert "mariadb: allow reimage of db2048 for upgrade to jessie"" [puppet] - https://gerrit.wikimedia.org/r/354448
[10:41:52] <wikibugs>	 (CR) Jcrespo: [C: -2] "Not ready yet." [puppet] - https://gerrit.wikimedia.org/r/354448 (owner: Jcrespo)
[10:46:43] <wikibugs>	 Operations, Analytics, Analytics-Cluster, Traffic, User-Elukey: Encrypt Kafka traffic, and restrict access via ACLs - https://phabricator.wikimedia.org/T121561#3275543 (elukey)
[10:49:53] <wikibugs>	 (PS1) Elukey: [WIP] Refactor zookeeper roles to profiles [puppet] - https://gerrit.wikimedia.org/r/354449
[10:54:39] <wikibugs>	 Operations, HHVM: Nutcracker doesn't start at boot - https://phabricator.wikimedia.org/T163795#3275550 (MoritzMuehlenhoff) The new package fixes that (tested by rebooting two servers with and without the new package:)  root@mw2222:~# dpkg -l nutcracker ii  nutcracker                           0.4.1-1+wm3...
[11:05:09] <moritzm>	 !log uploaded nutcracker 0.4.1-1+wm3~jessie1 to apt.wikimedia.org (T163795)
[11:05:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:19] <stashbot>	 T163795: Nutcracker doesn't start at boot - https://phabricator.wikimedia.org/T163795
[11:06:29] <wikibugs>	 (PS3) Alexandros Kosiaris: Assign the kubernetes pod IPs in DNS [dns] - https://gerrit.wikimedia.org/r/341794 (https://phabricator.wikimedia.org/T165732)
[11:11:31] <wikibugs>	 (CR) Alexandros Kosiaris: [C: 1] "And with some more change now we have:" [dns] - https://gerrit.wikimedia.org/r/341794 (https://phabricator.wikimedia.org/T165732) (owner: Alexandros Kosiaris)
[11:11:35] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:12:25] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1019 is OK: OK ferm input default policy is set
[11:12:56] <wikibugs>	 Operations, Goal, Patch-For-Review, kubernetes: Assigning IP space for kubernetes pod IPs - https://phabricator.wikimedia.org/T165732#3275571 (akosiaris) With the patch above we have:  * production clusters (codfw + eqiad) IPv4, IPv6 pod IPs assigned * staging cluster (eqiad) IPv4, IPv6 pod IPs a...
[11:16:31] <wikibugs>	 (PS1) Elukey: Remove any reference of mc1001->mc1018 for decom [puppet] - https://gerrit.wikimedia.org/r/354453 (https://phabricator.wikimedia.org/T164341)
[11:17:07] <wikibugs>	 Operations, ops-codfw: db2049 cannot install jessie - let's try upgrading the firmware first - https://phabricator.wikimedia.org/T165739#3275580 (jcrespo)
[11:18:06] <wikibugs>	 Operations, ops-codfw: db2049 cannot install jessie - let's try upgrading the firmware first - https://phabricator.wikimedia.org/T165739#3275594 (jcrespo) We tested installing trusty and it worked ok.  > <moritzm> it might a case of the newer kernel having more stringent checks, which expose some hardwar...
[11:31:03] <wikibugs>	 Operations, ops-codfw: db2049 cannot install jessie - let's try upgrading the firmware first - https://phabricator.wikimedia.org/T165739#3275616 (jcrespo) No relevant recent logs:   ``` 7  Caution POST Message 11/16/2016 17:07 11/16/2016 17:07 1 POST Error: 1792-Slot X Drive Array - Valid Data Found in C...
[11:32:55] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0]
[11:39:03] <marostegui>	 !log Deploy alter table s2.revision table on labsdb1003 - T162611
[11:39:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:12] <stashbot>	 T162611: Unify revision table on s2 - https://phabricator.wikimedia.org/T162611
[11:56:32] <wikibugs>	 Puppet, Beta-Cluster-Infrastructure, Labs, Release-Engineering-Team (Next), User-Joe: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675#3275656 (greg)
[12:02:54] <wikibugs>	 (PS2) Giuseppe Lavagetto: Add netlink-based Ipvsmanager implementation [WiP] [debs/pybal] - https://gerrit.wikimedia.org/r/302882
[12:10:58] <wikibugs>	 (PS5) Thcipriani: Scap3: deploy jobrunner with scap3 [puppet] - https://gerrit.wikimedia.org/r/354186 (https://phabricator.wikimedia.org/T129148)
[12:15:57] <wikibugs>	 (PS6) Thcipriani: Scap3: deploy jobrunner with scap3 [puppet] - https://gerrit.wikimedia.org/r/354186 (https://phabricator.wikimedia.org/T129148)
[12:16:55] <wikibugs>	 (PS2) Filippo Giunchedi: prometheus: report puppet agent stats [puppet] - https://gerrit.wikimedia.org/r/354007
[12:16:58] <wikibugs>	 (PS1) Filippo Giunchedi: base: report prometheus agent stats [puppet] - https://gerrit.wikimedia.org/r/354457
[12:35:31] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: add alertmanager_url to prometheus server [puppet] - https://gerrit.wikimedia.org/r/354459
[12:35:33] <wikibugs>	 (PS1) Filippo Giunchedi: role: use alertmanager in beta prometheus [puppet] - https://gerrit.wikimedia.org/r/354460
[12:36:07] <wikibugs>	 (PS1) Marostegui: db-codfw.php: Depool db2068 [mediawiki-config] - https://gerrit.wikimedia.org/r/354462 (https://phabricator.wikimedia.org/T165743)
[12:37:46] <wikibugs>	 (CR) Marostegui: [C: 2] db-codfw.php: Depool db2068 [mediawiki-config] - https://gerrit.wikimedia.org/r/354462 (https://phabricator.wikimedia.org/T165743) (owner: Marostegui)
[12:40:08] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2068 - T165743 (duration: 00m 40s)
[12:40:10] <marostegui>	 !log Deploy alter table s7.frwiktionary on db2068 - T165743
[12:40:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:17] <stashbot>	 T165743: frwiktionary on s7 still needs fixing on the revision table - https://phabricator.wikimedia.org/T165743
[12:40:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:01] <wikibugs>	 (PS1) Giuseppe Lavagetto: Add notes on branching/releasing [debs/pybal] - https://gerrit.wikimedia.org/r/354464
[12:42:57] <_joe_>	 ema: ^^
[12:50:31] <wikibugs>	 (PS1) Thcipriani: Deployment via scap3 [software/logstash/plugins] - https://gerrit.wikimedia.org/r/354466 (https://phabricator.wikimedia.org/T165748)
[12:51:13] <Zppix>	 jouncebot:  refresh
[12:51:15] <jouncebot>	 I refreshed my knowledge about deployments.
[12:53:42] <wikibugs>	 (PS1) Marostegui: Revert "db-codfw.php: Depool db2068" [mediawiki-config] - https://gerrit.wikimedia.org/r/354467
[12:56:44] <wikibugs>	 (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool db2068" [mediawiki-config] - https://gerrit.wikimedia.org/r/354467 (owner: Marostegui)
[12:57:49] <wikibugs>	 (Merged) jenkins-bot: Revert "db-codfw.php: Depool db2068" [mediawiki-config] - https://gerrit.wikimedia.org/r/354467 (owner: Marostegui)
[12:57:59] <wikibugs>	 (CR) jenkins-bot: Revert "db-codfw.php: Depool db2068" [mediawiki-config] - https://gerrit.wikimedia.org/r/354467 (owner: Marostegui)
[12:58:39] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2068 - T165743 (duration: 00m 39s)
[12:58:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:47] <stashbot>	 T165743: frwiktionary on s7 still needs fixing on the revision table - https://phabricator.wikimedia.org/T165743
[13:08:33] <wikibugs>	 (PS1) Thcipriani: Scap3: deploy logstash/plugins with scap3 [puppet] - https://gerrit.wikimedia.org/r/354472 (https://phabricator.wikimedia.org/T165748)
[13:09:19] <moritzm>	 !log downgraded mw1161 to HHVM 3.12 (crashes often compared to app servers, downgrade over the weekend)
[13:09:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:35] <icinga-wm>	 PROBLEM - puppet last run on ms-be1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:17:07] <wikibugs>	 (PS1) DCausse: [wikitech] Increase weight on Tool and Nova Resource ns [mediawiki-config] - https://gerrit.wikimedia.org/r/354474 (https://phabricator.wikimedia.org/T165725)
[13:26:01] <wikibugs>	 (PS1) Marostegui: db-codfw.php: Depool db2047 [mediawiki-config] - https://gerrit.wikimedia.org/r/354476 (https://phabricator.wikimedia.org/T165743)
[13:27:15] <wikibugs>	 (CR) Ema: [C: 1] "Looks good!" [debs/pybal] - https://gerrit.wikimedia.org/r/354464 (owner: Giuseppe Lavagetto)
[13:28:53] <wikibugs>	 (CR) Marostegui: [C: 2] db-codfw.php: Depool db2047 [mediawiki-config] - https://gerrit.wikimedia.org/r/354476 (https://phabricator.wikimedia.org/T165743) (owner: Marostegui)
[13:30:41] <wikibugs>	 (Merged) jenkins-bot: db-codfw.php: Depool db2047 [mediawiki-config] - https://gerrit.wikimedia.org/r/354476 (https://phabricator.wikimedia.org/T165743) (owner: Marostegui)
[13:30:52] <wikibugs>	 (CR) jenkins-bot: db-codfw.php: Depool db2047 [mediawiki-config] - https://gerrit.wikimedia.org/r/354476 (https://phabricator.wikimedia.org/T165743) (owner: Marostegui)
[13:31:32] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2047 - T165743 (duration: 00m 39s)
[13:31:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:40] <stashbot>	 T165743: frwiktionary on s7 still needs fixing on the revision table - https://phabricator.wikimedia.org/T165743
[13:34:50] <wikibugs>	 (PS2) Giuseppe Lavagetto: Add notes on branching/releasing [debs/pybal] - https://gerrit.wikimedia.org/r/354464
[13:39:35] <icinga-wm>	 RECOVERY - puppet last run on ms-be1030 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[13:41:05] <wikibugs>	 (PS1) Giuseppe Lavagetto: Change the default LVS BGP behavior per service [debs/pybal] - https://gerrit.wikimedia.org/r/354480
[13:41:07] <wikibugs>	 (PS1) Giuseppe Lavagetto: Add unit tests for DNSQueryMonitoringProtocol [debs/pybal] - https://gerrit.wikimedia.org/r/354481
[13:47:14] <marostegui>	 !log Deploy alter table s7.frwiktionary db1033 - T165743
[13:47:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:22] <stashbot>	 T165743: frwiktionary on s7 still needs fixing on the revision table - https://phabricator.wikimedia.org/T165743
[13:47:56] <wikibugs>	 (PS1) Chad: gerrit (2.13.8+git1-wmf.1) jessie-wikimedia; urgency=medium [debs/gerrit] - https://gerrit.wikimedia.org/r/354485
[13:54:43] <wikibugs>	 (PS1) Marostegui: db-codfw.php: Repool db2047 [mediawiki-config] - https://gerrit.wikimedia.org/r/354487 (https://phabricator.wikimedia.org/T165743)
[13:56:03] <wikibugs>	 (CR) Marostegui: [C: 2] db-codfw.php: Repool db2047 [mediawiki-config] - https://gerrit.wikimedia.org/r/354487 (https://phabricator.wikimedia.org/T165743) (owner: Marostegui)
[13:57:00] <wikibugs>	 (Merged) jenkins-bot: db-codfw.php: Repool db2047 [mediawiki-config] - https://gerrit.wikimedia.org/r/354487 (https://phabricator.wikimedia.org/T165743) (owner: Marostegui)
[13:57:09] <wikibugs>	 (CR) jenkins-bot: db-codfw.php: Repool db2047 [mediawiki-config] - https://gerrit.wikimedia.org/r/354487 (https://phabricator.wikimedia.org/T165743) (owner: Marostegui)
[13:58:03] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2047 - T165743 (duration: 00m 38s)
[13:58:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:13] <stashbot>	 T165743: frwiktionary on s7 still needs fixing on the revision table - https://phabricator.wikimedia.org/T165743
[13:58:21] <wikibugs>	 (PS2) Chad: gerrit (2.13.8+git1-wmf.1) jessie-wikimedia; urgency=medium [debs/gerrit] - https://gerrit.wikimedia.org/r/354485 (https://phabricator.wikimedia.org/T158946)
[14:01:10] <wikibugs>	 (PS1) Hoo man: Use kill -- -$$ to kill a process group in dumpwikidata scripts [puppet] - https://gerrit.wikimedia.org/r/354489
[14:04:08] <wikibugs>	 Operations, Pybal, Traffic: Fully-redundant LVS clusters using Pybal per-service MED feature - https://phabricator.wikimedia.org/T165764#3276237 (BBlack)
[14:09:56] <wikibugs>	 (PS1) Hoo man: dumpWikidata: Make the minimum shard size depend on the number of shards [puppet] - https://gerrit.wikimedia.org/r/354494
[14:10:36] <icinga-wm>	 PROBLEM - puppet last run on ms-be1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:11:45] <icinga-wm>	 PROBLEM - puppet last run on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:12:35] <icinga-wm>	 RECOVERY - puppet last run on ms-be1019 is OK: OK: Puppet is currently enabled, last run 13 minutes ago with 0 failures
[14:20:26] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.15 seconds
[14:20:52] <wikibugs>	 Operations, Traffic: Refactor pybal/LVS config for shared failover - https://phabricator.wikimedia.org/T165765#3276305 (BBlack)
[14:21:39] <wikibugs>	 Operations, Traffic: Refactor pybal/LVS config for shared failover - https://phabricator.wikimedia.org/T165765#3276336 (BBlack)
[14:22:15] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.0.36:9042 on restbase-dev1001 is CRITICAL: connect to address 10.64.0.36 and port 9042: Connection refused
[14:22:25] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 349.14 seconds
[14:22:35] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.0.36:7001 on restbase-dev1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[14:22:45] <urandom>	 got that ^^ 
[14:22:45] <icinga-wm>	 PROBLEM - Check systemd state on restbase-dev1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[14:23:45] <icinga-wm>	 RECOVERY - Check systemd state on restbase-dev1001 is OK: OK - running: The system is fully operational
[14:24:15] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.0.36:9042 on restbase-dev1001 is OK: TCP OK - 0.036 second response time on 10.64.0.36 port 9042
[14:24:36] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.0.36:7001 on restbase-dev1001 is OK: SSL OK - Certificate restbase-dev1001-a valid until 2018-01-05 22:53:02 +0000 (expires in 231 days)
[14:30:49] <wikibugs>	 (CR) Paladox: [C: 1] gerrit (2.13.8+git1-wmf.1) jessie-wikimedia; urgency=medium [debs/gerrit] - https://gerrit.wikimedia.org/r/354485 (https://phabricator.wikimedia.org/T158946) (owner: Chad)
[14:34:48] <wikibugs>	 (PS3) Giuseppe Lavagetto: Add notes on branching/releasing [debs/pybal] - https://gerrit.wikimedia.org/r/354464
[14:34:50] <wikibugs>	 (PS2) Giuseppe Lavagetto: Change the default LVS BGP behavior per service [debs/pybal] - https://gerrit.wikimedia.org/r/354480
[14:34:52] <wikibugs>	 (PS2) Giuseppe Lavagetto: Add unit tests for DNSQueryMonitoringProtocol [debs/pybal] - https://gerrit.wikimedia.org/r/354481
[14:38:25] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2037 is OK: OK slave_sql_lag Replication lag: 46.00 seconds
[14:38:25] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 324.24 seconds
[14:39:45] <icinga-wm>	 RECOVERY - puppet last run on ms-be1028 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[14:42:42] <wikibugs>	 (CR) ArielGlenn: [C: 2] Use kill -- -$$ to kill a process group in dumpwikidata scripts [puppet] - https://gerrit.wikimedia.org/r/354489 (owner: Hoo man)
[14:43:25] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 316.24 seconds
[14:43:28] <wikibugs>	 (CR) ArielGlenn: [C: 2] dumpWikidata: Make the minimum shard size depend on the number of shards [puppet] - https://gerrit.wikimedia.org/r/354494 (owner: Hoo man)
[14:43:38] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 2] Add notes on branching/releasing [debs/pybal] - https://gerrit.wikimedia.org/r/354464 (owner: Giuseppe Lavagetto)
[14:44:08] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 2] Change the default LVS BGP behavior per service [debs/pybal] - https://gerrit.wikimedia.org/r/354480 (owner: Giuseppe Lavagetto)
[14:45:25] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 0.19 seconds
[14:45:54] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 2] Add unit tests for DNSQueryMonitoringProtocol [debs/pybal] - https://gerrit.wikimedia.org/r/354481 (owner: Giuseppe Lavagetto)
[14:50:44] <wikibugs>	 Operations, DBA, Performance-Team, Traffic: Cache invalidations coming from the JobQueue are causing slowdown on masters and lag on several wikis, and impact on varnish - https://phabricator.wikimedia.org/T164173#3276493 (jcrespo) declined>Open This just happened again on s4.
[14:52:37] <wikibugs>	 Operations, DBA, Performance-Team, Traffic: Cache invalidations coming from the JobQueue are causing slowdown on masters and lag on several wikis, and impact on varnish - https://phabricator.wikimedia.org/T164173#3224448 (Marostegui) Some graphs that were shown while troubleshooting  https://grafa...
[14:53:35] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2102826
[14:54:28] <wikibugs>	 Operations, DBA, Performance-Team, Traffic: Cache invalidations coming from the JobQueue are causing slowdown on masters and lag on several wikis, and impact on varnish - https://phabricator.wikimedia.org/T164173#3276528 (jcrespo) p:Low>Triage This is probably not user-requested invalidat...
[15:05:41] <wikibugs>	 (PS1) Ema: Bump version number in setup.py [debs/pybal] - https://gerrit.wikimedia.org/r/354502
[15:07:01] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 2] Bump version number in setup.py [debs/pybal] - https://gerrit.wikimedia.org/r/354502 (owner: Ema)
[15:08:27] <wikibugs>	 (Abandoned) Ema: Bump version number in setup.py [debs/pybal] (1.13) - https://gerrit.wikimedia.org/r/354180 (owner: Ema)
[15:08:38] <wikibugs>	 (PS1) Giuseppe Lavagetto: Split IPVS Manager into the interface and manager implementation [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/354506
[15:08:40] <wikibugs>	 (PS1) Giuseppe Lavagetto: Add IPVSError as a generic IPVS-related exception [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/354507
[15:08:41] <wikibugs>	 (PS1) Giuseppe Lavagetto: Add generic Finite States Machine [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/354508
[15:08:44] <wikibugs>	 (PS1) Giuseppe Lavagetto: Add netlink-based Ipvsmanager implementation [WiP] [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/354509
[15:08:48] <wikibugs>	 (Merged) jenkins-bot: Bump version number in setup.py [debs/pybal] - https://gerrit.wikimedia.org/r/354502 (owner: Ema)
[15:09:18] <wikibugs>	 (Abandoned) Giuseppe Lavagetto: Split IPVS Manager into the interface and manager implementation [debs/pybal] - https://gerrit.wikimedia.org/r/302434 (owner: Giuseppe Lavagetto)
[15:16:28] <wikibugs>	 Operations, DBA, Performance-Team, Traffic: Cache invalidations coming from the JobQueue are causing slowdown on masters and lag on several wikis, and impact on varnish - https://phabricator.wikimedia.org/T164173#3276584 (jcrespo) Without entering on heavy rearchitectures, we should, and probably...
[15:23:40] <wikibugs>	 (PS1) Andrew Bogott: openstackclients:  add an optional project arg to allinstances() [puppet] - https://gerrit.wikimedia.org/r/354515
[15:23:42] <wikibugs>	 (PS1) Andrew Bogott: novastats:  Update some reports to use more up-to-date code. [puppet] - https://gerrit.wikimedia.org/r/354516
[15:24:09] <wikibugs>	 (PS2) Ema: Split IPVS Manager into the interface and manager implementation [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/354506 (owner: Giuseppe Lavagetto)
[15:26:37] <wikibugs>	 Operations, MediaWiki-Cache, MediaWiki-JobQueue, Performance-Team, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#3276610 (BBlack) Resolved>Open Not resolved, as the purge graphs can attest!
[15:26:42] <wikibugs>	 Operations, Traffic: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#3276612 (BBlack)
[15:26:46] <wikibugs>	 (CR) jerkins-bot: [V: -1] novastats:  Update some reports to use more up-to-date code. [puppet] - https://gerrit.wikimedia.org/r/354516 (owner: Andrew Bogott)
[15:36:10] <wikibugs>	 (PS2) Dzahn: Planet: Delete sr.planet [puppet] - https://gerrit.wikimedia.org/r/350242 (owner: Chad)
[15:36:45] <wikibugs>	 Operations, ops-eqiad, Labs: rack/setup/install labnet100[34] - https://phabricator.wikimedia.org/T165779#3276633 (RobH)
[15:37:15] <wikibugs>	 Operations, DBA, Performance-Team, Traffic: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3276650 (jcrespo)
[15:38:29] <wikibugs>	 (CR) Dzahn: [C: 2] Planet: Delete sr.planet [puppet] - https://gerrit.wikimedia.org/r/350242 (owner: Chad)
[15:39:06] <wikibugs>	 Operations, Labs, hardware-requests: Eqiad: (2) hardware access request for labnet1003/1004 - https://phabricator.wikimedia.org/T158204#3276658 (RobH) Open>Resolved These systems have been ordered on T163822 and installation will progress on T165779.  Resolving this request, as it's being gran...
[15:40:58] <mutante>	 !log planet10001 - manually deleting cron job for deleted sr.planet (should puppetize the "absence" too)
[15:41:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:42:02] <wikibugs>	 (PS2) Dzahn: Drop sr.planet from dns, it's moribund [dns] - https://gerrit.wikimedia.org/r/350243 (owner: Chad)
[15:45:01] <wikibugs>	 (CR) Dzahn: [C: 2] Drop sr.planet from dns, it's moribund [dns] - https://gerrit.wikimedia.org/r/350243 (owner: Chad)
[15:52:21] <wikibugs>	 Operations, ops-eqiad, Labs, Labs-Infrastructure: rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#3276700 (RobH)
[15:52:55] <wikibugs>	 Operations, Labs, hardware-requests: Eqiad: (2) hardware access request for labcontrol1003/1004 - https://phabricator.wikimedia.org/T158207#3276717 (RobH) Open>Resolved These systems have been ordered on T163031 and will be set up on T165781.
[15:53:15] <Dereckson>	 mutante: RainbowSprinkles: you also dropped it from puppet repo?
[15:53:29] <wikibugs>	 (PS2) Ema: Add IPVSError as a generic IPVS-related exception [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/354507 (owner: Giuseppe Lavagetto)
[15:55:19] <wikibugs>	 (PS2) Ema: Add generic Finite States Machine [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/354508 (owner: Giuseppe Lavagetto)
[15:55:28] <wikibugs>	 (PS2) Ema: Add netlink-based Ipvsmanager implementation [WiP] [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/354509 (owner: Giuseppe Lavagetto)
[15:56:51] <mutante>	 Dereckson: yes, just merged that up there a few minutes earlier
[15:57:28] <Dereckson>	 ok
[15:58:11] <wikibugs>	 (CR) Ema: [C: 1] Split IPVS Manager into the interface and manager implementation [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/354506 (owner: Giuseppe Lavagetto)
[15:58:21] <wikibugs>	 (CR) Ema: [C: 1] Add IPVSError as a generic IPVS-related exception [debs/pybal] (2.0-dev) - https://gerrit.wikimedia.org/r/354507 (owner: Giuseppe Lavagetto)
[15:58:26] <wikibugs>	 Operations, ops-eqiad, Labs, procurement: rack/setup/install labmon1003 - https://phabricator.wikimedia.org/T165784#3276770 (RobH)
[15:59:41] <wikibugs>	 Operations, ops-eqiad, Labs, procurement: rack/setup/install labmon1003 - https://phabricator.wikimedia.org/T165784#3276770 (RobH) Please note that once the initial onsite-specific steps are done (steps up to the network port setup), I can handle the operations/puppet repo updates and install the...
[15:59:55] <wikibugs>	 Operations, ops-eqiad, Labs: rack/setup/install labmon1003 - https://phabricator.wikimedia.org/T165784#3276806 (RobH)
[16:01:56] <wikibugs>	 Operations, DBA, Performance-Team, Traffic: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3276828 (jcrespo) Lots of category pages invalidations happening at that time:  ``` UPDATE /* Title::invalidateCache  */  `page` SE...
[16:02:44] <wikibugs>	 Operations, Labs, hardware-requests: eqiad: (1) hardware access request for dedicated labmon1002 - https://phabricator.wikimedia.org/T161750#3276850 (RobH) Open>Resolved This has been ordered on T163808 and its setup will be tracked on T165784.
[16:04:52] <wikibugs>	 Operations, DBA, Performance-Team, Traffic: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3276894 (jcrespo) For the long term: how useful is this field, and could it be separated from the rest of the table if it happens t...
[16:07:20] <Danny_B>	 Request from 217.196.74.137 via cp3041 cp3041, Varnish XID 186936691
[16:07:23] <Danny_B>	 Error: 503, Service Unavailable at Fri, 19 May 2017 16:06:48 GMT
[16:08:22] <wikibugs>	 Operations, ops-codfw, DBA: Degraded RAID on db2058 - https://phabricator.wikimedia.org/T165629#3276925 (Papaul) a:Papaul>Marostegui Disk replacement complete. Return information for bad disk attached. {F8123582}
[16:23:09] <Danny_B>	 _joe_: where can i find you?
[16:23:13] <wikibugs>	 Operations, ops-codfw: mw2221 stuck after reboot - https://phabricator.wikimedia.org/T165734#3277017 (Papaul) a:Papaul>MoritzMuehlenhoff - Removed both PSU for about 5 minutes - Update the IDRAC firmware from 2.30 to 2.41  - update BIOS from 1.6.1 to 2.4.2  System is back up again.
[16:25:55] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw2221 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[16:26:15] <icinga-wm>	 PROBLEM - Check systemd state on mw2221 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[16:28:39] <Danny_B>	 i'm constantly getting 500 when trying to load my watchlist. for about 30 mins
[16:30:07] <paladox>	 Danny_B some of the ops are currently at the hackathon.
[16:30:42] <paladox>	 Danny_B where are you experiencing this?
[16:30:46] <paladox>	 Works for me on https://en.wikipedia.org/wiki/Special:Watchlist
[16:31:15] <Danny_B>	 paladox: i know, but i can't find them
[16:31:22] <paladox>	 Oh
[16:31:33] <Danny_B>	 hence why i'm trying to poke it here
[16:32:23] <paladox>	 ok
[16:33:47] <wikibugs>	 (PS2) Andrew Bogott: openstackclients:  add an optional project arg to allinstances() [puppet] - https://gerrit.wikimedia.org/r/354515
[16:33:49] <wikibugs>	 (PS2) Andrew Bogott: novastats:  Update some reports to use more up-to-date code. [puppet] - https://gerrit.wikimedia.org/r/354516
[16:34:07] <paladox>	 Danny_B is it intermittent? Could you create a task in phab please?
[16:35:50] <wikibugs>	 (PS1) Reedy: Throttle rule for Wikimedia Hackathon Vienna... [mediawiki-config] - https://gerrit.wikimedia.org/r/354532
[16:36:39] <Reedy>	 Danny_B: They're hiding somewhere apparently
[16:37:25] <wikibugs>	 (CR) jerkins-bot: [V: -1] Throttle rule for Wikimedia Hackathon Vienna... [mediawiki-config] - https://gerrit.wikimedia.org/r/354532 (owner: Reedy)
[16:37:51] <wikibugs>	 (PS2) Reedy: Throttle rule for Wikimedia Hackathon Vienna... [mediawiki-config] - https://gerrit.wikimedia.org/r/354532
[16:38:55] <Reedy>	 Danny_B: Big watchlist?
[16:40:19] <Danny_B>	 Reedy: wouldn't say so... i don't remember particular number, but i guess less than 300
[16:40:33] <Danny_B>	 it was working properly 5 days ago
[16:41:18] <wikibugs>	 (CR) Reedy: [C: 2] Throttle rule for Wikimedia Hackathon Vienna... [mediawiki-config] - https://gerrit.wikimedia.org/r/354532 (owner: Reedy)
[16:42:34] <gwicke>	 https://logstash.wikimedia.org/goto/a4f68641242e49dcadf151aa316f609e has some information related to Danny_B's issue
[16:42:53] <wikibugs>	 (Merged) jenkins-bot: Throttle rule for Wikimedia Hackathon Vienna... [mediawiki-config] - https://gerrit.wikimedia.org/r/354532 (owner: Reedy)
[16:43:06] <wikibugs>	 (CR) jenkins-bot: Throttle rule for Wikimedia Hackathon Vienna... [mediawiki-config] - https://gerrit.wikimedia.org/r/354532 (owner: Reedy)
[16:44:32] <gwicke>	 https://gerrit.wikimedia.org/r/#/c/350914/
[16:44:39] <gwicke>	 possible fix for the issue ^^
[16:44:49] <logmsgbot>	 !log reedy@tin Synchronized wmf-config/throttle.php: Wikimedia Vienna Hackathon (duration: 00m 39s)
[16:44:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:45:11] <gwicke>	 https://phabricator.wikimedia.org/T164059
[16:46:09] <paladox>	 gwicke for the issue Danny_B is having?
[16:46:20] <Danny_B>	 yes
[16:49:16] <Danny_B>	 the issue is obviously with enhanced recent changes / watchlist
[16:51:49] <bawolff_>	 Wait is that issue still unresolved?
[16:52:13] <Danny_B>	 obviously yes
[16:52:25] <Danny_B>	 because normal watchlist works for me now
[16:52:31] <Danny_B>	 enhanced not
[17:16:45] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2221 is CRITICAL: Host mw2221 is not in mediawiki-installation dsh group
[17:22:15] <icinga-wm>	 PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[17:24:48] <bawolff_>	 I'm just kind of surprised it's not fixed, seemed kind of trivial
[17:33:15] <icinga-wm>	 PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[17:35:23] <wikibugs>	 (PS1) Ema: Set empty PYTHONPATH in tox.ini [debs/pybal] - https://gerrit.wikimedia.org/r/354547
[17:38:04] <wikibugs>	 (PS1) Dzahn: wikistats: make db_pass a parameter, use fqdn_rand_string [puppet] - https://gerrit.wikimedia.org/r/354548
[17:39:32] <wikibugs>	 (CR) jerkins-bot: [V: -1] wikistats: make db_pass a parameter, use fqdn_rand_string [puppet] - https://gerrit.wikimedia.org/r/354548 (owner: Dzahn)
[17:40:21] <wikibugs>	 (PS2) Dzahn: wikistats: make db_pass a parameter, use fqdn_rand_string [puppet] - https://gerrit.wikimedia.org/r/354548
[17:41:10] <wikibugs>	 (PS3) Dzahn: wikistats: make db_pass a parameter, use fqdn_rand_string [puppet] - https://gerrit.wikimedia.org/r/354548
[17:41:15] <icinga-wm>	 RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:43:35] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 233882
[17:44:31] <Danny_B>	 bawolff_: you are more than welcome to take it over...
[17:52:13] <wikibugs>	 (PS1) Dereckson: Always show latest revision even if not reviewed on hu.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/354549 (https://phabricator.wikimedia.org/T121995)
[18:14:25] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1295 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[18:15:25] <icinga-wm>	 RECOVERY - Check HHVM threads for leakage on mw1295 is OK: OK
[18:21:55] <icinga-wm>	 PROBLEM - MD RAID on mw1294 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:22:45] <icinga-wm>	 RECOVERY - MD RAID on mw1294 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[18:28:45] <icinga-wm>	 PROBLEM - HHVM rendering on mw1295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:29:35] <icinga-wm>	 RECOVERY - HHVM rendering on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 80292 bytes in 1.303 second response time
[18:31:55] <icinga-wm>	 PROBLEM - Check systemd state on mw1294 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:32:45] <icinga-wm>	 RECOVERY - Check systemd state on mw1294 is OK: OK - running: The system is fully operational
[19:10:17] <wikibugs>	 Operations, Traffic, fundraising-tech-ops: Fix nits in HTTPS/HSTS configs in wikimedia.org domain - https://phabricator.wikimedia.org/T137161#3277249 (Jgreen)
[19:10:43] <wikibugs>	 Operations, Traffic, fundraising-tech-ops: Fix nits in HTTPS/HSTS configs in wikimedia.org domain - https://phabricator.wikimedia.org/T137161#2359459 (Jgreen)
[19:11:22] <wikibugs>	 Operations, Traffic, fundraising-tech-ops: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#2359459 (Jgreen)
[19:13:11] <wikibugs>	 Operations, Traffic, fundraising-tech-ops: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3277260 (Jgreen) a:Jgreen>None This isn't something fr-tech-ops can fix, it's an external site.
[19:13:36] <icinga-wm>	 PROBLEM - nova-compute process on labvirt1009 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute
[19:14:35] <icinga-wm>	 RECOVERY - nova-compute process on labvirt1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute
[19:19:43] <wikibugs>	 Operations, Traffic, fundraising-tech-ops: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3277275 (Krinkle)
[19:24:19] <wikibugs>	 (PS4) Dzahn: wikistats: make db_pass a parameter, use fqdn_rand_string [puppet] - https://gerrit.wikimedia.org/r/354548
[19:28:42] <wikibugs>	 (PS5) Dzahn: wikistats: make db_pass a parameter, use fqdn_rand_string [puppet] - https://gerrit.wikimedia.org/r/354548
[19:30:40] <wikibugs>	 (PS6) Dzahn: wikistats: make db_pass a parameter, use fqdn_rand_string [puppet] - https://gerrit.wikimedia.org/r/354548
[19:34:59] <wikibugs>	 (PS7) Dzahn: wikistats: make db_pass a parameter, use fqdn_rand_string [puppet] - https://gerrit.wikimedia.org/r/354548
[19:42:27] <wikibugs>	 (CR) Dzahn: [C: 2] wikistats: make db_pass a parameter, use fqdn_rand_string [puppet] - https://gerrit.wikimedia.org/r/354548 (owner: Dzahn)
[19:56:05] <wikibugs>	 (PS1) Dzahn: wikistats: add missing .erb file extension to grants.sql [puppet] - https://gerrit.wikimedia.org/r/354567
[19:56:10] <wikibugs>	 Operations, DBA, Performance-Team, Traffic: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3277390 (aaron) The query does not come from HTMLCacheUpdateJob (which calls HTMLCacheUpdateJob::invalidateTitles) or seemingly any...
[19:56:37] <wikibugs>	 (CR) Dzahn: [V: 2 C: 2] wikistats: add missing .erb file extension to grants.sql [puppet] - https://gerrit.wikimedia.org/r/354567 (owner: Dzahn)
[20:02:46] <wikibugs>	 Operations, MediaWiki-Cache, MediaWiki-JobQueue, Performance-Team, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#3277398 (aaron) a:aaron>None
[20:03:54] <wikibugs>	 Operations, DBA, Performance-Team, Traffic, Wikidata: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3277400 (aaron)
[20:07:34] <wikibugs>	 (PS1) Dzahn: wikistats: fix typo in db.pp "requires" -> "require" [puppet] - https://gerrit.wikimedia.org/r/354568
[20:12:58] <wikibugs>	 (CR) Dzahn: [C: 2] wikistats: fix typo in db.pp "requires" -> "require" [puppet] - https://gerrit.wikimedia.org/r/354568 (owner: Dzahn)
[20:43:52] <wikibugs>	 (PS1) Dereckson: Fix hy.wikipedia high resolution logos [mediawiki-config] - https://gerrit.wikimedia.org/r/354586 (https://phabricator.wikimedia.org/T165811)
[20:46:05] <wikibugs>	 (PS1) Dzahn: wikistats: use systemuser for git cloning [puppet] - https://gerrit.wikimedia.org/r/354588
[21:00:29] <wikibugs>	 (CR) Dzahn: [C: 2] wikistats: use systemuser for git cloning [puppet] - https://gerrit.wikimedia.org/r/354588 (owner: Dzahn)
[21:05:47] <wikibugs>	 (PS1) Dzahn: wikistats: 'user' -> 'owner' parameter for /srv/wikistats [puppet] - https://gerrit.wikimedia.org/r/354591
[21:07:01] <wikibugs>	 (CR) Paladox: [C: 1] wikistats: 'user' -> 'owner' parameter for /srv/wikistats [puppet] - https://gerrit.wikimedia.org/r/354591 (owner: Dzahn)
[21:07:16] <wikibugs>	 (CR) Dzahn: [C: 2] wikistats: 'user' -> 'owner' parameter for /srv/wikistats [puppet] - https://gerrit.wikimedia.org/r/354591 (owner: Dzahn)
[21:17:35] <icinga-wm>	 PROBLEM - puppet last run on ms-be2029 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[parted-/dev/sdf]
[21:24:23] <wikibugs>	 (PS1) Reedy: Remove wikimedia-periodic-update.sh [puppet] - https://gerrit.wikimedia.org/r/354596
[21:46:45] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[21:46:46] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[21:46:48] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[21:46:48] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[21:46:48] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[21:46:48] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[21:47:55] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[21:48:45] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy
[21:48:45] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy
[21:48:45] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy
[21:48:45] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy
[21:48:45] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[21:48:45] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[21:48:45] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy
[21:56:51] <wikibugs>	 (PS1) Nemo bis: Remove $wgEnableValidationStatisticsUpdates from FlaggedRevs config [mediawiki-config] - https://gerrit.wikimedia.org/r/354600
[22:26:20] <wikibugs>	 (PS1) Dzahn: wikistats: puppetize missing dir in /usr/lib/ [puppet] - https://gerrit.wikimedia.org/r/354603
[22:27:17] <wikibugs>	 (CR) jerkins-bot: [V: -1] wikistats: puppetize missing dir in /usr/lib/ [puppet] - https://gerrit.wikimedia.org/r/354603 (owner: Dzahn)
[22:27:26] <wikibugs>	 (CR) Paladox: wikistats: puppetize missing dir in /usr/lib/ (1 comment) [puppet] - https://gerrit.wikimedia.org/r/354603 (owner: Dzahn)
[22:28:08] <wikibugs>	 (PS2) Dzahn: wikistats: puppetize missing dir in /usr/lib/ [puppet] - https://gerrit.wikimedia.org/r/354603
[22:29:29] <wikibugs>	 (CR) Paladox: [C: 1] "lgtm :)" [puppet] - https://gerrit.wikimedia.org/r/354603 (owner: Dzahn)
[22:33:24] <wikibugs>	 (CR) Dzahn: "actually.. no.. that should be the home dir of the system user so it should be created when the user gets created." [puppet] - https://gerrit.wikimedia.org/r/354603 (owner: Dzahn)
[22:40:43] <wikibugs>	 (PS3) Dzahn: wikistats: ensure systemuser exists before backup dir [puppet] - https://gerrit.wikimedia.org/r/354603
[22:41:08] <wikibugs>	 (CR) Paladox: [C: 1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/354603 (owner: Dzahn)
[22:42:28] <wikibugs>	 (CR) Dzahn: [C: 2] wikistats: ensure systemuser exists before backup dir [puppet] - https://gerrit.wikimedia.org/r/354603 (owner: Dzahn)
[22:58:13] <wikibugs>	 (PS2) Dereckson: Fix hy.wikipedia high resolution logos [mediawiki-config] - https://gerrit.wikimedia.org/r/354586 (https://phabricator.wikimedia.org/T165811)
[23:07:58] <wikibugs>	 (PS1) Nemo bis: Restore default $wgFlaggedRevsStatsAge (2 hours) [mediawiki-config] - https://gerrit.wikimedia.org/r/354608 (https://phabricator.wikimedia.org/T163107)
[23:09:40] <wikibugs>	 (CR) Nemo bis: "Aaron, do you know what this was meant to do? See also f3ac9d067da1b8c27f94050cc9bc0251210d8415" [mediawiki-config] - https://gerrit.wikimedia.org/r/354608 (https://phabricator.wikimedia.org/T163107) (owner: Nemo bis)
[23:09:46] <wikibugs>	 (Draft1) Paladox: Wikistats: Require /usr/lib/wikistats/schema.sql before executing mysql command [puppet] - https://gerrit.wikimedia.org/r/354609
[23:09:48] <wikibugs>	 (PS2) Paladox: Wikistats: Require /usr/lib/wikistats/schema.sql before executing mysql command [puppet] - https://gerrit.wikimedia.org/r/354609
[23:11:46] <wikibugs>	 (PS3) Paladox: wikistats: Require /usr/lib/wikistats/schema.sql before executing mysql command [puppet] - https://gerrit.wikimedia.org/r/354609
[23:13:26] <wikibugs>	 (CR) Dzahn: [C: 2] "yep, we needed that. thx" [puppet] - https://gerrit.wikimedia.org/r/354609 (owner: Paladox)
[23:46:12] <wikibugs>	 (PS1) BryanDavis: Add Code of Conduct footer links to wikitech and mw.o [mediawiki-config] - https://gerrit.wikimedia.org/r/354612
[23:54:55] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[23:55:45] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[23:55:55] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[23:55:55] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[23:55:55] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[23:56:55] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[23:56:55] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[23:57:45] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[23:57:45] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[23:57:45] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy
[23:57:45] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy
[23:57:45] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy
[23:57:45] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy
[23:57:45] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy