[00:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160616T0000). [00:13:00] sync-dir hung for me at 99%. I'll give it a couple minutes, then retry. [00:14:20] PROBLEM - Apache HTTP on mw1147 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50408 bytes in 9.487 second response time [00:15:00] PROBLEM - HHVM rendering on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:15:40] PROBLEM - puppet last run on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:15:52] PROBLEM - Disk space on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:15:52] PROBLEM - salt-minion processes on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:16:00] PROBLEM - nutcracker port on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:16:42] !log mattflaschen@tin Synchronized php-1.28.0-wmf.6/extensions/Kartographer: Search for maplinks inside and outside of content. (duration: 01m 08s) [00:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:16:50] PROBLEM - Check size of conntrack table on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:16:51] PROBLEM - SSH on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:17:04] yurik, okay, it's correct on 387 servers, 1 failed, so it should be fine, unless you're doing an important demo. 
[00:17:22] 00:16:42 ['/usr/bin/scap', 'pull', '--no-update-l10n', '--include', 'php-1.28.0-wmf.6', '--include', 'php-1.28.0-wmf.6/extensions', '--include', 'php-1.28.0-wmf.6/extensions/Kartographer', '--include', 'php-1.28.0-wmf.6/extensions/Kartographer/***', 'mw1097.eqiad.wmnet', 'mw1161.eqiad.wmnet', 'mw1010.eqiad.wmnet', 'mw2119.codfw.wmnet', 'mw2215.codfw.wmnet', 'mw2080.codfw.wmnet', 'mw1201.eqiad.wmnet', 'mw2187.codfw.wmnet', ' [00:17:24] mw1216.eqiad.wmnet'] on mw1147.eqiad.wmnet returned [255]: Connection to 10.64.16.127 timed out while waiting to read [00:17:30] PROBLEM - DPKG on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:17:31] Which probably-not-coincidentally is the same as: [00:17:31] PROBLEM - configured eth on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:17:40] PROBLEM - SSH on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:17:42] PROBLEM - dhclient process on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:18:17] ^ robh , sync problems. [00:18:45] yurik, the second time I ran it, it did complete, one apache failed though. [00:18:53] mw1147 died out? [00:19:00] rip mw1147 [00:19:05] thx matt_flaschen ! [00:19:06] lemme take a peek at it [00:19:38] robh, if that's what that scap pull error above means. Thanks for checking. [00:20:03] well, it has that plus then icinga shows it falling over [00:20:03] Scap complete [00:20:11] it was likely taxed and scap killed it, it happens. [00:21:09] trying to login is stalling out. [00:21:14] (from serial) [00:21:52] !log mw1147 seems to have died during scap, unresponsive from serial console, powercycled [00:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:22:23] matt_flaschen: what is the scap command i need to run to bring it back up to snuff once its online? 
[00:22:31] PROBLEM - nutcracker process on mw1147 is CRITICAL: Timeout while attempting connection [00:22:31] PROBLEM - HHVM processes on mw1147 is CRITICAL: Timeout while attempting connection [00:22:37] other than scapping everything again which seems excessive [00:22:51] (if you know that is ;) [00:22:58] robh: sync-common [00:23:09] cool, things havent changed that much then yay [00:23:20] i'll babysit its reboot and run that once its os is back [00:23:48] oh, it's called "scap pull" now [00:24:02] see, i wouldnt have known that, thank you =] [00:24:11] RECOVERY - Disk space on mw1147 is OK: DISK OK [00:24:11] RECOVERY - salt-minion processes on mw1147 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:24:12] RECOVERY - nutcracker port on mw1147 is OK: TCP OK - 0.000 second response time on port 11212 [00:24:32] RECOVERY - nutcracker process on mw1147 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [00:24:32] RECOVERY - HHVM processes on mw1147 is OK: PROCS OK: 6 processes with command name hhvm [00:24:38] !log mw1147 rebooted and manually running scap pull [00:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:25:00] RECOVERY - Apache HTTP on mw1147 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.234 second response time [00:25:02] RECOVERY - Check size of conntrack table on mw1147 is OK: OK: nf_conntrack is 0 % full [00:25:03] hrmm, maybe i should have screened that, here is hoping it doesnt take too long [00:25:11] RECOVERY - SSH on mw1147 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0) [00:25:21] * robh has to leave his place in les than 35 minutes [00:25:50] RECOVERY - DPKG on mw1147 is OK: All packages OK [00:25:51] RECOVERY - configured eth on mw1147 is OK: OK - interfaces up [00:25:55] 00:24:25 Copying to mw1147.eqiad.wmnet from deployment.eqiad.wmnet [00:25:55] 00:24:25 Started rsync common [00:26:00] and 
waiting, heh. [00:26:01] RECOVERY - dhclient process on mw1147 is OK: PROCS OK: 0 processes with command name dhclient [00:26:08] it should take a minute or two iirc [00:26:10] RECOVERY - puppet last run on mw1147 is OK: OK: Puppet is currently enabled, last run 43 minutes ago with 0 failures [00:26:30] 00:26:23 Finished rsync common (duration: 01m 57s) [00:26:33] legoktm: you are correct [00:27:01] :) [00:27:04] https://wikitech.wikimedia.org/w/index.php?title=Wikimedia_binaries&type=revision&diff=657357&oldid=539850 [00:27:31] RECOVERY - HHVM rendering on mw1147 is OK: HTTP OK: HTTP/1.1 200 OK - 66410 bytes in 0.179 second response time [00:28:11] see, i wouldnt have known that, thank you =] [00:28:26] yeah but neither does anyone else, I think it tells you about the new command [00:29:02] change is hard, so is reading [00:29:26] jouncebot: doing the needful. [00:29:55] updating phabricator, downtime will be minimal [00:33:20] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [00:38:58] (03PS1) 10Luke081515: Two permission changes at urwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294652 (https://phabricator.wikimedia.org/T137888) [00:41:08] !log taking phabricator offline momentarily for scheduled maintenance. [00:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:43:17] !log phabricator upgrade/maintenance complete. Everything appears to be back up and running normally. [00:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:51:32] (03PS1) 1020after4: force HTTPS when x-forwarded-for header is set [puppet] - 10https://gerrit.wikimedia.org/r/294653 [00:52:07] can I get an opsen to merge https://gerrit.wikimedia.org/r/#/c/294653/ so that I can re-enable puppet on iridium? 
[00:52:43] (CR) jenkins-bot: [V: -1] force HTTPS when x-forwarded-for header is set [puppet] - https://gerrit.wikimedia.org/r/294653 (owner: 20after4) [00:52:52] hmm [00:54:16] wtf ...why is pep8 voting on puppet repo? that failure can't be related to my change anyway. [00:56:29] (CR) 20after4: [C: 1] "jenkins-bot is a liar. nothing wrong with this commit" [puppet] - https://gerrit.wikimedia.org/r/294653 (owner: 20after4) [00:56:42] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [00:57:22] !log puppet disabled on iridium because https://gerrit.wikimedia.org/r/#/c/294653/ needs to merge (hotfix in preamble.php which puppet will undo if it's allowed to run) [00:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:03:31] (CR) Legoktm: "recheck" [puppet] - https://gerrit.wikimedia.org/r/294653 (owner: 20after4) [01:06:52] (CR) 20after4: "legoktm: the commit doesn't even touch python code at all. 
this should not be a voting test if the repository state is already failing by " [puppet] - https://gerrit.wikimedia.org/r/294653 (owner: 20after4) [01:23:06] (PS11) MaxSem: Script to do the initial data load from OSM for Maps project [puppet] - https://gerrit.wikimedia.org/r/293105 (owner: Gehel) [01:24:15] (CR) jenkins-bot: [V: -1] Script to do the initial data load from OSM for Maps project [puppet] - https://gerrit.wikimedia.org/r/293105 (owner: Gehel) [01:27:36] Operations, Maps-Sprint: Increase frequency of OSM replication - https://phabricator.wikimedia.org/T137939#2384405 (MaxSem) [01:31:08] (CR) 20after4: "This is needed before re-enabling puppet on iridium" [puppet] - https://gerrit.wikimedia.org/r/294653 (owner: 20after4) [02:01:27] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [02:13:36] PROBLEM - puppet last run on db2043 is CRITICAL: CRITICAL: puppet fail [02:27:38] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:34:38] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.5) (duration: 15m 49s) [02:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:34:58] PROBLEM - HHVM rendering on mw1137 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:35:57] PROBLEM - Apache HTTP on mw1137 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:35:57] PROBLEM - puppet last run on mw1137 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:36:17] PROBLEM - DPKG on mw1137 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:36:18] PROBLEM - nutcracker port on mw1137 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:36:36] PROBLEM - Disk space on mw1137 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:36:37] PROBLEM - Check size of conntrack table on mw1137 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:36:56] PROBLEM - HHVM processes on mw1137 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:36:58] PROBLEM - nutcracker process on mw1137 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:37:26] PROBLEM - SSH on mw1137 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:37:28] PROBLEM - dhclient process on mw1137 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:37:46] PROBLEM - configured eth on mw1137 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:38:08] PROBLEM - salt-minion processes on mw1137 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:38:58] RECOVERY - puppet last run on db2043 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [02:44:18] RECOVERY - puppet last run on mw1137 is OK: OK: Puppet is currently enabled, last run 25 minutes ago with 0 failures [02:44:18] RECOVERY - salt-minion processes on mw1137 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [02:44:36] RECOVERY - DPKG on mw1137 is OK: All packages OK [02:44:37] RECOVERY - nutcracker port on mw1137 is OK: TCP OK - 0.000 second response time on port 11212 [02:44:48] RECOVERY - Disk space on mw1137 is OK: DISK OK [02:44:48] RECOVERY - Check size of conntrack table on mw1137 is OK: OK: nf_conntrack is 0 % full [02:45:07] RECOVERY - HHVM processes on mw1137 is OK: PROCS OK: 12 processes with command name hhvm [02:45:17] RECOVERY - nutcracker process on mw1137 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [02:45:37] RECOVERY - SSH on mw1137 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0) [02:45:46] RECOVERY - dhclient process on mw1137 is OK: PROCS OK: 0 processes with command name dhclient [02:45:57] RECOVERY - configured eth on mw1137 is OK: OK - interfaces up [02:51:37] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: puppet fail [02:52:37] PROBLEM - puppet last run on mw1137 is CRITICAL: CRITICAL: 
Puppet has 81 failures [03:16:53] (03PS1) 10KartikMistry: apertium-arg: Initial Debian packaging [debs/contenttranslation/apertium-arg] - 10https://gerrit.wikimedia.org/r/294657 (https://phabricator.wikimedia.org/T124369) [03:20:49] (03PS1) 10KartikMistry: apertium-spa: Initial Debian packaging [debs/contenttranslation/apertium-spa] - 10https://gerrit.wikimedia.org/r/294658 (https://phabricator.wikimedia.org/T124370) [03:21:08] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [03:22:56] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] [03:23:08] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [04:32:17] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [05:35:36] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [05:37:37] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5215616 keys - replication_delay is 0 [05:56:16] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [06:17:12] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [06:29:32] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:33:53] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:03] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 2 failures [06:34:14] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:02] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:56:33] RECOVERY - puppet 
last run on mw2207 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:57:24] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:03] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:14] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:04:37] <_joe_> hi, puppetmaster [07:05:02] <_joe_> you know you're not even the lamest piece of software I ever had to manage? [07:10:02] 07Blocked-on-Operations, 10Datasets-Archiving, 10Dumps-Generation, 10Flow, 03Collab-Team-2016-Apr-Jun-Q4: Publish recurring Flow dumps at http://dumps.wikimedia.org/ - https://phabricator.wikimedia.org/T119511#2384613 (10Nemo_bis) >>! In T119511#2379060, @ArielGlenn wrote: > Uh, this is done, insofar as... [08:03:48] (03PS1) 10Mobrovac: RESTBase: Make sendind resource_change events optional [puppet] - 10https://gerrit.wikimedia.org/r/294669 [08:05:08] (03CR) 10jenkins-bot: [V: 04-1] RESTBase: Make sendind resource_change events optional [puppet] - 10https://gerrit.wikimedia.org/r/294669 (owner: 10Mobrovac) [08:06:26] PROBLEM - configured eth on mw2247 is CRITICAL: Timeout while attempting connection [08:06:47] PROBLEM - mediawiki-installation DSH group on mw2247 is CRITICAL: Host mw2247 is not in mediawiki-installation dsh group [08:06:47] PROBLEM - dhclient process on mw2247 is CRITICAL: Timeout while attempting connection [08:07:10] (03Abandoned) 10Hashar: contint: cleanup gallium / use contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/293283 (https://phabricator.wikimedia.org/T137358) (owner: 10Hashar) [08:07:16] PROBLEM - nutcracker port on mw2247 is CRITICAL: Timeout while attempting connection [08:07:17] PROBLEM - HHVM jobrunner on mw2247 is CRITICAL: Connection timed out [08:07:46] PROBLEM - nutcracker process on mw2247 is CRITICAL: Timeout 
while attempting connection [08:07:57] PROBLEM - puppet last run on mw2247 is CRITICAL: Timeout while attempting connection [08:08:16] PROBLEM - salt-minion processes on mw2247 is CRITICAL: Timeout while attempting connection [08:08:27] (03CR) 10Mobrovac: "The tox failure has nothing to do with this patch ... This is becoming a bit annoying, honestly." [puppet] - 10https://gerrit.wikimedia.org/r/294669 (owner: 10Mobrovac) [08:08:36] PROBLEM - Check size of conntrack table on mw2247 is CRITICAL: Timeout while attempting connection [08:08:56] PROBLEM - DPKG on mw2247 is CRITICAL: Timeout while attempting connection [08:09:06] PROBLEM - Disk space on mw2247 is CRITICAL: Timeout while attempting connection [08:09:07] <_joe_> I am imaging a few servers [08:09:37] PROBLEM - MD RAID on mw2247 is CRITICAL: Timeout while attempting connection [08:15:12] (03PS1) 10KartikMistry: apertium-es-ca: Rebuild for Jessie and other fixes [debs/contenttranslation/apertium-es-ca] - 10https://gerrit.wikimedia.org/r/294671 (https://phabricator.wikimedia.org/T107306) [08:15:25] !log rebooting db1085 before putting it back into production [08:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:18:13] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2384669 (10KartikMistry) [08:19:41] (03CR) 10Mobrovac: "OK'ed by the PCC - https://puppet-compiler.wmflabs.org/3131/" [puppet] - 10https://gerrit.wikimedia.org/r/294669 (owner: 10Mobrovac) [08:20:52] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2384675 (10KartikMistry) [08:24:08] RECOVERY - MD RAID on mw2247 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [08:24:17] 
RECOVERY - nutcracker process on mw2247 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [08:24:56] RECOVERY - salt-minion processes on mw2247 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:25:07] RECOVERY - Check size of conntrack table on mw2247 is OK: OK: nf_conntrack is 0 % full [08:25:08] RECOVERY - configured eth on mw2247 is OK: OK - interfaces up [08:25:28] RECOVERY - dhclient process on mw2247 is OK: PROCS OK: 0 processes with command name dhclient [08:25:28] RECOVERY - DPKG on mw2247 is OK: All packages OK [08:25:46] RECOVERY - Disk space on mw2247 is OK: DISK OK [08:25:56] RECOVERY - nutcracker port on mw2247 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [08:30:26] RECOVERY - HHVM jobrunner on mw2247 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.083 second response time [08:33:16] PROBLEM - puppet last run on mw2247 is CRITICAL: CRITICAL: Puppet has 5 failures [08:33:34] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2384707 (10KartikMistry) [08:35:49] (03PS1) 10Jcrespo: Pool db1085, increase weight of all new db servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294672 (https://phabricator.wikimedia.org/T133398) [08:37:18] (03PS2) 10Muehlenhoff: services firejail: make fs blacklist more obvious [puppet] - 10https://gerrit.wikimedia.org/r/293515 (owner: 10JanZerebecki) [08:38:46] (03CR) 10jenkins-bot: [V: 04-1] services firejail: make fs blacklist more obvious [puppet] - 10https://gerrit.wikimedia.org/r/293515 (owner: 10JanZerebecki) [08:39:09] (03CR) 10Jcrespo: [C: 032] Pool db1085, increase weight of all new db servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294672 (https://phabricator.wikimedia.org/T133398) (owner: 10Jcrespo) [08:40:41] is the 
operations-puppet-tox-jessie check now active? I'm wondering why https://gerrit.wikimedia.org/r/293515 failed jenkins? [08:41:02] <_joe_> it's active, yes [08:41:08] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Pool db1085, increase weight of all new db servers (duration: 00m 29s) [08:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:41:26] ERROR: InvocationError: '/home/jenkins/workspace/operations-puppet-tox-jessie/.tox/pep8/bin/flake8' [08:41:27] <_joe_> when it's not active you see "non-voting" on the side [08:42:37] (PS1) KartikMistry: apertium-eus: Rebuild for Jessie and other fixes [debs/contenttranslation/apertium-eus] - https://gerrit.wikimedia.org/r/294673 (https://phabricator.wikimedia.org/T107306) [08:42:42] ah, ok [08:45:38] so yeah yesterday [08:45:54] I have phased out the legacy job that was running pep8 1.4.6 in each directory containing python scripts [08:46:02] and switched to a job that runs 'tox' from the root of the repo [08:46:13] made possible thanks to Bryan and all reviewers that fixed all the python linting issues we had [08:46:23] so now CI ends up doing something like: [08:46:26] pip install flake8 [08:46:27] flake8 [08:46:59] Operations, DBA: High replication lag to dewiki - https://phabricator.wikimedia.org/T135100#2384720 (jcrespo) [08:47:01] Operations, DBA, Patch-For-Review: reimage or decom db servers on precise - https://phabricator.wikimedia.org/T125028#2384721 (jcrespo) [08:47:07] Operations, DBA: Physical location SPOF because of database server distribution on a single rack (D1) - https://phabricator.wikimedia.org/T111992#2384723 (jcrespo) [08:47:10] Operations, DBA, Patch-For-Review: Install, configure and provision recently arrived db core machines - https://phabricator.wikimedia.org/T133398#2384717 (jcrespo) Open>Resolved All 16 new servers (21 in total, 3 per shard) are pooled into production- we will do some adjustments over the 
foll... [08:47:16] bonus point, you can run 'tox' on your local machine to reproduce what CI is doing [08:48:07] RECOVERY - puppet last run on mw2247 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [08:48:23] oh [08:48:25] moritzm: got it [08:48:33] moritzm: we have defined the dependency as "flake8" [08:48:49] so that download whatever new version from pypi and one got released yesterday :( [08:50:48] Operations, DBA, Epic: Eliminate SPOF at the main database infrastructure - https://phabricator.wikimedia.org/T119626#2384731 (jcrespo) [08:50:49] Operations, DBA: Physical location SPOF because of database server distribution on a single rack (D1) - https://phabricator.wikimedia.org/T111992#2384728 (jcrespo) Open>Resolved a:jcrespo This is now fixed, D1 is no longer a SPOF. Although somehow heavy, if D1 or the whole D row went down, we... [08:50:55] (PS1) Hashar: Explicitly pin flake8 to 2.5.5 [puppet] - https://gerrit.wikimedia.org/r/294674 [08:51:13] _joe_ moritzm : I guess we want to explicitly pin the flake8 version being used https://gerrit.wikimedia.org/r/294674 [08:51:20] since upstream tends to add new checks from time to time [08:51:29] (specially on a new minor version) [08:52:43] looks, good I'll merge [08:53:09] (CR) Muehlenhoff: [C: 2 V: 2] Explicitly pin flake8 to 2.5.5 [puppet] - https://gerrit.wikimedia.org/r/294674 (owner: Hashar) [08:53:26] then 'recheck' your patch and it shall pass [08:53:36] k [08:53:39] (since your open patch is going to be tested as a merge on tip of production branch) [08:53:52] sorry should have thought about pinning the version [08:54:03] there is nothing more annoying than a Jenkins job failing for unrelated reasons [08:56:08] np, the joys of npm/pip etc. pp :-) [08:56:12] mobile and I are going to push a fix for MobileFrontend Special:Nearby . 
It has some javascript error due to a missing dependency in the RL definition [08:56:26] https://phabricator.wikimedia.org/T137919 for the bug and wmf.6 patch is https://gerrit.wikimedia.org/r/#/c/294649/ [08:56:35] (03PS3) 10Muehlenhoff: services firejail: make fs blacklist more obvious [puppet] - 10https://gerrit.wikimedia.org/r/293515 (owner: 10JanZerebecki) [09:00:18] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2384745 (10KartikMistry) [09:02:17] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [09:04:34] PROBLEM - Apache HTTP on mw1278 is CRITICAL: Connection timed out [09:05:26] (03PS1) 10KartikMistry: apertium-hbs: Rebuild for Jessie and other fixes [debs/contenttranslation/apertium-hbs] - 10https://gerrit.wikimedia.org/r/294675 (https://phabricator.wikimedia.org/T107306) [09:05:44] PROBLEM - puppet last run on mw1278 is CRITICAL: Timeout while attempting connection [09:06:14] PROBLEM - salt-minion processes on mw1278 is CRITICAL: Timeout while attempting connection [09:07:04] PROBLEM - Check size of conntrack table on mw1278 is CRITICAL: Timeout while attempting connection [09:07:05] PROBLEM - DPKG on mw1278 is CRITICAL: Timeout while attempting connection [09:07:25] PROBLEM - Disk space on mw1278 is CRITICAL: Timeout while attempting connection [09:07:54] PROBLEM - MD RAID on mw1278 is CRITICAL: Timeout while attempting connection [09:08:24] <_joe_> it's I am installing that system [09:08:44] PROBLEM - configured eth on mw1278 is CRITICAL: Timeout while attempting connection [09:08:58] 06Operations: ffmpeg/libav on jessie video scalers - https://phabricator.wikimedia.org/T137886#2384763 (10MoritzMuehlenhoff) Sounds good, I'll rebuild libtheora as used on trusty for jessie-wikimedia and make a backport of ffmpeg2theora 0.30 for jessie. 
[09:09:04] PROBLEM - dhclient process on mw1278 is CRITICAL: Timeout while attempting connection [09:09:05] PROBLEM - mediawiki-installation DSH group on mw1278 is CRITICAL: Host mw1278 is not in mediawiki-installation dsh group [09:09:35] PROBLEM - nutcracker port on mw1278 is CRITICAL: Timeout while attempting connection [09:09:55] PROBLEM - nutcracker process on mw1278 is CRITICAL: Timeout while attempting connection [09:12:13] Operations, ContentTranslation-Deployments, ContentTranslation-cxserver, MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2384764 (KartikMistry) [09:13:37] (PS1) Gehel: Interactive team would like to be notified of issues with Maps. [puppet] - https://gerrit.wikimedia.org/r/294676 (https://phabricator.wikimedia.org/T137869) [09:14:45] (PS1) Filippo Giunchedi: install_server: rename ms-be partman config to reflect reality [puppet] - https://gerrit.wikimedia.org/r/294677 [09:15:23] (PS1) Filippo Giunchedi: swift: redirect syslog from all daemons to separate file [puppet] - https://gerrit.wikimedia.org/r/294678 (https://phabricator.wikimedia.org/T137397) [09:15:53] (CR) Filippo Giunchedi: [C: 2 V: 2] install_server: rename ms-be partman config to reflect reality [puppet] - https://gerrit.wikimedia.org/r/294677 (owner: Filippo Giunchedi) [09:16:50] Operations, ops-codfw, media-storage: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T136630#2384770 (fgiunchedi) @papaul partman recipe would be the same as other `ms-be` systems from HP, namely `ms-be-hp.cfg`, thanks! also just to confirm, the 2x200GB SAS is SSD not spinning d... [09:17:09] (CR) Giuseppe Lavagetto: [C: 1] Interactive team would like to be notified of issues with Maps. 
[puppet] - https://gerrit.wikimedia.org/r/294676 (https://phabricator.wikimedia.org/T137869) (owner: Gehel) [09:18:08] !log hashar@tin Synchronized php-1.28.0-wmf.6/extensions/MobileFrontend: MobileFrontend RL registration issue preventing Special:Nearby from working properly T137919 (duration: 00m 36s) [09:18:09] T137919: Uncaught Error: Module "mediawiki.router" is not loaded (on Special:Nearby) - https://phabricator.wikimedia.org/T137919 [09:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:21:33] (PS3) Filippo Giunchedi: DNS: Add mgmt DNS entries for ms-be2022 to ms-be2027 [dns] - https://gerrit.wikimedia.org/r/294543 (https://phabricator.wikimedia.org/T136630) (owner: Papaul) [09:21:55] (CR) Filippo Giunchedi: [C: 1] "LGTM, I've reworded the commit message" [dns] - https://gerrit.wikimedia.org/r/294543 (https://phabricator.wikimedia.org/T136630) (owner: Papaul) [09:23:04] (PS1) Gehel: Fixed typo (cron instead of from). [puppet] - https://gerrit.wikimedia.org/r/294679 [09:24:17] (CR) Gehel: "I have not seen much use of resource default in our code base. I wonder if there is a reason for that (appart from the awful and non obvio" [puppet] - https://gerrit.wikimedia.org/r/294679 (owner: Gehel) [09:24:23] (CR) jenkins-bot: [V: -1] Fixed typo (cron instead of from). [puppet] - https://gerrit.wikimedia.org/r/294679 (owner: Gehel) [09:25:32] (PS2) Gehel: Fixed typo (cron instead of from). [puppet] - https://gerrit.wikimedia.org/r/294679 [09:26:37] (PS2) Gehel: Interactive team would like to be notified of issues with Maps. [puppet] - https://gerrit.wikimedia.org/r/294676 (https://phabricator.wikimedia.org/T137869) [09:26:55] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [09:28:28] (CR) Gehel: [C: 2] Interactive team would like to be notified of issues with Maps. 
[puppet] - 10https://gerrit.wikimedia.org/r/294676 (https://phabricator.wikimedia.org/T137869) (owner: 10Gehel) [09:41:05] RECOVERY - Apache HTTP on mw1278 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.010 second response time [09:44:05] 06Operations: "puppet fail" flapping on restbase1007 - https://phabricator.wikimedia.org/T137952#2384810 (10fgiunchedi) [09:44:26] RECOVERY - MD RAID on mw1278 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [09:44:45] PROBLEM - NTP on mw1278 is CRITICAL: NTP CRITICAL: Offset unknown [09:44:54] RECOVERY - salt-minion processes on mw1278 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:45:25] RECOVERY - configured eth on mw1278 is OK: OK - interfaces up [09:45:33] (03CR) 10Muehlenhoff: [C: 04-1] "I'm not really fond of that, the --output option in firejail uses it's own homegrown log rotation, let's rather redirect stdout/stderr in " [puppet] - 10https://gerrit.wikimedia.org/r/294499 (owner: 10Mobrovac) [09:45:35] RECOVERY - dhclient process on mw1278 is OK: PROCS OK: 0 processes with command name dhclient [09:45:44] RECOVERY - Check size of conntrack table on mw1278 is OK: OK: nf_conntrack is 0 % full [09:46:05] RECOVERY - Disk space on mw1278 is OK: DISK OK [09:46:15] RECOVERY - nutcracker port on mw1278 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [09:46:26] RECOVERY - nutcracker process on mw1278 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [09:47:55] RECOVERY - DPKG on mw1278 is OK: All packages OK [09:48:21] !log restbase deploy start of ebeaa46 [09:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:49:14] PROBLEM - HHVM rendering on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:49:15] PROBLEM - Apache HTTP on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:51:35] PROBLEM - SSH on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
[09:51:45] PROBLEM - configured eth on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:52:04] PROBLEM - nutcracker port on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:52:15] PROBLEM - Check size of conntrack table on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:52:25] PROBLEM - puppet last run on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:52:35] PROBLEM - DPKG on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:54:18] I can't even connect to the serial console of mw1143, I'm only getting "Disconnected from UNKNOWN port 0", can someone please doublecheck whether it also fails? [09:54:29] <_joe_> moritzm: I'll try [09:55:38] <_joe_> I got in [09:55:55] PROBLEM - dhclient process on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:56:17] <_joe_> !log powercycling mw1143, unresponsive on ssh, console [09:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:56:25] PROBLEM - nutcracker process on mw1143 is CRITICAL: Timeout while attempting connection [09:57:04] PROBLEM - Disk space on mw1143 is CRITICAL: Timeout while attempting connection [09:57:24] PROBLEM - salt-minion processes on mw1143 is CRITICAL: Timeout while attempting connection [09:57:25] PROBLEM - HHVM processes on mw1143 is CRITICAL: Timeout while attempting connection [09:57:25] PROBLEM - puppet last run on mw1278 is CRITICAL: CRITICAL: Puppet has 8 failures [09:58:35] RECOVERY - nutcracker process on mw1143 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [09:58:35] (03CR) 10Mobrovac: "I put up this patch mostly because I wasn't sure (and couldn't confirm) that firejail is actually letting stdout and stderr through to sys" [puppet] - 10https://gerrit.wikimedia.org/r/294499 (owner: 10Mobrovac) [09:58:44] RECOVERY - Check size of conntrack table on mw1143 is OK: OK: nf_conntrack is 0 % full [09:58:54] 
RECOVERY - puppet last run on mw1143 is OK: OK: Puppet is currently enabled, last run 42 minutes ago with 0 failures [09:58:56] RECOVERY - DPKG on mw1143 is OK: All packages OK [09:59:14] RECOVERY - Disk space on mw1143 is OK: DISK OK [09:59:23] !log restbase deploy end of ebeaa46 [09:59:25] RECOVERY - salt-minion processes on mw1143 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:59:25] RECOVERY - HHVM processes on mw1143 is OK: PROCS OK: 6 processes with command name hhvm [09:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:59:34] ahh, ssh -vv shows that my SSH client apparently negotiated a DH key exchange which is too modern for whatever they have installed there, so fails :-/ [09:59:45] RECOVERY - NTP on mw1278 is OK: NTP OK: Offset -0.004133582115 secs [10:00:04] RECOVERY - HHVM rendering on mw1143 is OK: HTTP OK: HTTP/1.1 200 OK - 66413 bytes in 7.482 second response time [10:00:05] RECOVERY - Apache HTTP on mw1143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.461 second response time [10:00:15] RECOVERY - SSH on mw1143 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0) [10:00:15] RECOVERY - dhclient process on mw1143 is OK: PROCS OK: 0 processes with command name dhclient [10:00:24] RECOVERY - configured eth on mw1143 is OK: OK - interfaces up [10:00:45] RECOVERY - nutcracker port on mw1143 is OK: TCP OK - 0.000 second response time on port 11212 [10:03:01] <_joe_> moritzm: I'm on osx right now [10:05:49] (CR) Hashar: "recheck" [puppet] - https://gerrit.wikimedia.org/r/294669 (owner: Mobrovac) [10:07:05] (PS1) Alexandros Kosiaris: servermon: Remove old urls.py file [puppet] - https://gerrit.wikimedia.org/r/294685 [10:07:30] (CR) Hashar: "recheck" [puppet] - https://gerrit.wikimedia.org/r/294653 (owner: 20after4) [10:07:38] (CR) Hashar: "recheck" [puppet] - https://gerrit.wikimedia.org/r/293105 (owner: Gehel)
[10:08:13] (CR) Hashar: "Issue was due to a new version of the python linter flake8 that got released yesterday. Unrelated to this patch and now fixed (by pinning" [puppet] - https://gerrit.wikimedia.org/r/293105 (owner: Gehel) [10:08:14] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [10:08:25] RECOVERY - puppet last run on mw1278 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [10:08:28] (CR) Hashar: "Issue was due to a new version of the python linter flake8 that got released yesterday. Unrelated to this patch and now fixed (by pinning " [puppet] - https://gerrit.wikimedia.org/r/294653 (owner: 20after4) [10:08:30] (CR) Alexandros Kosiaris: [C: 2] servermon: Remove old urls.py file [puppet] - https://gerrit.wikimedia.org/r/294685 (owner: Alexandros Kosiaris) [10:09:26] PROBLEM - Apache HTTP on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:09:54] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [10:10:05] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5220557 keys - replication_delay is 0 [10:11:25] RECOVERY - Apache HTTP on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.057 second response time [10:14:49] !log scb1001 disabling puppet for a while to manually test changeprop with transclusion rules [10:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:24:16] Operations, Patch-For-Review: revisit swift (sys)logging - https://phabricator.wikimedia.org/T137397#2384891 (fgiunchedi) [10:28:23] Operations, media-storage: swift backend machines load spike: cause and remediation - https://phabricator.wikimedia.org/T84385#2384895 (fgiunchedi) [10:28:46] Operations, media-storage, Patch-For-Review: swift
upgrade plans: jessie and swift 2.x - https://phabricator.wikimedia.org/T117972#2384899 (fgiunchedi) [10:28:48] Operations, media-storage: swift backend machines load spike: cause and remediation - https://phabricator.wikimedia.org/T84385#926618 (fgiunchedi) [10:29:20] (CR) Muehlenhoff: "firejail logs to stdout/stderr, but the systemd file still needs to be updated to use StandardError/StandardOutput" [puppet] - https://gerrit.wikimedia.org/r/294499 (owner: Mobrovac) [10:29:45] Operations, media-storage: swift backend machines load spike: cause and remediation - https://phabricator.wikimedia.org/T84385#926618 (fgiunchedi) what's left here is xfs bug(s) sending load average through the roof, blocking with {T117972} to be checked again once we're running on linux 4.x [10:31:49] Operations, Discovery, Maps, Maps-Sprint, Patch-For-Review: Configure monitoring / alerting of Postgresql / redis / ... cluster for maps - https://phabricator.wikimedia.org/T135647#2384904 (Gehel) [10:38:03] (PS1) Filippo Giunchedi: swift: enable statsd for all daemons [puppet] - https://gerrit.wikimedia.org/r/294691 [10:40:25] Operations: investigate why swift container server takes so much cpu - https://phabricator.wikimedia.org/T82850#2384912 (fgiunchedi) [10:41:27] Operations: investigate why swift container server takes so much cpu - https://phabricator.wikimedia.org/T82850#906177 (fgiunchedi) Open>Invalid not sure there's anything we can do on this old bug, afaik we haven't experienced cpu problems with container server on the current swift fleet, resolving [10:41:54] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [10:41:55] Operations, ops-codfw, media-storage: rack/setup/deploy ms-be202[2-7] -
https://phabricator.wikimedia.org/T137439#2384919 (fgiunchedi) Open>Invalid duplicate of {T136630} [10:43:11] Operations, hardware-requests: additional graphite machines request, 1x per DC - https://phabricator.wikimedia.org/T126253#2384926 (fgiunchedi) [10:43:13] Operations, ops-codfw, Patch-For-Review: rack/setup new host graphite2002 - https://phabricator.wikimedia.org/T130938#2384924 (fgiunchedi) Open>Resolved machine is in service, resolving [10:43:31] (PS2) BBlack: force HTTPS when x-forwarded-for header is set [puppet] - https://gerrit.wikimedia.org/r/294653 (owner: 20after4) [10:44:50] !log depooling mw1154 for kernel update/reboot [10:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:44:54] (CR) BBlack: [C: 2] force HTTPS when x-forwarded-for header is set [puppet] - https://gerrit.wikimedia.org/r/294653 (owner: 20after4) [10:51:34] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [10:54:46] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [10:56:05] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [10:59:05] PROBLEM - puppet last run on db2067 is CRITICAL: CRITICAL: puppet fail [11:01:28] (CR) Alexandros Kosiaris: "The reason is that apart from the hideous syntax, it also can mess up defaults across multiple scopes. https://docs.puppet.com/puppet/3.5/" [puppet] - https://gerrit.wikimedia.org/r/294679 (owner: Gehel) [11:01:33] (CR) Alexandros Kosiaris: [C: -1] Fixed typo (cron instead of from). [puppet] - https://gerrit.wikimedia.org/r/294679 (owner: Gehel) [11:06:11] (CR) Alexandros Kosiaris: [C: 1] "+1 from me.
We should start looking at integrating systemd's security directives (http://0pointer.de/blog/projects/security.html) to compl" [puppet] - https://gerrit.wikimedia.org/r/294483 (https://phabricator.wikimedia.org/T121756) (owner: Muehlenhoff) [11:06:15] PROBLEM - puppet last run on mw2175 is CRITICAL: CRITICAL: puppet fail [11:17:25] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:18:37] (PS1) Giuseppe Lavagetto: salt: add wmfpuppet module [puppet] - https://gerrit.wikimedia.org/r/294694 [11:19:54] (CR) jenkins-bot: [V: -1] salt: add wmfpuppet module [puppet] - https://gerrit.wikimedia.org/r/294694 (owner: Giuseppe Lavagetto) [11:22:46] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [11:24:55] RECOVERY - puppet last run on db2067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:27:05] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [11:34:15] RECOVERY - puppet last run on mw2175 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [11:37:52] Blocked-on-Operations, Operations, DBA, Labs, Labs-Infrastructure: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2286195 (jcrespo) [11:46:14] (CR) Gehel: "I'm not a big fan of 'create_resources()'. It breaks what little compile time checks there are in Puppet. Let's wait for Puppet 4 and "Per" [puppet] - https://gerrit.wikimedia.org/r/294679 (owner: Gehel) [11:48:31] (PS3) Gehel: Fixed typo (cron instead of from).
[puppet] - https://gerrit.wikimedia.org/r/294679 [11:51:46] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [11:53:44] ores.wikimedia.org is down [11:53:50] it seems it got overloaded [11:53:57] https://ores.wikimedia.org/v2/scores/enwiki/?models=damaging&revids=724030089 [11:54:08] 1- I haven't done anything [11:54:12] 2- let's fix it [11:55:01] https://grafana.wikimedia.org/dashboard/db/ores [11:55:12] it has been done for about 8 hours [11:55:16] *down [11:56:02] akosiaris: ^ [11:58:54] wikidatawiki and fawiki are depending on this, https://logstash.wikimedia.org/#/dashboard/elasticsearch/default [12:03:06] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [12:06:18] Amir1: wow, why on earth nothing alerted us of that ... [12:06:35] :((( [12:06:56] let's fix it and then fix the icinga [12:07:38] WARNING:ores.score_processors.celery -- Queue size is too full 229 [12:07:39] ? [12:07:46] RECOVERY - Apache HTTP on mw1137 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 3.253 second response time [12:07:52] all that precaching ?
[12:08:08] It can be [12:08:21] one of my friends was running some stats on it too [12:08:43] hmm, so icinga says ores is returning an OK [12:08:55] he doesn't know about performance throttling :( I think it was a combination of them [12:08:56] but that is because it is only checking one url [12:09:52] akosiaris: I think a restart would bring everything back up, probably, it will get huge requests from extensions and precaching [12:10:11] but overall it should handle that much [12:10:25] RECOVERY - HHVM rendering on mw1137 is OK: HTTP OK: HTTP/1.1 200 OK - 66520 bytes in 0.170 second response time [12:10:42] !log restarted hhvm on mw1137, got stuck [12:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:11:36] Amir1: https://phabricator.wikimedia.org/T137804 should fix the icinga issue [12:11:43] probably we need to disable precaching for wikidata [12:12:12] Amir1: yes [12:12:19] at least that much is obvious from the logs [12:12:48] Operations, ORES, Revision-Scoring-As-A-Service: ORES should advertise swagger specs under /?spec - https://phabricator.wikimedia.org/T137804#2385011 (Ladsgroup) [12:13:35] https://phabricator.wikimedia.org/tag/revision-scoring-as-a-service/ is different from https://phabricator.wikimedia.org/tag/ores/ ? [12:14:00] well, duh, but I thought just adding ORES to the task would be enough [12:14:17] revision...
is an umbrella project for all of our products, the extension, wikilabels [12:14:44] ah my choice makes sense then [12:14:47] we do it to keep track of what we did (we treat the board like the Trello board) [12:16:12] Amir1: definitely stop the wikidatawiki precaching [12:16:19] lemme know when it's done [12:16:38] doing it is rather easy [12:16:39] ACKNOWLEDGEMENT - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors Gehel Probably related to a commit from gehel, checking right now [12:17:55] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:18:47] (CR) Alexandros Kosiaris: "not sure I follow what you mean by "It breaks what little compile time checks there are in Puppet"" [puppet] - https://gerrit.wikimedia.org/r/294679 (owner: Gehel) [12:18:56] RECOVERY - puppet last run on mw1137 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:20:26] (CR) Gehel: "For example, if I have a typo in a param name, puppet will complain at compile time and tell me I'm trying to set a param that does not ex" [puppet] - https://gerrit.wikimedia.org/r/294679 (owner: Gehel) [12:20:37] akosiaris: https://gerrit.wikimedia.org/r/#/c/294699/ [12:20:57] please review and then we try to deploy [12:21:18] in the wmflabs setup it didn't have the precaching issue [12:23:33] (PS1) Gehel: Fix team name for icinga alert [puppet] - https://gerrit.wikimedia.org/r/294700 [12:26:13] (CR) Gehel: [C: 2] Fix team name for icinga alert [puppet] - https://gerrit.wikimedia.org/r/294700 (owner: Gehel) [12:26:55] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [12:29:04] (CR) Alexandros Kosiaris: "I don't think that is true.
For example" [puppet] - https://gerrit.wikimedia.org/r/294679 (owner: Gehel) [12:30:26] Amir1: I am having problems understanding that change [12:30:44] what does that do ? disable the ?precache=true parameter ? [12:31:18] or something else ? [12:31:36] no, it disables it in the daemon halfak is running [12:31:47] but that needs to be restarted manually [12:31:52] facepalm [12:32:07] so, there is a precaching daemon running somewhere [12:32:12] !log manually restarted celery-ores-worker in scb1002 [12:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:32:20] !log installing apache2 trusty update on graphite1001 [12:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:32:25] akosiaris: yup [12:32:43] and just shares the same config as ores [12:32:49] as the rest of ores, anyway [12:33:10] akosiaris: yeah [12:33:27] !log manually restarted celery-ores-worker in scb1001 [12:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:33:44] ores is back online for now: https://ores.wikimedia.org/v2/scores/enwiki/?models=damaging&revids=724030089 [12:34:24] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [12:34:26] Amir1: akosiaris: fyi, i've got puppet disabled on scb1001 [12:34:29] (CR) Gehel: "I stand corrected. I remember having that issue multiple times, but that might have been older Puppet version (or my memory going bad). So" [puppet] - https://gerrit.wikimedia.org/r/294679 (owner: Gehel) [12:34:51] akosiaris: is there a way to refuse requests that are coming too much from an outside source [12:35:12] especially I'm talking about my friends who tried to run some stats in a very bad way [12:37:08] Amir1: yes, there exists a possibility for rate limiting [12:38:20] any docs on how to implement it or only Ops can do it? [12:38:45] (CR) Zfilipin: "What is the status of this patch?
Are you working on it? Should it be abandoned?" [puppet] - https://gerrit.wikimedia.org/r/178810 (https://phabricator.wikimedia.org/T78342) (owner: Hashar) [12:40:17] Amir1: no, it exists on the varnish level and is based on token bucket filters. IIRC it was disabled due to some problems it created. I 'll ping bblack to see if it makes sense to re-enable it [12:40:25] (CR) Zfilipin: "What is the status of this patch? Are you working on it? Should it be abandoned?" [puppet] - https://gerrit.wikimedia.org/r/276733 (owner: Hashar) [12:43:01] (CR) Alexandros Kosiaris: "I 'll grant you that, but in practice, whenever you go for create_resources, you 've already put the data in a data structure (a hash). Th" [puppet] - https://gerrit.wikimedia.org/r/294679 (owner: Gehel) [12:43:05] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:43:05] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:45:03] jynus: if you're around. I finally got around the performance issue. https://gerrit.wikimedia.org/r/#/c/294693/2 [12:45:43] fortunately no need for schema changes for now ( it would be great if we can have one later) [12:45:51] nice, did you test it on labs/production [12:46:07] e.g. running EXPLAIN on the resulting query [12:46:09] yup, in labs [12:46:20] in beta cluster [12:46:22] (CR) Gehel: "Interesting conversation! I think that's another weak point of Puppet.
The strict boundary between code and config is puppet code vs hiera" [puppet] - https://gerrit.wikimedia.org/r/294679 (owner: Gehel) [12:46:26] got much faster [12:46:38] give it a try on production too, to be sure [12:46:47] of course [12:46:59] I want to get it through in SWAT [12:47:07] query plans tend to be very different when there is a lot of data [12:47:15] I can do it if you give me the resulting query [12:47:20] just update the ticket [12:48:08] (PS1) BBlack: r::c::ssl::unified: set explicit server name www.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/294703 (https://phabricator.wikimedia.org/T107236) [12:48:10] (PS1) BBlack: r::c::ssl: use 3127 for upstream_port [puppet] - https://gerrit.wikimedia.org/r/294704 (https://phabricator.wikimedia.org/T107236) [12:48:12] (PS1) BBlack: vhtcpd: use port 3127 for fe [puppet] - https://gerrit.wikimedia.org/r/294705 (https://phabricator.wikimedia.org/T107236) [12:48:14] (PS1) BBlack: tlsproxy: redirect-only service on 8080 [puppet] - https://gerrit.wikimedia.org/r/294706 (https://phabricator.wikimedia.org/T107236) [12:49:25] apergos, I am a bit lost on T29112 [12:49:26] T29112: Select of revisions for stub history files does not explicitly order revisions - https://phabricator.wikimedia.org/T29112 [12:49:45] too much info [12:49:47] jynus: there's a select without an order-by [12:49:55] jynus: it would be great, just instead of LEFT JOIN on ores_classification, run INNER JOIN [12:49:58] (CR) Alexandros Kosiaris: [C: 1] Add the ability to configure contact group for check of services. [puppet] - https://gerrit.wikimedia.org/r/294507 (https://phabricator.wikimedia.org/T137869) (owner: Gehel) [12:50:04] (PS4) Gehel: Fixed typo (cron instead of from). [puppet] - https://gerrit.wikimedia.org/r/294679 [12:50:13] adding the order by creates a filesort?
[12:50:17] this used to not be an issue because sort of by luck most entries came back in prev id order within pages [12:50:21] *rev id order [12:50:23] (CR) jenkins-bot: [V: -1] r::c::ssl: use 3127 for upstream_port [puppet] - https://gerrit.wikimedia.org/r/294704 (https://phabricator.wikimedia.org/T107236) (owner: BBlack) [12:50:25] now that's not true at all [12:50:38] sometimes that can be avoided with the right index, but only sometimes [12:50:41] so I need to add that explicit ordering... [12:50:51] (CR) jenkins-bot: [V: -1] Fixed typo (cron instead of from). [puppet] - https://gerrit.wikimedia.org/r/294679 (owner: Gehel) [12:51:08] but if we do it on 500k pages that's liable to be waaaaay too many revs to do in memory [12:51:19] that could be some millions of revs [12:51:25] (CR) jenkins-bot: [V: -1] tlsproxy: redirect-only service on 8080 [puppet] - https://gerrit.wikimedia.org/r/294706 (https://phabricator.wikimedia.org/T107236) (owner: BBlack) [12:51:27] Is "SELECT * FROM page INNER JOIN revision ON ((page_id=rev_page)) WHERE page_id >= 1157 AND page_id < 1158 ORDER BY page_id ASC, revision.rev_id ASC;" the canonical example? [12:51:34] so I can play? [12:51:39] no, it's not at all [12:51:42] and that's the problem [12:51:46] oh [12:52:11] is the one in the description the original one [12:52:12] ? [12:52:31] you can send me code also, if that is easier [12:52:36] page id ranges from say 1 to 5000 [12:52:49] (PS5) Gehel: Fixed typo (cron instead of from).
[puppet] - https://gerrit.wikimedia.org/r/294679 [12:52:57] but really [12:53:00] (that's the output) [12:53:14] the stubs it ranges from page 1 to probably 500k for the first query [12:53:18] I just need somewhere where to start, and then I can follow with options [12:53:24] then 500001 to 100k for the next and so on [12:53:29] start with those [12:53:32] ok, that is a good idea [12:53:34] you'll see what I mean right away [12:53:44] probably the range + the orderby breaks the performance [12:53:50] which is the typical case [12:54:06] indeed [12:54:07] let me do some tests and I will go back with what I find [12:54:08] (CR) jenkins-bot: [V: -1] Fixed typo (cron instead of from). [puppet] - https://gerrit.wikimedia.org/r/294679 (owner: Gehel) [12:54:15] ok, thanks, if I can help please let me know [12:54:53] what I usually do is give the results, and send 1 or a couple of recommendations, and then you can take it from there [12:55:24] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 10.966 second response time [12:55:45] one last question, I see you doing tests with elwiktionary and elwiki [12:55:46] (PS6) Gehel: Fixed typo (cron instead of from). [puppet] - https://gerrit.wikimedia.org/r/294679 [12:55:58] but I suppose it will apply to all wikis, right? [12:55:58] Operations, ORES, Revision-Scoring-As-A-Service: ORES should advertise swagger specs under /?spec - https://phabricator.wikimedia.org/T137804#2385088 (Ladsgroup) p:Triage>Unbreak!
[12:57:47] I see now springle's comment, which points me in the right direction [12:57:51] !log rebalancing shards on elasticsearch eqiad cluster [12:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:01:59] jynus: please tell me once you tested it, so I pursue this direction [13:02:15] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [13:02:57] Amir1, I will have a look when I have the time, I am with aperg*s issue right now [13:03:09] if it is on the ticket, it will not be forgotten [13:03:24] (PS2) Gehel: Add the ability to configure contact group for check of services. [puppet] - https://gerrit.wikimedia.org/r/294507 (https://phabricator.wikimedia.org/T137869) [13:03:40] okay sure [13:04:42] !log scb1001 enabled puppet back [13:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:10:26] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:10:40] (CR) Gehel: [C: 2] Add the ability to configure contact group for check of services. [puppet] - https://gerrit.wikimedia.org/r/294507 (https://phabricator.wikimedia.org/T137869) (owner: Gehel) [13:10:54] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 9.624 second response time [13:11:12] apergos, one last question, what are some "acceptable" and "unacceptable" times per X rows, to compare with the result I get?
[13:11:38] I don't know those numbers [13:11:52] if you do it without order by (as the code now is) that's "acceptable" I guess [13:11:55] (I do not need an exact time, just "X takes half an hour when before it took a few minutes") [13:12:22] what we have now in the dumps is that instead of getting 95% of revision content from the old dumps on disk we ask the db [13:12:35] that's a side effect of this missing explicit ordering [13:12:46] so somehow I need to add this explicit ordering on there without doing the server harm [13:12:50] that's where I need your help [13:13:00] ok, I will give you the number I get, and you can decide if that is ok [13:13:06] on the ticket [13:13:09] ok [13:15:35] (PS2) BBlack: r::c::ssl::unified: set explicit server name www.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/294703 (https://phabricator.wikimedia.org/T107236) [13:15:37] (PS2) BBlack: tlsproxy: redirect-only service on 8080 [puppet] - https://gerrit.wikimedia.org/r/294706 (https://phabricator.wikimedia.org/T107236) [13:15:39] (PS2) BBlack: r::c::ssl: use 3127 for upstream_port [puppet] - https://gerrit.wikimedia.org/r/294704 (https://phabricator.wikimedia.org/T107236) [13:15:41] (PS2) BBlack: vhtcpd: use port 3127 for fe [puppet] - https://gerrit.wikimedia.org/r/294705 (https://phabricator.wikimedia.org/T107236) [13:16:24] (PS2) Giuseppe Lavagetto: salt: add wmfpuppet module [puppet] - https://gerrit.wikimedia.org/r/294694 [13:16:55] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [13:17:36] (CR) BBlack: "It's a little bit stalled while we try to figure out the long-term stuff on how to integrate CI with VCL tests better, but probably should" [puppet] - https://gerrit.wikimedia.org/r/276733 (owner: Hashar) [13:18:15] PROBLEM - puppet last run on ganeti2001 is CRITICAL: CRITICAL: puppet fail [13:19:41] (CR) jenkins-bot: [V: -1] salt: add wmfpuppet module [puppet] -
https://gerrit.wikimedia.org/r/294694 (owner: Giuseppe Lavagetto) [13:23:10] <_joe_> I hate you pep8 [13:24:07] _joe_: don't worry, pep8 hates you too [13:26:05] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [13:27:31] (PS3) Giuseppe Lavagetto: salt: add wmfpuppet module [puppet] - https://gerrit.wikimedia.org/r/294694 [13:34:30] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [13:39:03] Operations, Discovery, Services, Maps-Sprint, Patch-For-Review: Allow configuration of contact groups for monitoring of services - https://phabricator.wikimedia.org/T137891#2385214 (Gehel) Open>Resolved [13:43:54] (CR) BBlack: [C: 1] "+1 for usefulness of enable/disable. Can we push this as an upstream patch to salt too?" [puppet] - https://gerrit.wikimedia.org/r/294694 (owner: Giuseppe Lavagetto) [13:44:02] RECOVERY - puppet last run on ganeti2001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [13:50:52] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures [13:51:18] (CR) BBlack: [C: 2] r::c::ssl::unified: set explicit server name www.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/294703 (https://phabricator.wikimedia.org/T107236) (owner: BBlack) [13:51:45] (PS1) Alex Monk: Simplify the VE RB URL config some more, now that we no longer use wgServerName [mediawiki-config] - https://gerrit.wikimedia.org/r/294713 [13:57:40] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:01:00] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: puppet fail [14:04:04] Operations, ops-codfw, media-storage, Patch-For-Review: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T136630#2385267 (Papaul) @fgiunchedi Yes there are SSD [14:07:50] PROBLEM - Check
correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [14:10:36] (CR) JanZerebecki: "If my memory serves me right building on testing is fine in this case. We could upload the debs required for building from testing to jess" [debs/geckodriver] - https://gerrit.wikimedia.org/r/294293 (https://phabricator.wikimedia.org/T137797) (owner: Hashar) [14:11:43] Krenair: do you have a ref about the RB and wgServerName stuff somewhere? I don't get the "now that we no longer use" part, but sounds related to https://phabricator.wikimedia.org/T127370#2042629 ? [14:13:01] https://gerrit.wikimedia.org/r/#/c/291349/ [14:16:41] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:17:15] Blocked-on-Operations, Continuous-Integration-Infrastructure, Packaging, Gerrit-Migration, and 2 others: Package xhpast (libphutil) - https://phabricator.wikimedia.org/T137770#2385298 (fgiunchedi) thanks for the context @mmodell ! Since we're using the same package names as Debian we should ensur... [14:23:50] (PS1) Andrew Bogott: Add the instance tld (e.g. 'wmflabs') to designate and horizon config. [puppet] - https://gerrit.wikimedia.org/r/294716 (https://phabricator.wikimedia.org/T91990) [14:25:11] Blocked-on-Operations, Continuous-Integration-Infrastructure, Packaging, Gerrit-Migration, and 2 others: Package xhpast (libphutil) - https://phabricator.wikimedia.org/T137770#2385312 (mmodell) I'm not picky about the versioning and right now I can't think of anything that would be different depe...
[14:26:05] (CR) BBlack: [C: 2] r::c::ssl: use 3127 for upstream_port [puppet] - https://gerrit.wikimedia.org/r/294704 (https://phabricator.wikimedia.org/T107236) (owner: BBlack) [14:26:28] (CR) BBlack: [C: 2] vhtcpd: use port 3127 for fe [puppet] - https://gerrit.wikimedia.org/r/294705 (https://phabricator.wikimedia.org/T107236) (owner: BBlack) [14:26:51] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [14:28:34] (PS2) Andrew Bogott: Add the instance tld (e.g. 'wmflabs') to designate and horizon config. [puppet] - https://gerrit.wikimedia.org/r/294716 (https://phabricator.wikimedia.org/T91990) [14:29:20] Operations, OCG-General: ocg alarm ocg_job_status_queue 'flapping' - https://phabricator.wikimedia.org/T97524#2385375 (Aklapper) One year later: Still happening? Or obsolete / declined? [14:29:59] Operations, Traffic, Patch-For-Review: Switch port 80 to nginx on primary clusters - https://phabricator.wikimedia.org/T107236#2385376 (BBlack) [14:31:24] !log re-enabled and ran puppet agent --test on iridium. Everything appears to be normal. [14:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:32:41] (PS3) Andrew Bogott: Add the instance tld (e.g. 'wmflabs') to designate and horizon config. [puppet] - https://gerrit.wikimedia.org/r/294716 (https://phabricator.wikimedia.org/T91990) [14:33:54] Operations, Traffic, Wikimedia-Stream, HTTPS: stream.wikimedia.org doesn't redirect to HTTPS - https://phabricator.wikimedia.org/T137915#2385381 (BBlack) Note: based on simple test python and javascript clients, websocket client libraries tend to not support 301 redirects to HTTPS. So we'll prob... [14:34:32] (CR) Andrew Bogott: [C: 2] Add the instance tld (e.g. 'wmflabs') to designate and horizon config.
[puppet] - https://gerrit.wikimedia.org/r/294716 (https://phabricator.wikimedia.org/T91990) (owner: Andrew Bogott) [14:35:42] Operations, Mail, MediaWiki-Watchlist: Mails from MediaWiki seem to get (partially) lost - https://phabricator.wikimedia.org/T121105#2385389 (Aklapper) >>! In T121105#2046743, @Dzahn wrote: > Could i have another pair of eyes here please? I don't really see a pattern here Wondering if it's another c... [14:35:56] (CR) Jforrester: [C: 1] Simplify the VE RB URL config some more, now that we no longer use wgServerName [mediawiki-config] - https://gerrit.wikimedia.org/r/294713 (owner: Alex Monk) [14:36:28] PROBLEM - mediawiki-installation DSH group on mw2244 is CRITICAL: Host mw2244 is not in mediawiki-installation dsh group [14:36:28] PROBLEM - mediawiki-installation DSH group on mw2245 is CRITICAL: Host mw2245 is not in mediawiki-installation dsh group [14:36:28] PROBLEM - mediawiki-installation DSH group on mw2241 is CRITICAL: Host mw2241 is not in mediawiki-installation dsh group [14:36:28] PROBLEM - mediawiki-installation DSH group on mw2242 is CRITICAL: Host mw2242 is not in mediawiki-installation dsh group [14:36:59] Blocked-on-Operations, Continuous-Integration-Infrastructure, Packaging, Gerrit-Migration, and 2 others: Package xhpast (libphutil) - https://phabricator.wikimedia.org/T137770#2385401 (mmodell) @fgiunchedi can you handle tagging a version according to your 0~git-0wmf1 scheme then? Or should...
[14:37:18] PROBLEM - puppet last run on mw2241 is CRITICAL: CRITICAL: Puppet has 9 failures [14:37:18] PROBLEM - puppet last run on mw2242 is CRITICAL: CRITICAL: Puppet has 9 failures [14:39:28] PROBLEM - puppet last run on mw2244 is CRITICAL: CRITICAL: Puppet has 9 failures [14:41:14] (PS1) Andrew Bogott: Include eqiad/codfw in INSTANCE_TLD [puppet] - https://gerrit.wikimedia.org/r/294718 (https://phabricator.wikimedia.org/T91990) [14:41:38] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [14:42:43] (CR) Andrew Bogott: [C: 2] Include eqiad/codfw in INSTANCE_TLD [puppet] - https://gerrit.wikimedia.org/r/294718 (https://phabricator.wikimedia.org/T91990) (owner: Andrew Bogott) [14:43:18] twentyafterfour: yup I can change the version and tag, do you prefer a gerrit review or differential? [14:43:48] twentyafterfour: I have a couple of changes to debian/control to review too [14:43:49] PROBLEM - puppet last run on mw2245 is CRITICAL: CRITICAL: Puppet has 9 failures [14:44:29] PROBLEM - Apache HTTP on mw2241 is CRITICAL: Connection refused [14:47:41] Operations, Traffic, Community-Liaisons (Jul-Sep-2016): Help contact bot owners about the end of HTTP access to the API - https://phabricator.wikimedia.org/T136674#2385440 (BBlack) As discussed in email, now that we're past the first deadline date and we've been posting username lists on public wikis... 
[14:47:41] godog: either way, it's got an arcconfig but gerrit is fine if you prefer [14:49:39] Operations, Traffic, HTTPS, MW-1.27-release-notes, Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#2385447 (BBlack) Latest list of accounts still making insecure requests over the past ~24H: T136674#2385440 [14:49:54] Blocked-on-Operations, Continuous-Integration-Infrastructure, Packaging, Gerrit-Migration, and 2 others: Package xhpast (libphutil) - https://phabricator.wikimedia.org/T137770#2385451 (hashar) For Zuul package I am using `2.1.0-151-g30a433b-wmf2precise1` where: | 2.1.0 | Upstream tag | 151-g30a4... [14:50:01] Operations, Discovery, Maps, Maps-Sprint, Patch-For-Review: Review alerting scheme for Maps - https://phabricator.wikimedia.org/T137869#2385452 (Gehel) >>! In T137869#2382475, @Joe wrote: > We should add a service check for karthoterian using service_checker on the lvs IP, pretty much as we d... [14:51:00] PROBLEM - Apache HTTP on mw2244 is CRITICAL: Connection refused [14:51:44] twentyafterfour: ok! thanks, https://phabricator.wikimedia.org/D268 [14:52:31] (PS1) Andrew Bogott: Hm, I don't know what designateconfig['dhcp_domain'] is but what I want here is $::site [puppet] - https://gerrit.wikimedia.org/r/294719 [14:55:19] PROBLEM - Apache HTTP on mw2242 is CRITICAL: Connection refused [14:55:36] (CR) Andrew Bogott: [C: 2] "This time I actually tested with the puppet compiler, and this now does what I want." [puppet] - https://gerrit.wikimedia.org/r/294719 (owner: Andrew Bogott) [15:00:04] anomie, ostriches, thcipriani, marktraceur, and aude: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160616T1500). [15:00:04] Amir1: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[15:00:21] o/ [15:01:19] "the time has come" lol [15:02:28] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [15:02:48] I can SWAT today [15:03:10] Amir1: I am reviewing changes now, give me a moment :) [15:03:26] sure, you're awesome [15:03:37] PROBLEM - Host mw2246 is DOWN: PING CRITICAL - Packet loss = 100% [15:04:52] !log root@palladium conftool action : set/pooled=yes; selector: name=mw1262.eqiad.wmnet [15:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:05:26] 06Operations, 06Discovery, 06Maps, 13Patch-For-Review: "Is maps service alive?" check - https://phabricator.wikimedia.org/T137851#2385489 (10Gehel) >>! In T137869#2382475, @Joe wrote: > We should add a service check for karthoterian using service_checker on the lvs IP, pretty much as we do for other servic... [15:05:58] PROBLEM - DPKG on mw1291 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:06:30] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/294669 (owner: 10Mobrovac) [15:06:34] <_joe_> moritzm: ^^ that you? [15:07:04] yeah, fix is currently building [15:07:50] <_joe_> ok :P [15:11:28] PROBLEM - Apache HTTP on mw2245 is CRITICAL: Connection refused [15:15:34] PROBLEM - MariaDB disk space on labsdb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:17:08] RECOVERY - DPKG on mw1291 is OK: All packages OK [15:20:05] PROBLEM - MariaDB disk space on labsdb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:20:37] (PS1) Gehel: Team-interactive receives maps alerts [puppet] - https://gerrit.wikimedia.org/r/294723 (https://phabricator.wikimedia.org/T137869) [15:20:47] Operations: ffmpeg/libav on jessie video scalers - https://phabricator.wikimedia.org/T137886#2385556 (MoritzMuehlenhoff) Open>Resolved The following packages have been built for jessie-wikimedia and uploaded to apt.wikimedia.org: libtheora 1.2.0~git+20150816-1+wmf1 ffmpeg2theora 0.30-1+wmf1 chromap... [15:21:04] !log thcipriani@tin Synchronized php-1.28.0-wmf.6/extensions/ORES: SWAT: [[gerrit:294711|Skip when an edit is errored in PopulateDatabase.php]] (duration: 00m 30s) [15:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:21:08] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [15:21:25] ^ Amir1 patch1 sync'd, check if possible please [15:21:37] not possible :) [15:21:45] maintenance script [15:22:16] we actually passed it through SWAT before but that was for wmf.5 [15:22:35] ack, that's what I figured :) [15:23:44] !log rolling reboot of restbase1008 - restbase1011 for upgrade to Linux 4.4 [15:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:26:09] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [15:27:16] !log thcipriani@tin Synchronized php-1.28.0-wmf.6/extensions/ORES/includes/Hooks.php: SWAT: [[gerrit:294712|Performance boost on hidenondamaging]] (duration: 00m 35s) [15:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:27:20] ^ Amir1 check please [15:27:24] sure [15:28:05] request responded in 1.68 sec instead of 22 [15:28:12] jynus: ^ [15:29:20] (that is for the whole page, not the db query which definitely took very shortly) [15:30:18] thcipriani: i.e. 
it's working like a charm [15:30:19] thanks [15:30:32] Amir1: glad to hear, thanks for checking :) [15:30:50] Amir1, you may want to involve performance team for page loading tips [15:31:15] but it is ok for now [15:31:33] yeah [15:31:48] * Amir1 goes afk for dancing in WMDE office :D [15:35:55] RECOVERY - MariaDB disk space on labsdb1003 is OK: DISK OK [15:35:58] 06Operations, 06Discovery, 06Maps, 03Maps-Sprint, 13Patch-For-Review: Review alerting scheme for Maps - https://phabricator.wikimedia.org/T137869#2385605 (10Gehel) In term of production support, we seem to be good to go once https://gerrit.wikimedia.org/r/#/c/294723/ is merged. LVS will be paging. We ca... [15:37:22] !log deleted sqldata.s6 from labsdb1008 - space issues caused by queries creating temporary tables [15:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:42:15] (03PS1) 10Mobrovac: Change Prop: increase concurrency to 50 [puppet] - 10https://gerrit.wikimedia.org/r/294726 [15:45:28] RECOVERY - Host mw2246 is UP: PING OK - Packet loss = 0%, RTA = 37.18 ms [15:45:58] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:47:10] 07Blocked-on-Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 05Gerrit-Migration, and 2 others: Package xhpast (libphutil) - https://phabricator.wikimedia.org/T137770#2385618 (10mmodell) Unfortunately phabricator doesn't have any upstream version tags. 
[15:49:54] (CR) Alexandros Kosiaris: [C: -1] Team-interactive receives maps alerts (3 comments) [puppet] - https://gerrit.wikimedia.org/r/294723 (https://phabricator.wikimedia.org/T137869) (owner: Gehel) [15:49:58] PROBLEM - configured eth on mw2246 is CRITICAL: Connection refused by host [15:50:19] PROBLEM - Check size of conntrack table on mw2246 is CRITICAL: Connection refused by host [15:50:28] PROBLEM - dhclient process on mw2246 is CRITICAL: Connection refused by host [15:50:38] PROBLEM - DPKG on mw2246 is CRITICAL: Connection refused by host [15:50:57] PROBLEM - Disk space on mw2246 is CRITICAL: Connection refused by host [15:51:17] PROBLEM - nutcracker port on mw2246 is CRITICAL: Connection refused by host [15:51:18] PROBLEM - MD RAID on mw2246 is CRITICAL: Connection refused by host [15:51:28] PROBLEM - nutcracker process on mw2246 is CRITICAL: Connection refused by host [15:51:47] PROBLEM - puppet last run on mw2246 is CRITICAL: Connection refused by host [15:51:59] PROBLEM - salt-minion processes on mw2246 is CRITICAL: Connection refused by host [15:52:09] (CR) Gehel: Team-interactive receives maps alerts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/294723 (https://phabricator.wikimedia.org/T137869) (owner: Gehel) [15:53:51] (PS2) Gehel: Team-interactive receives maps alerts [puppet] - https://gerrit.wikimedia.org/r/294723 (https://phabricator.wikimedia.org/T137869) [15:54:18] PROBLEM - puppet last run on mw1136 is CRITICAL: CRITICAL: Puppet has 1 failures [15:59:05] (PS2) Mobrovac: Change Prop: increase concurrency to 50 [puppet] - https://gerrit.wikimedia.org/r/294726 (https://phabricator.wikimedia.org/T137902) [15:59:33] pretty big uptick in text 500-errors just recently.... 
[16:00:02] starts around 15:37 but doesn't hit its full stride until a few minutes ago [16:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160616T1600). Please do the needful. [16:00:04] tgr and mobrovac: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:24] (03PS2) 10RobH: adding user joewalsh to cluster access [puppet] - 10https://gerrit.wikimedia.org/r/294093 (https://phabricator.wikimedia.org/T137110) [16:00:35] (03CR) 10jenkins-bot: [V: 04-1] Change Prop: increase concurrency to 50 [puppet] - 10https://gerrit.wikimedia.org/r/294726 (https://phabricator.wikimedia.org/T137902) (owner: 10Mobrovac) [16:00:58] i changed nothing in the patch, only the commit msg, how is that possible? [16:01:36] jenkins likes to scold people [16:01:37] (03CR) 10Mobrovac: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/294726 (https://phabricator.wikimedia.org/T137902) (owner: 10Mobrovac) [16:01:53] apparently [16:01:53] _joe_ moritzm I can SWAT [16:02:02] a lot of the 500s are coming from RB apparently [16:02:03] mobrovac: "Gem::RemoteFetcher::UnknownHostError: no such name (https://rubygems.org/gems/rspec-mocks-3.4.1.gem)" [16:02:14] the joy of pulling unversioned stuff from the internet! 
[16:02:25] sigh [16:02:40] hashar: seems jake-jessie also broke [16:02:45] hashar: seems rake-jessie also broke [16:02:55] godog: ok [16:03:54] bblack: https://grafana-admin.wikimedia.org/dashboard/db/restbase?panelId=17&fullscreen shows 7 reqs/sec of 5xx [16:04:02] PROBLEM - mediawiki-installation DSH group on mw2246 is CRITICAL: Host mw2246 is not in mediawiki-installation dsh group [16:04:09] hmm mobile-sections [16:04:40] (03CR) 10Filippo Giunchedi: [C: 04-1] Handle invalid DB name in 'sql' shell script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/294496 (owner: 10Gergő Tisza) [16:04:43] tgr: you're up, ^ [16:04:56] o/ [16:05:03] moritzm: are you restarting cassandra? [16:05:21] mobrovac: time pattern fits what I see on cache_text [16:05:40] <_joe_> godog: thanks, my bandwidth is not getting any better [16:06:36] (03PS2) 10Gergő Tisza: Handle invalid DB name in 'sql' shell script [puppet] - 10https://gerrit.wikimedia.org/r/294496 [16:06:44] godog: ^ [16:06:45] _joe_: np, I was tempted to make a joke about wind the operator heh [16:06:51] moritzm: urandom: Error: Cannot achieve consistency level LOCAL_QUORUM [16:07:12] lots and lots of those in the logs [16:07:16] bblack: probably ^^^ [16:07:49] (03PS3) 10Filippo Giunchedi: Handle invalid DB name in 'sql' shell script [puppet] - 10https://gerrit.wikimedia.org/r/294496 (owner: 10Gergő Tisza) [16:07:50] mobrovac: should I just merge the changeprop change? 
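For readers following the "Cannot achieve consistency level LOCAL_QUORUM" errors above: Cassandra defines a quorum as a majority of replicas, floor(RF / 2) + 1, and LOCAL_QUORUM applies that count within the local datacenter. The sketch below only illustrates that arithmetic. The replication factor of 3 used in the usage note is an assumption for illustration; the log does not state the cluster's actual setting, and the function names are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a quorum: a strict majority of RF."""
    return replication_factor // 2 + 1


def can_achieve_local_quorum(replication_factor: int, live_local_replicas: int) -> bool:
    """True if enough replicas in the local datacenter are reachable.

    Both arguments are illustrative inputs; nothing here queries a real cluster.
    """
    return live_local_replicas >= quorum(replication_factor)
```

With an assumed replication factor of 3, one unreachable replica still leaves a quorum (2 of 3), while two unreachable replicas for the same token range do not. That is one way a coordinator with a stale view of node state could report this error even though other nodes see everyone as up.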
[16:07:56] (CR) Filippo Giunchedi: [C: 2 V: 2] Handle invalid DB name in 'sql' shell script [puppet] - https://gerrit.wikimedia.org/r/294496 (owner: Gergő Tisza) [16:08:03] paravoid: sure [16:08:07] (PS3) Faidon Liambotis: Change Prop: increase concurrency to 50 [puppet] - https://gerrit.wikimedia.org/r/294726 (https://phabricator.wikimedia.org/T137902) (owner: Mobrovac) [16:08:20] (CR) Faidon Liambotis: [C: 2 V: 2] Change Prop: increase concurrency to 50 [puppet] - https://gerrit.wikimedia.org/r/294726 (https://phabricator.wikimedia.org/T137902) (owner: Mobrovac) [16:08:33] done [16:08:37] thnx! [16:08:37] tgr: {{done}} [16:08:40] (CR) Alexandros Kosiaris: [C: -1] "2 minor comments, otherwise LGTM. Feel free to merge after fixing comments" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/294723 (https://phabricator.wikimedia.org/T137869) (owner: Gehel) [16:08:44] godog: thanks! [16:10:01] mobrovac: indirectly, by means of the restbase1008-1011 reboots, but I have waited between individual reboots (and only one at a time) [16:14:15] moritzm mobrovac urandom mhh only restbase1011-b reported down now though [16:16:23] Blocked-on-Operations, Operations, Monitoring, Services: Update restbase catchpoint metric - https://phabricator.wikimedia.org/T137181#2385714 (mark) a:mark [16:16:44] yeah, all others are in UN [16:17:40] these are still ongoing (cannot achieve quorum) [16:18:47] godog: moritzm: urandom: somebody trying to revive it? [16:18:51] odd, not from restbase1011's perspective [16:19:15] yeah I'll try to drain cassandra instances on 1011 [16:19:29] 1011 is still depooled BTW [16:19:48] rb1009-b 1014-a and 1014-b are DN [16:19:59] godog: ^ [16:20:12] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [16:20:21] mobrovac: from where? 
[16:20:28] only from 1011 [16:21:31] ok godog rb1011 is definitely the problem, it sees these as down, but all others think everybody is UN [16:21:52] mobrovac: yup, but looks like it has converged just now?! [16:22:51] hm interesting [16:22:53] indeed [16:24:54] mobrovac: also looks like quorum messages from RB are not there anymore? [16:25:12] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 96 seconds ago with 0 failures [16:25:36] (03PS3) 10Gehel: Team-interactive receives maps alerts [puppet] - 10https://gerrit.wikimedia.org/r/294723 (https://phabricator.wikimedia.org/T137869) [16:26:03] godog: can't load logstash the last 5 mins, so no idea [16:26:05] (03PS1) 10Giuseppe Lavagetto: scap: add new appservers [puppet] - 10https://gerrit.wikimedia.org/r/294735 [16:26:40] godog: ok, loaded, it looks stabilised now [16:26:54] bblack: confirm the 5xx rate is down now? [16:27:11] (03Abandoned) 10Gehel: Add interactive-team to default Icinga notification group for maps servers [puppet] - 10https://gerrit.wikimedia.org/r/294503 (https://phabricator.wikimedia.org/T137869) (owner: 10Gehel) [16:28:02] mobrovac: seems to be so far [16:28:16] mobrovac, godog: I'll repool 1011, then? [16:28:25] (03PS1) 10Jcrespo: Set all new slaves to medium weight (300) after warm up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294736 [16:28:29] moritzm: let's give it 5 mins to be sure [16:28:40] (03CR) 10Giuseppe Lavagetto: [C: 032] scap: add new appservers [puppet] - 10https://gerrit.wikimedia.org/r/294735 (owner: 10Giuseppe Lavagetto) [16:29:21] (03CR) 10Jcrespo: [C: 032] Set all new slaves to medium weight (300) after warm up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294736 (owner: 10Jcrespo) [16:30:04] ok [16:30:26] (03PS3) 10RobH: adding user joewalsh to cluster access [puppet] - 10https://gerrit.wikimedia.org/r/294093 (https://phabricator.wikimedia.org/T137110) [16:31:12] _joe_, should I wait 1 minute for scap? 
[16:31:38] moritzm: kk, feel free to repool it now [16:31:52] PROBLEM - NTP on mw2246 is CRITICAL: NTP CRITICAL: No response from NTP server [16:32:29] (03CR) 10RobH: [C: 032] "3 day wait has passed with no objections." [puppet] - 10https://gerrit.wikimedia.org/r/294093 (https://phabricator.wikimedia.org/T137110) (owner: 10RobH) [16:32:37] (03PS1) 10Giuseppe Lavagetto: conftool: add new jessie api appservers [puppet] - 10https://gerrit.wikimedia.org/r/294737 [16:32:46] moritzm: also I'd say more time between reboots, when "total hints" hits zero should be safe to proceed with the next one, e.g. in https://grafana.wikimedia.org/dashboard/db/restbase-cassandra-storage [16:33:05] <_joe_> jynus: yes please [16:33:11] <_joe_> sorry I was preparing the other change [16:33:16] _joe_, ping me when done [16:33:29] godog: +1 [16:34:30] we can test it with my change- if some fail it is not a huge deal [16:34:48] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for joewalsh - https://phabricator.wikimedia.org/T137110#2385766 (10RobH) 05stalled>03Resolved a:05RobH>03None @JoeWalsh: Your access received no objections, so I've merged it live. While it... 
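The advice above (reboot one node at a time and only move on once "total hints" has dropped to zero) can be sketched as a small gating loop. This is a minimal illustration, not the procedure actually used here: `reboot` and `total_hints` are hypothetical callables the operator would supply (for example, wrappers around whatever tooling and dashboard metric are in use), and the polling interval and limit are arbitrary assumptions.

```python
import time
from typing import Callable, Iterable, List


def rolling_reboot(
    hosts: Iterable[str],
    reboot: Callable[[str], None],      # hypothetical: reboots one host
    total_hints: Callable[[], int],     # hypothetical: reads the "total hints" metric
    poll_seconds: float = 30.0,         # assumed polling interval
    max_polls: int = 120,               # assumed upper bound on polls per host
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Reboot hosts one at a time, waiting for hints to drain between reboots."""
    done: List[str] = []
    for host in hosts:
        reboot(host)
        # Gate on the hints backlog before touching the next host.
        for _ in range(max_polls):
            if total_hints() == 0:
                break
            sleep(poll_seconds)
        else:
            raise TimeoutError(f"hints did not drain after rebooting {host}")
        done.append(host)
    return done
```

Passing `sleep` and the two callables in as parameters keeps the loop testable without a real cluster; raising on timeout stops the rollout instead of proceeding while replicas are still catching up.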
[16:34:58] <_joe_> jynus: green light [16:35:07] I've updated Service_restarts on wikitech to point to the dashboards [16:35:31] ok, let's do this- a simple change, we are just adding 15 new application servers and 15 new databases :-) [16:35:51] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [16:36:34] ACKNOWLEDGEMENT - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi check is flapping, see also https://phabricator.wikimedia.org/T137952 [16:36:55] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Set all new slaves to medium weight (300) after warm up (duration: 00m 25s) [16:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:37:04] he, he, _joe_ : Could not resolve hostname nw2241.codfw.wmnet: Name or service not known [16:37:39] godog: i have one RB patch for puppetswat, i think we're safe now to go with it [16:37:53] (03CR) 10Faidon Liambotis: "The kernel limits are per IP, so the number of connections LVS is handling isn't a (big) factor here. I'm assuming that for destunreach yo" [puppet] - 10https://gerrit.wikimedia.org/r/294467 (https://phabricator.wikimedia.org/T136939) (owner: 10Faidon Liambotis) [16:37:58] you fix it and pool while I check the dbs? 
[16:38:01] *pull [16:38:19] <_joe_> jynus: yes, grrrr [16:38:23] <_joe_> damn mac fonts [16:38:59] RECOVERY - mediawiki-installation DSH group on mw2245 is OK: OK [16:38:59] RECOVERY - mediawiki-installation DSH group on mw2242 is OK: OK [16:39:00] RECOVERY - mediawiki-installation DSH group on mw2244 is OK: OK [16:39:07] mobrovac: ack, LGTM [16:39:15] (03PS2) 10Filippo Giunchedi: RESTBase: Make sendind resource_change events optional [puppet] - 10https://gerrit.wikimedia.org/r/294669 (owner: 10Mobrovac) [16:39:30] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] RESTBase: Make sendind resource_change events optional [puppet] - 10https://gerrit.wikimedia.org/r/294669 (owner: 10Mobrovac) [16:39:57] godog: moritzm: hmm, the local_quorum problem seems to be back [16:40:30] (03PS1) 10Giuseppe Lavagetto: scap: s/nw2241/mw2241/ [puppet] - 10https://gerrit.wikimedia.org/r/294738 [16:41:00] (03PS2) 10Giuseppe Lavagetto: scap: s/nw2241/mw2241/ [puppet] - 10https://gerrit.wikimedia.org/r/294738 [16:41:11] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] scap: s/nw2241/mw2241/ [puppet] - 10https://gerrit.wikimedia.org/r/294738 (owner: 10Giuseppe Lavagetto) [16:41:24] (03PS1) 10EBernhardson: Dependent config for textcat AB test. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294739 [16:41:24] godog: did you run puppet for rb perhaps? [16:41:35] (03PS2) 10EBernhardson: search: Dependent config for textcat AB test. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294739 [16:42:00] mobrovac: no, but it shouldn't affect anything even if ran I think [16:42:26] godog: sure, sure, was asking to know whether i should do so :) [16:42:39] k, i'll run it [16:42:56] haven't repooled 1011 yet, shall I withhold? 
[16:42:56] mobrovac: ack, thanks, I'm looking at logstash btw but don't see the quorum messages so far [16:43:21] godog: there seems to have been just a burst of them @ :38 [16:43:31] calmed down again [16:43:43] moritzm: i think you're good to go [16:44:13] <_joe_> jynus: fixed :) [16:44:19] yay [16:44:19] <_joe_> I am going off now [16:45:14] k, repooled 1011 [16:45:17] thnx [16:50:45] 06Operations, 03Discovery-Search-Sprint: Followup on elastic1026 blowing up May 9, 21:43-22:14 UTC - https://phabricator.wikimedia.org/T134829#2385821 (10EBernhardson) Are we going to do anything else with this ticket? Should move it to done? [16:53:00] PROBLEM - Apache HTTP on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:53:39] PROBLEM - HHVM rendering on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:54:11] PROBLEM - nutcracker process on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:54:19] PROBLEM - SSH on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:54:30] PROBLEM - HHVM processes on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:54:30] PROBLEM - configured eth on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:55:20] 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 2 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2253774 (10RobH) It sounds like a database script, and therefore falls to @jcrespo? (I don't want to leave this sitting with no attention, so j... [16:55:49] PROBLEM - Check size of conntrack table on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:56:10] twentyafterfour: I'm failing repeatedly to 'arc land' the patch with an error about libext/Sprint submodule not found, any way you can land it too? 
[16:56:33] 06Operations, 10OCG-General: ocg alarm ocg_job_status_queue 'flapping' - https://phabricator.wikimedia.org/T97524#2385830 (10cscott) The threshold is pretty arbitrary, it just warns us maybe to have a look and see if anything is obviously wrong. We can bump the threshold higher if it seems that the warning is... [16:56:36] godog: sure thing [16:57:12] godog: btw: you can usually just merge and git push, phabricator figures it out [16:58:09] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:59:21] PROBLEM - dhclient process on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:59:40] PROBLEM - nutcracker port on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:00:04] yurik, gwicke, cscott, arlolra, and subbu: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160616T1700). Please do the needful. [17:00:09] PROBLEM - DPKG on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:00:10] PROBLEM - salt-minion processes on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:00:17] twentyafterfour: oh ok, thanks! didn't know that [17:00:22] no parsoid deploy [17:00:31] PROBLEM - Disk space on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:00:41] Blocked-on-Operations, Continuous-Integration-Infrastructure, Packaging, Gerrit-Migration, and 2 others: Package xhpast (libphutil) - https://phabricator.wikimedia.org/T137770#2385855 (mmodell) [17:07:08] godog: https://phabricator.wikimedia.org/rPHDEP9101e9e9e520170215c9c2260f1ce0667773c5c1 [17:09:01] PROBLEM - Apache HTTP on mw1117 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50408 bytes in 3.134 second response time [17:09:10] PROBLEM - HHVM rendering on mw1117 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.003 second response time [17:09:10] godog: autoclose didn't work but the patch is landed. [17:09:23] (I didn't have autoclose enabled on the debian branch) [17:11:19] RECOVERY - Apache HTTP on mw1117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.188 second response time [17:11:29] RECOVERY - HHVM rendering on mw1117 is OK: HTTP OK: HTTP/1.1 200 OK - 67286 bytes in 0.300 second response time [17:11:52] (CR) Tjones: [C: 1] "Everything looks good!" 
[mediawiki-config] - https://gerrit.wikimedia.org/r/294739 (owner: EBernhardson) [17:13:03] Operations, Traffic: Support brotli compression - https://phabricator.wikimedia.org/T137979#2385879 (BBlack) [17:13:15] Operations, Traffic: Support brotli compression - https://phabricator.wikimedia.org/T137979#2385892 (BBlack) p:Triage>Normal [17:13:26] Operations, Performance-Team, Traffic: Support brotli compression - https://phabricator.wikimedia.org/T137979#2385879 (BBlack) [17:13:59] Blocked-on-Operations, Continuous-Integration-Infrastructure, Packaging, Gerrit-Migration, and 2 others: Package xhpast (libphutil) - https://phabricator.wikimedia.org/T137770#2385894 (mmodell) [17:15:31] Operations, Performance-Team, Traffic: Support brotli compression - https://phabricator.wikimedia.org/T137979#2385900 (BBlack) [17:15:50] PROBLEM - puppet last run on mw1117 is CRITICAL: CRITICAL: Puppet has 8 failures [17:16:28] Operations, Performance-Team, Traffic: Support brotli compression - https://phabricator.wikimedia.org/T137979#2385879 (BBlack) [17:16:55] Blocked-on-Operations, Operations, Monitoring, Services: Update restbase catchpoint metric - https://phabricator.wikimedia.org/T137181#2385904 (mark) I've added a copy of the old test for these changes, suffixed "UNCACHED". I'll leave the old cached (but now fixed up for article content) test in... 
[17:17:12] Blocked-on-Operations, Operations, Monitoring, Services: Update restbase catchpoint metric - https://phabricator.wikimedia.org/T137181#2385908 (mark) p:High>Normal [17:17:19] RECOVERY - mediawiki-installation DSH group on mw2247 is OK: OK [17:18:21] RECOVERY - Disk space on mw1136 is OK: DISK OK [17:18:40] RECOVERY - mediawiki-installation DSH group on mw1278 is OK: OK [17:19:00] RECOVERY - nutcracker process on mw1136 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [17:19:00] RECOVERY - SSH on mw1136 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0) [17:19:20] RECOVERY - HHVM processes on mw1136 is OK: PROCS OK: 6 processes with command name hhvm [17:19:21] RECOVERY - configured eth on mw1136 is OK: OK - interfaces up [17:19:39] RECOVERY - dhclient process on mw1136 is OK: PROCS OK: 0 processes with command name dhclient [17:19:50] RECOVERY - nutcracker port on mw1136 is OK: TCP OK - 0.000 second response time on port 11212 [17:20:19] RECOVERY - DPKG on mw1136 is OK: All packages OK [17:20:20] RECOVERY - salt-minion processes on mw1136 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:21:20] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [17:24:18] (CR) Yuvipanda: [C: 2] Introduce 'Backends' [software/tools-webservice] - https://gerrit.wikimedia.org/r/292028 (owner: Yuvipanda) [17:24:41] (CR) Yuvipanda: [C: 2] Add LICENSE [software/tools-webservice] - https://gerrit.wikimedia.org/r/292056 (owner: Yuvipanda) [17:25:21] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2363735 (greg) >>! In T137224#2382844, @mmodell wrote: >>>! In T137224#2381927, @Joe wrote: >> @20after4 do you thi... 
[17:25:49] PROBLEM - nutcracker process on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:25:49] PROBLEM - SSH on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:26:09] PROBLEM - HHVM processes on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:26:09] PROBLEM - configured eth on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:26:21] PROBLEM - dhclient process on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:26:21] (03PS30) 10Yuvipanda: Add a Kubernetes backend [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/293063 [17:26:40] PROBLEM - nutcracker port on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:27:00] PROBLEM - DPKG on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:27:10] PROBLEM - salt-minion processes on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:27:26] (03CR) 10Yuvipanda: [C: 032] Add a Kubernetes backend [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/293063 (owner: 10Yuvipanda) [17:27:29] PROBLEM - Disk space on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:31:51] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [17:34:09] RECOVERY - Check size of conntrack table on mw1136 is OK: OK: nf_conntrack is 0 % full [17:34:09] RECOVERY - Disk space on mw1136 is OK: DISK OK [17:34:40] RECOVERY - nutcracker process on mw1136 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [17:34:40] RECOVERY - SSH on mw1136 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0) [17:34:59] RECOVERY - HHVM processes on mw1136 is OK: PROCS OK: 6 processes with command name hhvm [17:35:00] RECOVERY - configured eth on mw1136 is OK: OK - interfaces up [17:35:20] RECOVERY - dhclient process on mw1136 is OK: PROCS OK: 0 processes with command name dhclient [17:35:31] RECOVERY - nutcracker port on mw1136 is OK: TCP OK - 0.000 second response time on port 11212 [17:35:40] RECOVERY - Apache HTTP on mw1136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 1.528 second response time [17:35:51] RECOVERY - puppet last run on mw1117 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [17:35:59] RECOVERY - DPKG on mw1136 is OK: All packages OK [17:36:00] RECOVERY - salt-minion processes on mw1136 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:36:11] RECOVERY - HHVM rendering on mw1136 is OK: HTTP OK: HTTP/1.1 200 OK - 67293 bytes in 0.443 second response time [17:38:09] RECOVERY - puppet last run on mw1136 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:40:01] RECOVERY - mediawiki-installation DSH group on mw2241 is OK: OK [17:42:49] (03PS1) 10Yuvipanda: Bump deb version [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294741 [17:44:13] 07Blocked-on-Operations, 10Datasets-Archiving, 10Dumps-Generation, 10Flow, 03Collab-Team-2016-Apr-Jun-Q4: Publish recurring Flow dumps at http://dumps.wikimedia.org/ - https://phabricator.wikimedia.org/T119511#2386030 
(Mattflaschen-WMF) Open>Resolved >>! In T119511#2384613, @Nemo_bis wrote: > h... [17:47:40] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:52:50] Blocked-on-Operations, Operations, Monitoring, Services: Update restbase catchpoint metric - https://phabricator.wikimedia.org/T137181#2386062 (GWicke) Thanks, @mark! [17:53:02] (PS1) Thcipriani: scap: make deployment aware of canary machines [puppet] - https://gerrit.wikimedia.org/r/294742 (https://phabricator.wikimedia.org/T110068) [17:54:31] Operations, Performance-Team, Traffic: Support brotli compression - https://phabricator.wikimedia.org/T137979#2386066 (BBlack) Other interesting references: https://datatracker.ietf.org/doc/draft-alakuijala-brotli/ (IETF standard, seems pretty far along in the approval process) https://blog.cloudfla... [17:59:20] (CR) Thcipriani: "I would like to add a target object in scap that uses etcd to get a list of targets; however, looking at what's currently available via co" [puppet] - https://gerrit.wikimedia.org/r/294742 (https://phabricator.wikimedia.org/T110068) (owner: Thcipriani) [18:11:33] Operations, Performance-Team, Traffic: Support brotli compression - https://phabricator.wikimedia.org/T137979#2386165 (BBlack) p:Normal>Low A very quick check (just a couple of minutes on one cache_text machine) shows about 7% of requests indicate brotli support in Accept-Encoding. Not big e... [18:20:12] Operations, Traffic, Community-Liaisons (Jul-Sep-2016): Help contact bot owners about the end of HTTP access to the API - https://phabricator.wikimedia.org/T136674#2386183 (Whatamidoing-WMF) I've posted notes for the newest four. 
[18:21:29] 06Operations, 10Traffic, 06Wikipedia-Android-App-Backlog, 10iOS-app-Bugs: Zero: Investigate removing the limit on carrier tagging to m-dot and zero-dot requests - https://phabricator.wikimedia.org/T137990#2386185 (10Mholloway) [18:21:44] 06Operations, 10Traffic, 06Wikipedia-Android-App-Backlog, 06Zero, 10iOS-app-Bugs: Zero: Investigate removing the limit on carrier tagging to m-dot and zero-dot requests - https://phabricator.wikimedia.org/T137990#2386188 (10Mholloway) [18:22:29] !log change-prop deploying bc87a1fecfa [18:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:22:56] * robh is running out for lunch (just mentioning it since he is on ops clinic duty) [18:24:48] (03CR) 10Urbanecm: [C: 031] "Looks good for me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294652 (https://phabricator.wikimedia.org/T137888) (owner: 10Luke081515) [18:26:20] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [18:35:38] 06Operations, 06Performance-Team, 10Traffic: Support brotli compression - https://phabricator.wikimedia.org/T137979#2385879 (10ori) >>! In T137979#2386165, @BBlack wrote: > A very quick check (just a couple of minutes on one cache_text machine) shows about 7% of requests indicate brotli support in Accept-Enc... [18:37:02] !log running invalidateUserSessions.php for T137799 [18:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:45:09] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [18:49:42] 06Operations, 06Performance-Team, 10Traffic: Support brotli compression - https://phabricator.wikimedia.org/T137979#2385879 (10Krinkle) >>! In T137979#2386210, @ori wrote: >>>! In T137979#2386165, @BBlack wrote: >> A very quick check (just a couple of minutes on one cache_text machine) shows about 7% of requ... 
[18:50:09] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [18:51:39] 06Operations, 06Performance-Team, 10Traffic: Support brotli compression - https://phabricator.wikimedia.org/T137979#2386232 (10BBlack) Yeah ori's right, I didn't filter properly. Interesting! [18:55:09] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [19:00:04] hashar: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160616T1900). [19:00:09] RECOVERY - check_puppetrun on db1025 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [19:00:29] jouncebot: o/ [19:01:01] (03PS1) 10Papaul: DHCP: Add mw2243 MAC address Bug:T135466 [puppet] - 10https://gerrit.wikimedia.org/r/294745 (https://phabricator.wikimedia.org/T135466) [19:02:29] (03PS1) 10Hashar: all wikis to 1.28.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294746 [19:02:40] 896 migrated ! [19:03:02] (03PS2) 10Hashar: all wikis to 1.28.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294746 (https://phabricator.wikimedia.org/T136971) [19:04:37] (03CR) 10Hashar: [C: 032] all wikis to 1.28.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294746 (https://phabricator.wikimedia.org/T136971) (owner: 10Hashar) [19:05:13] (03Merged) 10jenkins-bot: all wikis to 1.28.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294746 (https://phabricator.wikimedia.org/T136971) (owner: 10Hashar) [19:05:32] !log hashar@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.28.0-wmf.6 [19:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:05:42] doh [19:05:45] doesnt sound right [19:06:19] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by 
ProtocolError(Connection aborted., error(111, Connection refused))) [19:07:28] Jun 16 19:06:44 mw1138: #012Warning: parseAndStash() expects exactly 4 parameters, 3 given in /srv/mediawiki/php-1.28.0-wmf.6/includes/api/ApiStashEdit.php on line 182 [19:07:28] Jun 16 19:06:45 mw1138: #012Notice: Undefined variable: summary in /srv/mediawiki/php-1.28.0-wmf.6/includes/api/ApiStashEdit.php on line 157 [19:07:54] hashar: AaronSchulz / ori [19:08:14] * aude_ waves [19:08:30] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [19:08:37] I am wondering whether that has an impact on actual edits [19:08:50] PROBLEM - puppet last run on mw2154 is CRITICAL: CRITICAL: puppet fail [19:10:13] at least https://grafana.wikimedia.org/dashboard/db/edit-count does not show any drop [19:11:04] edit summaries seem to still work [19:11:21] hmm, possibly just stashing [19:11:28] though sure i could be missing something [19:11:37] doesn't stashing always happen [19:11:39] ? [19:11:42] ori or aaron will probably soon notice that their stash rate metrics just dropped by 100% :P [19:11:46] with wikitext editing? [19:13:06] MatmaRex: meeting now, I can look soon [19:14:44] filed as https://phabricator.wikimedia.org/T137995 [19:17:25] (03CR) 10JanZerebecki: "I did a quick check of how cargo verifies downloads: https://phabricator.wikimedia.org/T137996 please reopen if more is needed." [debs/geckodriver] - 10https://gerrit.wikimedia.org/r/294293 (https://phabricator.wikimedia.org/T137797) (owner: 10Hashar) [19:21:49] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 4 failures [19:25:39] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). 
[19:35:54] RECOVERY - puppet last run on mw2154 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:47:33] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:48:38] 07Blocked-on-Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 05Gerrit-Migration, and 2 others: Package xhpast (libphutil) - https://phabricator.wikimedia.org/T137770#2386535 (10mmodell) [19:51:06] 07Blocked-on-Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 05Gerrit-Migration, and 2 others: Package xhpast (libphutil) - https://phabricator.wikimedia.org/T137770#2386536 (10mmodell) [19:51:24] 07Blocked-on-Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 05Gerrit-Migration, and 2 others: Package xhpast (libphutil) - https://phabricator.wikimedia.org/T137770#2378301 (10mmodell) [19:52:18] 07Blocked-on-Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 05Gerrit-Migration, and 2 others: Package xhpast (libphutil) - https://phabricator.wikimedia.org/T137770#2378301 (10mmodell) [19:53:33] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 10.840 second response time [19:53:56] (03PS2) 10Yuvipanda: Bump deb version [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294741 [19:55:44] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 14.663 second response time [19:55:53] (03CR) 10Gehel: Team-interactive receives maps alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/294723 (https://phabricator.wikimedia.org/T137869) (owner: 10Gehel) [19:56:10] (03PS4) 10Gehel: Team-interactive receives maps alerts [puppet] - 10https://gerrit.wikimedia.org/r/294723 (https://phabricator.wikimedia.org/T137869) [19:57:31] (03CR) 10jenkins-bot: [V: 04-1] 
Team-interactive receives maps alerts [puppet] - 10https://gerrit.wikimedia.org/r/294723 (https://phabricator.wikimedia.org/T137869) (owner: 10Gehel) [19:58:08] (03CR) 10Gehel: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/294723 (https://phabricator.wikimedia.org/T137869) (owner: 10Gehel) [20:03:10] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [20:10:52] (03PS3) 10EBernhardson: search: Dependent config for textcat AB test. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294739 [20:10:57] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/294723 (https://phabricator.wikimedia.org/T137869) (owner: 10Gehel) [20:19:33] (03CR) 10EBernhardson: "minor quibble, but if adding a parameters documentation section might as well document the existing parameter as well." [puppet] - 10https://gerrit.wikimedia.org/r/294723 (https://phabricator.wikimedia.org/T137869) (owner: 10Gehel) [20:20:32] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [20:22:53] (03PS5) 10Gehel: Team-interactive receives maps alerts [puppet] - 10https://gerrit.wikimedia.org/r/294723 (https://phabricator.wikimedia.org/T137869) [20:23:43] (03CR) 10Gehel: "@EBernhardson: documentation added, but I'm not really sure of what that does. I'll ask @yurik to check it..." 
[puppet] - 10https://gerrit.wikimedia.org/r/294723 (https://phabricator.wikimedia.org/T137869) (owner: 10Gehel) [20:24:05] (03PS1) 10JanZerebecki: Add gitblit compatibility apache vhost to phabricator [puppet] - 10https://gerrit.wikimedia.org/r/294784 (https://phabricator.wikimedia.org/T137224) [20:25:00] (03CR) 10Paladox: [C: 031] ":)" [puppet] - 10https://gerrit.wikimedia.org/r/294784 (https://phabricator.wikimedia.org/T137224) (owner: 10JanZerebecki) [20:25:21] (03PS6) 10Gehel: Team-interactive receives maps alerts [puppet] - 10https://gerrit.wikimedia.org/r/294723 (https://phabricator.wikimedia.org/T137869) [20:28:01] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:29:36] (03PS7) 10Gehel: Team-interactive receives maps alerts [puppet] - 10https://gerrit.wikimedia.org/r/294723 (https://phabricator.wikimedia.org/T137869) [20:31:52] 06Operations, 10Traffic, 06Community-Liaisons (Jul-Sep-2016): Help contact bot owners about the end of HTTP access to the API - https://phabricator.wikimedia.org/T136674#2343854 (10Danmichaelo) Fixed CatWatchBot and hopefully the remaining tasks for DanmicholoBot [20:34:45] (03PS8) 10Gehel: Team-interactive receives maps alerts [puppet] - 10https://gerrit.wikimedia.org/r/294723 (https://phabricator.wikimedia.org/T137869) [20:37:03] (03CR) 10Yurik: [C: 031] Team-interactive receives maps alerts [puppet] - 10https://gerrit.wikimedia.org/r/294723 (https://phabricator.wikimedia.org/T137869) (owner: 10Gehel) [20:38:36] (03CR) 10Gehel: [C: 032] Team-interactive receives maps alerts [puppet] - 10https://gerrit.wikimedia.org/r/294723 (https://phabricator.wikimedia.org/T137869) (owner: 10Gehel) [20:43:50] (03PS1) 10Yuvipanda: labspuppetbackend: Make sure to propogate errors to uwsgi log [puppet] - 10https://gerrit.wikimedia.org/r/294834 [20:43:53] 06Operations, 06Discovery, 06Maps, 07Epic: Epic: switch Maps to production status - 
https://phabricator.wikimedia.org/T133744#2386810 (10Gehel) [20:43:55] 06Operations, 06Discovery, 06Maps, 03Maps-Sprint, 13Patch-For-Review: Review alerting scheme for Maps - https://phabricator.wikimedia.org/T137869#2386809 (10Gehel) 05Open>03Resolved [20:44:45] 06Operations, 06Discovery, 06Maps, 13Patch-For-Review: "Is maps service alive?" check - https://phabricator.wikimedia.org/T137851#2386811 (10Gehel) Check implemented, also alerting team-interactive. [20:44:54] 06Operations, 06Discovery, 06Maps, 07Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744#2241419 (10Gehel) [20:44:56] 06Operations, 06Discovery, 06Maps, 13Patch-For-Review: "Is maps service alive?" check - https://phabricator.wikimedia.org/T137851#2386812 (10Gehel) 05Open>03Resolved [20:45:09] (03PS2) 10JanZerebecki: Add gitblit compatibility apache vhost to phabricator [puppet] - 10https://gerrit.wikimedia.org/r/294784 (https://phabricator.wikimedia.org/T137224) [20:45:15] andrewbogott (IRC): if you merge ^ patch you can re-enable puppet on labstestcontrol2001 [20:45:56] (03PS2) 10Andrew Bogott: labspuppetbackend: Make sure to propogate errors to uwsgi log [puppet] - 10https://gerrit.wikimedia.org/r/294834 (owner: 10Yuvipanda) [20:46:40] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:49:44] (03CR) 10jenkins-bot: [V: 04-1] labspuppetbackend: Make sure to propogate errors to uwsgi log [puppet] - 10https://gerrit.wikimedia.org/r/294834 (owner: 10Yuvipanda) [20:50:22] 06Operations, 06Discovery, 06Maps: Improve automation around Maps servers - https://phabricator.wikimedia.org/T138017#2386884 (10Gehel) [20:51:19] 06Operations, 06Discovery, 06Maps: Ensure that maps server can be automatically installed (fully puppetized) - https://phabricator.wikimedia.org/T135750#2386903 (10Gehel) We are good enough at the moment. Some notes about things we still need to improve are in T138017. 
[20:51:21] (03PS3) 10Yuvipanda: labspuppetbackend: Make sure to propogate errors to uwsgi log [puppet] - 10https://gerrit.wikimedia.org/r/294834 [20:51:25] 06Operations, 06Discovery, 06Maps: Ensure that maps server can be automatically installed (fully puppetized) - https://phabricator.wikimedia.org/T135750#2386905 (10Gehel) 05Open>03Resolved [20:51:26] (03PS4) 10Andrew Bogott: labspuppetbackend: Make sure to propogate errors to uwsgi log [puppet] - 10https://gerrit.wikimedia.org/r/294834 (owner: 10Yuvipanda) [20:51:27] 06Operations, 06Discovery, 06Maps, 07Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744#2386906 (10Gehel) [20:53:34] (03CR) 10Andrew Bogott: [C: 032] labspuppetbackend: Make sure to propogate errors to uwsgi log [puppet] - 10https://gerrit.wikimedia.org/r/294834 (owner: 10Yuvipanda) [20:53:51] 06Operations, 10Android-app-feature-Feeds, 10Mobile-Content-Service, 10RESTBase, and 2 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2386912 (10GWicke) [20:59:21] (03PS2) 10Gehel: maps caches: remove referrer checks [puppet] - 10https://gerrit.wikimedia.org/r/294390 (https://phabricator.wikimedia.org/T137848) (owner: 10MaxSem) [21:01:03] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [21:01:52] (03CR) 10Gehel: [C: 032] "Alerting is good, so are all blockers to https://phabricator.wikimedia.org/T133744" [puppet] - 10https://gerrit.wikimedia.org/r/294390 (https://phabricator.wikimedia.org/T137848) (owner: 10MaxSem) [21:03:42] (03PS3) 10Yuvipanda: Bump deb version [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294741 [21:03:44] (03PS1) 10Yuvipanda: Add appropriate dependencies to package [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294837 [21:15:28] !log hashar@tin Synchronized php-1.28.0-wmf.6/extensions/VisualEditor/ApiVisualEditor.php: Pass empty summary to parseAndStash() to avoid warnings 
T137995 (duration: 00m 39s) [21:15:29] T137995: ApiStashEdit warning and notices - https://phabricator.wikimedia.org/T137995 [21:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:16:30] (train is done) [21:16:52] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [21:18:07] 06Operations, 10Android-app-feature-Feeds, 10Mobile-Content-Service, 10RESTBase, and 2 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2387040 (10GWicke) Another possibility is that there are issues with the eventlogging instrumentation. The number... [21:20:37] 06Operations, 06Discovery, 10Kartotherian, 06Maps, and 3 others: Remove referrer check from varnish for maps cluster - https://phabricator.wikimedia.org/T137848#2387056 (10Gehel) 05Open>03Resolved [21:27:33] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:33:32] (03PS2) 10Jhobs: Prepare Wikidata descriptions on mobile for production rollout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293883 (https://phabricator.wikimedia.org/T127250) [21:36:36] (03CR) 10Bmansurov: [C: 031] Prepare Wikidata descriptions on mobile for production rollout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293883 (https://phabricator.wikimedia.org/T127250) (owner: 10Jhobs) [21:43:18] (03CR) 10Yuvipanda: [C: 032] Add appropriate dependencies to package [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294837 (owner: 10Yuvipanda) [21:46:18] 06Operations, 10Android-app-feature-Feeds, 10Mobile-Content-Service, 10RESTBase, and 2 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2387115 (10Mholloway) Sure, one or both of us can look into this. 
[21:46:21] (03PS4) 10Yuvipanda: Bump deb version [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294741 [21:46:24] (03PS1) 10Yuvipanda: Exit when given unsupported parameters [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294843 [21:46:26] (03PS1) 10Yuvipanda: Set explicit default for args.type [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294844 [22:00:04] yurik and maxsem: Dear anthropoid, the time has come. Please deploy Enable Maps Wikidata & Commons (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160616T2200). [22:04:48] 06Operations, 10DBA, 06Labs, 10Tool-Labs, 10Traffic: Antigng-bot improper non-api http requests - https://phabricator.wikimedia.org/T137707#2376097 (10Legoktm) >>! In T137707#2382924, @Antigng_ wrote: > Lack of hard and fast limit on read requests can be a problem, since your definition of request limit... [22:05:17] 06Operations, 03Discovery-Search-Sprint: Enable GC (garbage collection) logs on Elasticsearch JVM - https://phabricator.wikimedia.org/T134853#2387206 (10debt) [22:05:41] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [22:13:55] (03PS1) 10MaxSem: Enable Kartographer on Commons and Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294854 (https://phabricator.wikimedia.org/T138029) [22:16:31] (03CR) 10Yurik: [C: 031] Enable Kartographer on Commons and Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294854 (https://phabricator.wikimedia.org/T138029) (owner: 10MaxSem) [22:18:44] MaxSem, there are two patches i cherrypicked [22:19:22] https://gerrit.wikimedia.org/r/#/c/294856/ [22:19:25] https://gerrit.wikimedia.org/r/#/c/294855/ [22:20:51] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures [22:21:27] MaxSem, ^^ [22:21:50] 06Operations, 10Parsoid, 06Services: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2387274 (10ssastry) 
Addendum to my earlier performance numbers: On a bunch of pages, looks like DOM post processing is about 2x faster on v4.3 vs v0.10 on my laptop. [22:22:45] (03CR) 10MaxSem: [C: 032] Enable Kartographer on Commons and Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294854 (https://phabricator.wikimedia.org/T138029) (owner: 10MaxSem) [22:22:51] * yurik hides [22:23:18] (03Merged) 10jenkins-bot: Enable Kartographer on Commons and Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294854 (https://phabricator.wikimedia.org/T138029) (owner: 10MaxSem) [22:24:13] (03CR) 10Niedzielski: "@hashar, I was afraid I had said something!" [puppet] - 10https://gerrit.wikimedia.org/r/264303 (owner: 10Niedzielski) [22:24:27] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/294854/ (duration: 00m 26s) [22:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:27:01] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:32:57] MaxSem, maps show up on both, all's good [22:33:37] !log maxsem@tin Synchronized php-1.28.0-wmf.6/extensions/Kartographer: https://gerrit.wikimedia.org/r/294856 https://gerrit.wikimedia.org/r/294855 (duration: 00m 30s) [22:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:46:51] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:59:01] 06Operations, 06Collaboration-Team-Interested, 10Flow, 07WorkType-Maintenance: Setup separate logical External Store for Flow - https://phabricator.wikimedia.org/T107610#1499219 (10Mattflaschen-WMF) [23:00:04] RoanKattouw, ostriches, Krenair, MaxSem, awight, and Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160616T2300). 
[23:00:04] dr0ptp4kt, Luke081515, and EBernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:16] * Luke081515 is here [23:00:35] * dr0ptp4kt here [23:02:04] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [23:02:44] here [23:03:36] who will SWAT? [23:04:01] well, since no one is jumping i suppose i can do it [23:04:41] first in the list, dr0ptp4kt [23:04:51] Hello. [23:04:52] labs only change, seems safe enough [23:04:53] wooooooo! i'm going to disneyland! [23:05:01] (03PS3) 10EBernhardson: Prepare Wikidata descriptions on mobile for production rollout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293883 (https://phabricator.wikimedia.org/T127250) (owner: 10Jhobs) [23:05:08] dr0ptp4kt, beware of gators [23:05:15] (03CR) 10EBernhardson: [C: 032] Prepare Wikidata descriptions on mobile for production rollout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293883 (https://phabricator.wikimedia.org/T127250) (owner: 10Jhobs) [23:05:19] Thanks ebernhardson for taking care of this SWAT :) [23:05:22] Dereckson: np [23:05:35] MaxSem! [23:05:45] (03PS2) 10EBernhardson: Two permission changes at urwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294652 (https://phabricator.wikimedia.org/T137888) (owner: 10Luke081515) [23:05:56] (03Merged) 10jenkins-bot: Prepare Wikidata descriptions on mobile for production rollout [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293883 (https://phabricator.wikimedia.org/T127250) (owner: 10Jhobs) [23:07:40] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings-labs.php: T127250: Prepare Wikidata descriptions on mobile for production rollout (duration: 00m 27s) [23:07:41] T127250: Prepare Wikidata descriptions to roll out to stable - https://phabricator.wikimedia.org/T127250 [23:07:41] (03CR) 10EBernhardson: [C: 032] "patch matches ticket. 
SWAT'ing" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294652 (https://phabricator.wikimedia.org/T137888) (owner: 10Luke081515) [23:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:08:49] 06Operations, 06Collaboration-Team-Interested, 10DBA, 10Flow, 07WorkType-Maintenance: Setup separate logical External Store for Flow - https://phabricator.wikimedia.org/T107610#2387618 (10Mattflaschen-WMF) a:03jcrespo @jcrespo, I think this is the next concrete step ({T119568} will get QA-ed, but that'... [23:08:51] dr0ptp4kt: your patch is synced [23:09:00] ebernhardson: thx, will check [23:09:04] (03PS3) 10Luke081515: Two permission changes at urwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294652 (https://phabricator.wikimedia.org/T137888) [23:09:14] ebernhardson: another rebase was needed :-/ [23:09:26] only ff is sometimes annoying [23:09:50] indeed :) [23:10:08] (03CR) 10EBernhardson: [C: 032] "one more time!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294652 (https://phabricator.wikimedia.org/T137888) (owner: 10Luke081515) [23:10:37] ebernhardson: looks good. no fatals? if so, good. [23:10:42] (03Merged) 10jenkins-bot: Two permission changes at urwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294652 (https://phabricator.wikimedia.org/T137888) (owner: 10Luke081515) [23:11:36] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: T137888: Two permission changes at urwiki (duration: 00m 27s) [23:11:37] T137888: Enable Accountcreator and Filemover groups on Urdu Wikipedia - https://phabricator.wikimedia.org/T137888 [23:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:12:06] Luke081515: you're synced out. I imagine you can't directly test though? 
[23:12:21] ebernhardson: I checked Special:ListGroupRights, it works :) [23:12:35] thank you for swat :) [23:13:30] sweet [23:13:58] (03PS4) 10EBernhardson: search: Dependent config for textcat AB test. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294739 [23:14:06] (03CR) 10EBernhardson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294739 (owner: 10EBernhardson) [23:14:46] (03Merged) 10jenkins-bot: search: Dependent config for textcat AB test. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294739 (owner: 10EBernhardson) [23:16:01] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: T137167: search: Dependent config for textcat AB test. (duration: 00m 26s) [23:16:02] T137167: Part Deux: TextCat A/B test for Language Identification - create and deploy - https://phabricator.wikimedia.org/T137167 [23:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:17:01] 06Operations, 06Collaboration-Team-Interested, 10DBA, 10Flow, 07WorkType-Maintenance: Setup separate logical External Store for Flow - https://phabricator.wikimedia.org/T107610#2387654 (10Mattflaschen-WMF) [23:18:47] 06Operations, 06Collaboration-Team-Interested, 10DBA, 10Flow, 07WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610#1499219 (10Mattflaschen-WMF) [23:19:42] !log ebernhardson@tin Synchronized php-1.28.0-wmf.6/extensions/WikimediaEvents/modules/ext.wikimediaEvents.searchSatisfaction.js: T137167: TextCat A/B test for Language Identification (duration: 00m 24s) [23:19:43] T137167: Part Deux: TextCat A/B test for Language Identification - create and deploy - https://phabricator.wikimedia.org/T137167 [23:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:24:10] !log ebernhardson@tin Synchronized php-1.28.0-wmf.6/extensions/WikimediaEvents/extension.json: T137167: TextCat A/B test for Language 
Identification (duration: 00m 24s) [23:24:10] T137167: Part Deux: TextCat A/B test for Language Identification - create and deploy - https://phabricator.wikimedia.org/T137167 [23:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:26:53] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [23:40:19] 06Operations, 10ops-codfw, 10media-storage, 13Patch-For-Review: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T136630#2387684 (10Papaul) a:05RobH>03Papaul [23:43:43] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2387689 (10Danny_B) URLs marked with {icon check-square-o color=green} are redirected to their appropriate or similar... [23:44:42] !log ebernhardson@tin Synchronized php-1.28.0-wmf.6/extensions/WikimediaEvents/modules/ext.wikimediaEvents.searchSatisfaction.js: T137167: TextCat A/B test for Language Identification (duration: 00m 25s) [23:44:43] T137167: Part Deux: TextCat A/B test for Language Identification - create and deploy - https://phabricator.wikimedia.org/T137167 [23:44:49] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2387692 (10Paladox) @Danny_B thankyou :). [23:49:35] 06Operations, 10ops-codfw, 10media-storage: codfw: rack/setup/deploy ms-be202[2-7] switch configuration - https://phabricator.wikimedia.org/T138052#2387694 (10Papaul) [23:54:56] (03PS8) 10Mattflaschen: Change login cookies (for 'Remember me') to a one year expiry. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230954 (https://phabricator.wikimedia.org/T68699)