[00:00:04] RoanKattouw, ^d, marktraceur, kaldari: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150107T0000). Please do the needful. [00:01:21] RECOVERY - puppet last run on amssq58 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [00:02:35] (03PS1) 10Ori.livneh: varnish: Route requests with 'X-Wikimedia-Debug=1' to test_wikipedia backend [puppet] - 10https://gerrit.wikimedia.org/r/183171 [00:02:40] bblack: ^ [00:03:00] (03PS2) 10Ori.livneh: varnish: Route requests with 'X-Wikimedia-Debug=1' to test_wikipedia backend [puppet] - 10https://gerrit.wikimedia.org/r/183171 [00:03:54] Anybody one SWAT deploy duty today? If not, I can do my own. [00:03:55] ori: I'm waiting for PS3 :) [00:04:14] PS2 just changed the commit message [00:05:22] i'm hoping it's less painful to review than vcl commits typically are because i based it on the HHVM pool thing [00:05:50] :) [00:08:44] ori: cluster_tier stuff applies everywhere... [00:09:06] ori: cluster_tier is e.g. esams-vs-eqiad, not front-vs-back layer within one site [00:09:37] we probably want to hold off setting the backend until it reaches eqiad [00:11:12] bblack: i'm not sure what you mean. so any set req.backend = test_wikipedia should be gated with an <% if @vcl_config.fetch("cluster_tier", "1") == "1" -%> ? [00:12:19] ori: yeah, the setting of the backend should be, but not the setting of hit_for_pass + 0s (which actually becomes 120s later, but that's ok) [00:12:42] so that all the test reqs flow through all the normal layers and hit all the normal logic, basically. [00:16:49] (03PS3) 10Ori.livneh: varnish: Route requests with 'X-Wikimedia-Debug=1' to test_wikipedia backend [puppet] - 10https://gerrit.wikimedia.org/r/183171 [00:38:58] (03PS1) 10Springle: depool db1004 db1007 db1010 db1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183173 [00:39:36] (03CR) 10Springle: [C: 032] depool db1004 db1007 db1010 db1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183173 (owner: 10Springle) [00:41:20] (03Merged) 10jenkins-bot: depool db1004 db1007 db1010 db1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183173 (owner: 10Springle) [00:49:04] (03PS1) 10Springle: sideline db1004 db1007 db1010. move db1015 to s3 [puppet] - 10https://gerrit.wikimedia.org/r/183174 [00:51:42] (03CR) 10Springle: [C: 032] sideline db1004 db1007 db1010. 
move db1015 to s3 [puppet] - 10https://gerrit.wikimedia.org/r/183174 (owner: 10Springle) [00:52:02] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 336 seconds [00:52:05] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 339 seconds [00:53:12] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds [00:53:15] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -0 seconds [01:06:05] (03PS1) 10Springle: deploy es2005 es2007 es2 codfw [puppet] - 10https://gerrit.wikimedia.org/r/183178 [01:07:17] (03CR) 10Springle: [C: 032] deploy es2005 es2007 es2 codfw [puppet] - 10https://gerrit.wikimedia.org/r/183178 (owner: 10Springle) [01:11:23] !log xtrabackup clone es2006 to es2005 [01:11:34] Logged the message, Master [01:14:11] (03PS6) 10Krinkle: contint: hourly auto update of wikimedia packages [puppet] - 10https://gerrit.wikimedia.org/r/183019 (owner: 10Hashar) [01:14:22] (03PS14) 10Krinkle: contint: provision hhvm on CI slaves [puppet] - 10https://gerrit.wikimedia.org/r/178806 (owner: 10Hashar) [01:16:43] (03CR) 10Krinkle: contint: provision hhvm on CI slaves (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/178806 (owner: 10Hashar) [01:20:31] (03PS1) 10Springle: deploy es2009 es2010 to es3 codfw [puppet] - 10https://gerrit.wikimedia.org/r/183180 [01:22:29] (03CR) 10Springle: [C: 032] deploy es2009 es2010 to es3 codfw [puppet] - 10https://gerrit.wikimedia.org/r/183180 (owner: 10Springle) [01:22:34] PROBLEM - RAID on es2010 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [01:24:48] !log kaldari Synchronized php-1.25wmf13/extensions/MobileFrontend/: Sync MobileFrontend in 1.25wmf13 for VE fix (duration: 00m 05s) [01:24:54] Logged the message, Master [01:27:34] PROBLEM - puppet last run on es2009 is CRITICAL: CRITICAL: Puppet has 1 failures [01:28:44] RECOVERY - puppet last run on es2009 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [01:31:55] ACKNOWLEDGEMENT - RAID on es2010 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Sean Pringle T85978 [01:33:38] !log xtrabackup clone es2008 to es2009 [01:33:45] Logged the message, Master [01:56:59] !log ori Synchronized php-1.25wmf12/extensions/EventLogging: I07d9bc8: Update EventLogging for cherry-picks (duration: 00m 06s) [01:57:05] !log ori Synchronized php-1.25wmf13/extensions/EventLogging: I69c8daf: Update EventLogging for cherry-picks (duration: 00m 05s) [01:57:06] Logged the message, Master [01:57:09] Logged the message, Master [02:01:01] andre__, herald rule for those AASCIT spam mails perhaps? [02:01:29] Krenair: feel free to propose criteria for one [02:02:19] author: emailbot, contains: aascit [02:03:41] so "Author: is any of: emailbot" and "Body: contains: aascit"? 
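For reference, the X-Wikimedia-Debug change discussed above (00:02–00:16, https://gerrit.wikimedia.org/r/183171) routes requests carrying that header to the test_wikipedia backend, with the backend switch gated on cluster_tier so it only takes effect once the request reaches the primary (eqiad) layer, while hit_for_pass keeps the request flowing through all the normal cache layers. A minimal client-side sketch of how such routing could be exercised once deployed; the URL is an arbitrary example, `requests` is assumed to be installed, and which response headers usefully reveal the routing is an assumption rather than part of the change itself.

```
#!/usr/bin/env python
"""Probe how a request with X-Wikimedia-Debug: 1 is routed.

Sketch only: compares a normal request with a debug-flagged one.
The target URL is an example; the X-Cache header (listing the cache
servers that handled the request) is used here purely as a hint.
"""
import requests

URL = 'https://test.wikipedia.org/wiki/Main_Page'  # example target

def fetch(debug):
    headers = {'X-Wikimedia-Debug': '1'} if debug else {}
    resp = requests.get(URL, headers=headers, timeout=10)
    # A debug-flagged request is expected to pass/miss through every
    # cache layer on its way to the test backend, which should show up
    # as misses rather than hits in whatever cache headers are exposed.
    return resp.status_code, resp.headers.get('X-Cache', '(no X-Cache header)')

for debug in (False, True):
    status, xcache = fetch(debug)
    print('debug=%s status=%s x-cache=%s' % (debug, status, xcache))
```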
[02:03:54] something like that, surer [02:03:56] sure* [02:05:32] !log kaldari Synchronized php-1.25wmf13/extensions/WikiGrok/: Fixing campaign generation in WikiGrok (duration: 00m 05s) [02:05:38] Logged the message, Master [02:06:13] Krenair, Herald does not allow removing projects or setting the task status to invalid [02:06:24] ah that's a shame [02:06:34] oh well [02:52:28] (03PS2) 10Gergő Tisza: [WIP] Deploy Sentry on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181439 [04:00:32] (03PS3) 10KartikMistry: Beta: Fix spacing and indentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/178772 [05:17:29] (03CR) 10Glaisher: "While the naming is quite different from other groups, 'interface_editor' is already used in other wikis as well and is localized. The onl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183056 (owner: 10Glaisher) [06:28:05] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: puppet fail [06:28:44] PROBLEM - puppet last run on analytics1010 is CRITICAL: CRITICAL: puppet fail [06:28:55] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 3 failures [06:28:55] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:04] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:05] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:14] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:25] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures [06:45:34] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [06:46:07] (03CR) 10Glaisher: "Reedy: is w/404.php in sync? I'm still seeing "font-family: 'Gill Sans', 'Gill Sans MT', sans-serif;" in https://en.wikipedia.org/w/404.ph" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181869 (owner: 10Glaisher) [06:46:15] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:46:15] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:46:24] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:46:24] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:46:25] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:46:34] RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:47:14] RECOVERY - puppet last run on analytics1010 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [06:47:54] PROBLEM - Ubuntu mirror in sync with upstream on carbon is CRITICAL: /srv/ubuntu/project/trace/carbon.wikimedia.org is over 12 hours old. [06:49:04] RECOVERY - Ubuntu mirror in sync with upstream on carbon is OK: /srv/ubuntu/project/trace/carbon.wikimedia.org is over 0 hours old. [06:49:59] !log xtrabackup clone es2006 to es2007 [06:50:04] Logged the message, Master [06:50:35] !log xtrabackup clone es2008 to es2010 [06:50:38] Logged the message, Master [07:12:25] YuviPanda: around? [07:12:29] kart_: sup [07:13:10] YuviPanda: https://phabricator.wikimedia.org/T85106 - once I have script, needed packages are ok to install on Beta? 
[07:13:38] betalabs? [07:13:42] (eg python-mysqldb) [07:13:44] yes. [07:13:47] it should probably run in the stat cluster [07:13:51] and not worry about betalabs at all [07:14:10] stat1003 has lots of cron jobs of similar nature running, and has a public http endpoint [07:14:23] so should just do that there, I think. [07:14:37] YuviPanda: where I can 'see' them? [07:14:44] as in? [07:14:47] the code? [07:14:52] code/cron [07:14:54] yes. [07:15:30] kart_: misc/statistics.pp [07:16:35] Thanks. [07:18:40] :) [07:19:10] confusing but doable :) [07:53:02] <_joe_> !log reimaging jobrunners mw1013-mw1016 (in batch of two) [07:53:04] Logged the message, Master [08:22:33] PROBLEM - salt-minion processes on mw1015 is CRITICAL: Connection refused by host [08:22:43] PROBLEM - DPKG on mw1016 is CRITICAL: Connection refused by host [08:22:54] PROBLEM - DPKG on mw1015 is CRITICAL: Connection refused by host [08:22:54] PROBLEM - Disk space on mw1016 is CRITICAL: Connection refused by host [08:23:14] PROBLEM - Disk space on mw1015 is CRITICAL: Connection refused by host [08:23:23] PROBLEM - RAID on mw1016 is CRITICAL: Connection refused by host [08:23:24] PROBLEM - RAID on mw1015 is CRITICAL: Connection refused by host [08:23:43] PROBLEM - configured eth on mw1016 is CRITICAL: Connection refused by host [08:23:54] PROBLEM - dhclient process on mw1016 is CRITICAL: Connection refused by host [08:23:54] PROBLEM - configured eth on mw1015 is CRITICAL: Connection refused by host [08:23:57] <_joe_> this is me reimaging [08:24:04] PROBLEM - dhclient process on mw1015 is CRITICAL: Connection refused by host [08:24:23] PROBLEM - nutcracker port on mw1016 is CRITICAL: Connection refused by host [08:24:33] PROBLEM - nutcracker process on mw1016 is CRITICAL: Connection refused by host [08:24:33] PROBLEM - nutcracker port on mw1015 is CRITICAL: Connection refused by host [08:24:43] PROBLEM - puppet last run on mw1016 is CRITICAL: Connection refused by host [08:24:43] PROBLEM - nutcracker process on mw1015 is CRITICAL: Connection refused by host [08:24:54] PROBLEM - puppet last run on mw1015 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
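The 07:13–07:19 exchange above is about running kart_'s script as one of the cron-style report jobs on stat1003 (python-mysqldb, a public HTTP endpoint, see misc/statistics.pp). A rough sketch of that kind of job; every connection parameter, the query, and the output path below are placeholders, and a real job would read credentials from a config file and write under the directory served by the public endpoint.

```
#!/usr/bin/env python
"""Cron-style report generator of the kind discussed above (07:13-07:19).

Sketch only: host, credentials, table name and output path are all
hypothetical; this is not an existing job from misc/statistics.pp.
"""
import MySQLdb  # python-mysqldb, the package mentioned in the discussion

OUTPUT = '/srv/reports/example-report.html'  # hypothetical path

def run_report():
    conn = MySQLdb.connect(
        host='analytics-store.example',     # placeholder host
        user='research', passwd='secret',   # placeholder credentials
        db='log')
    try:
        cur = conn.cursor()
        cur.execute('SELECT wiki, COUNT(*) FROM ExampleSchema_123 '
                    'GROUP BY wiki ORDER BY 2 DESC LIMIT 20')
        rows = cur.fetchall()
    finally:
        conn.close()
    # Write a small static HTML table for the public HTTP endpoint.
    with open(OUTPUT, 'w') as f:
        f.write('<table>\n')
        for wiki, count in rows:
            f.write('<tr><td>%s</td><td>%d</td></tr>\n' % (wiki, count))
        f.write('</table>\n')

if __name__ == '__main__':
    run_report()
```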
[08:24:54] PROBLEM - salt-minion processes on mw1016 is CRITICAL: Connection refused by host [08:32:14] RECOVERY - Disk space on mw1015 is OK: DISK OK [08:32:24] RECOVERY - nutcracker port on mw1015 is OK: TCP OK - 0.000 second response time on port 11212 [08:32:33] RECOVERY - RAID on mw1015 is OK: OK: no RAID installed [08:32:34] RECOVERY - nutcracker process on mw1015 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [08:32:43] RECOVERY - salt-minion processes on mw1015 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:32:44] RECOVERY - salt-minion processes on mw1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:32:54] RECOVERY - DPKG on mw1016 is OK: All packages OK [08:32:54] RECOVERY - dhclient process on mw1016 is OK: PROCS OK: 0 processes with command name dhclient [08:32:54] RECOVERY - configured eth on mw1015 is OK: NRPE: Unable to read output [08:33:13] RECOVERY - Disk space on mw1016 is OK: DISK OK [08:33:13] RECOVERY - DPKG on mw1015 is OK: All packages OK [08:33:13] RECOVERY - dhclient process on mw1015 is OK: PROCS OK: 0 processes with command name dhclient [08:33:24] RECOVERY - nutcracker port on mw1016 is OK: TCP OK - 0.000 second response time on port 11212 [08:33:33] RECOVERY - RAID on mw1016 is OK: OK: no RAID installed [08:33:33] RECOVERY - nutcracker process on mw1016 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [08:33:53] RECOVERY - configured eth on mw1016 is OK: NRPE: Unable to read output [08:46:24] RECOVERY - puppet last run on mw1016 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [08:46:34] RECOVERY - puppet last run on mw1015 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [08:47:16] (03PS1) 10Faidon Liambotis: Remove naggen (v1), unused nowadays [puppet] - 10https://gerrit.wikimedia.org/r/183206 [08:47:18] (03PS1) 10Faidon Liambotis: Remove default_gateway.rb fact, unused [puppet] - 10https://gerrit.wikimedia.org/r/183207 [08:47:20] (03PS1) 10Faidon Liambotis: Remove facter_dot_d fact, unused [puppet] - 10https://gerrit.wikimedia.org/r/183208 [08:47:22] (03PS1) 10Faidon Liambotis: Remove custom fact ec2id, replaced by facter's ec2 [puppet] - 10https://gerrit.wikimedia.org/r/183209 [08:50:57] (03CR) 10Faidon Liambotis: [C: 032] Remove naggen (v1), unused nowadays [puppet] - 10https://gerrit.wikimedia.org/r/183206 (owner: 10Faidon Liambotis) [08:51:07] (03CR) 10Faidon Liambotis: [C: 032] Remove default_gateway.rb fact, unused [puppet] - 10https://gerrit.wikimedia.org/r/183207 (owner: 10Faidon Liambotis) [08:51:21] (03CR) 10Faidon Liambotis: [C: 032] Remove facter_dot_d fact, unused [puppet] - 10https://gerrit.wikimedia.org/r/183208 (owner: 10Faidon Liambotis) [08:57:24] paravoid: yep, waiting on its disk https://phabricator.wikimedia.org/T85591 [09:13:25] paravoid: good morning :) apt::conf() does not support multiple keys, should I just create a file per setting I want to apply ? 
( context https://gerrit.wikimedia.org/r/#/c/183019/5/modules/contint/manifests/packages/labs.pp ) [09:16:39] (03CR) 10Alexandros Kosiaris: [C: 032] cxserver: Fix language code [puppet] - 10https://gerrit.wikimedia.org/r/183001 (owner: 10KartikMistry) [09:16:59] (03CR) 10Alexandros Kosiaris: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/183001 (owner: 10KartikMistry) [09:17:23] (03PS1) 10Giuseppe Lavagetto: hhvm: make cache files management explicit in puppet [puppet] - 10https://gerrit.wikimedia.org/r/183214 [09:17:29] (03CR) 10Hashar: contint: hourly auto update of wikimedia packages (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/183019 (owner: 10Hashar) [09:18:04] (03PS7) 10Hashar: contint: hourly auto update of wikimedia packages [puppet] - 10https://gerrit.wikimedia.org/r/183019 [09:21:11] paravoid: bah I did deleted the ld and offlined the pd but /dev/sdd didn't go anywhere, ideas on how to make it go away or just a reboot [09:24:48] (03CR) 10Filippo Giunchedi: [C: 031] "just one comment, LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/183214 (owner: 10Giuseppe Lavagetto) [09:28:54] (03CR) 10Hashar: "Applied on CI puppetmaster:" [puppet] - 10https://gerrit.wikimedia.org/r/183019 (owner: 10Hashar) [09:29:07] (03CR) 10Giuseppe Lavagetto: hhvm: make cache files management explicit in puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/183214 (owner: 10Giuseppe Lavagetto) [09:29:11] (03PS2) 10Giuseppe Lavagetto: hhvm: make cache files management explicit in puppet [puppet] - 10https://gerrit.wikimedia.org/r/183214 [09:32:21] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] hhvm: make cache files management explicit in puppet [puppet] - 10https://gerrit.wikimedia.org/r/183214 (owner: 10Giuseppe Lavagetto) [09:41:04] (03CR) 10Filippo Giunchedi: [C: 031] redis: Get rid of unused (and unhelpful) module parameters [puppet] - 10https://gerrit.wikimedia.org/r/183060 (owner: 10Ori.livneh) [09:46:51] (03CR) 10Giuseppe Lavagetto: [C: 031] "I am pretty sure the problem was having the hotprofiler being invoked and not just enabled, but we didn't establish that precisely, so my " [puppet] - 10https://gerrit.wikimedia.org/r/182992 (owner: 10Ori.livneh) [09:47:02] !log restarting Jenkins to resolve a deadlocks with the beta cluster jobs [09:47:09] Logged the message, Master [09:50:42] bah the varnish vcl have some issue :-( [09:50:42] ('mobile-frontend.inc.vcl' Line 102 Pos 40) [09:50:42] if (req.http.X-Wikimedia-Debug = "1") { [09:51:38] * hashar blames ori [09:52:35] <_joe_> I am taking a look at that change right now [09:52:39] <_joe_> it's not merged [09:52:53] <_joe_> so... where is that? [09:55:24] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 650 [09:57:53] (03CR) 10Hashar: "The patch applied on the beta cluster had some VCL typo which prevents reloading varnish on the mobile and text cache." [puppet] - 10https://gerrit.wikimedia.org/r/183171 (owner: 10Ori.livneh) [10:00:24] RECOVERY - check_mysql on db1008 is OK: Uptime: 7245206 Threads: 2 Questions: 207972965 Slow queries: 51500 Opens: 135106 Flush tables: 2 Open tables: 64 Queries per second avg: 28.704 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:01:03] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I like the concept and the implementation, just a couple of syntax errors and one minor concern to address for me." 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/183171 (owner: 10Ori.livneh) [10:01:26] (03CR) 10Filippo Giunchedi: "approval in related ticket, merging" [puppet] - 10https://gerrit.wikimedia.org/r/181556 (owner: 10Filippo Giunchedi) [10:02:25] PROBLEM - Disk space on mw1013 is CRITICAL: Connection refused by host [10:02:45] PROBLEM - RAID on mw1014 is CRITICAL: Connection refused by host [10:02:55] PROBLEM - RAID on mw1013 is CRITICAL: Connection refused by host [10:03:14] PROBLEM - configured eth on mw1014 is CRITICAL: Connection refused by host [10:03:24] PROBLEM - configured eth on mw1013 is CRITICAL: Connection refused by host [10:03:24] PROBLEM - dhclient process on mw1014 is CRITICAL: Connection refused by host [10:03:34] PROBLEM - dhclient process on mw1013 is CRITICAL: Connection refused by host [10:03:45] PROBLEM - nutcracker port on mw1014 is CRITICAL: Connection refused by host [10:03:45] <_joe_> Timing-Allow-Origin [10:03:50] <_joe_> wow, this is new [10:03:55] PROBLEM - nutcracker process on mw1014 is CRITICAL: Connection refused by host [10:03:55] PROBLEM - nutcracker port on mw1013 is CRITICAL: Connection refused by host [10:04:14] PROBLEM - nutcracker process on mw1013 is CRITICAL: Connection refused by host [10:04:14] PROBLEM - puppet last run on mw1014 is CRITICAL: Connection refused by host [10:04:24] PROBLEM - puppet last run on mw1013 is CRITICAL: Connection refused by host [10:04:24] PROBLEM - salt-minion processes on mw1014 is CRITICAL: Connection refused by host [10:04:34] PROBLEM - salt-minion processes on mw1013 is CRITICAL: Connection refused by host [10:04:44] PROBLEM - DPKG on mw1014 is CRITICAL: Connection refused by host [10:04:47] _joe_: sorry that was on the beta cluster [10:04:50] (03PS2) 10Filippo Giunchedi: admin: awight stats/hive access [puppet] - 10https://gerrit.wikimedia.org/r/181556 [10:04:55] PROBLEM - Disk space on mw1014 is CRITICAL: Connection refused by host [10:04:55] PROBLEM - DPKG on mw1013 is CRITICAL: Connection refused by host [10:06:40] _joe_: I thought it landed in operations/puppet.git already :D [10:07:53] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] admin: awight stats/hive access [puppet] - 10https://gerrit.wikimedia.org/r/181556 (owner: 10Filippo Giunchedi) [10:12:25] RECOVERY - configured eth on mw1013 is OK: NRPE: Unable to read output [10:12:25] RECOVERY - salt-minion processes on mw1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:12:35] RECOVERY - dhclient process on mw1013 is OK: PROCS OK: 0 processes with command name dhclient [10:12:45] RECOVERY - Disk space on mw1013 is OK: DISK OK [10:12:54] RECOVERY - DPKG on mw1013 is OK: All packages OK [10:13:05] RECOVERY - nutcracker port on mw1013 is OK: TCP OK - 0.000 second response time on port 11212 [10:13:14] RECOVERY - RAID on mw1013 is OK: OK: no RAID installed [10:13:15] RECOVERY - nutcracker process on mw1013 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [10:13:25] RECOVERY - configured eth on mw1014 is OK: NRPE: Unable to read output [10:13:34] RECOVERY - salt-minion processes on mw1014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:13:35] RECOVERY - dhclient process on mw1014 is OK: PROCS OK: 0 processes with command name dhclient [10:13:54] RECOVERY - DPKG on mw1014 is OK: All packages OK [10:13:55] RECOVERY - nutcracker port on mw1014 is OK: TCP OK - 0.000 second response time on port 11212 [10:14:04] RECOVERY - Disk space on mw1014 is OK: DISK OK 
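The replication-delay alerts above (db1016 at 00:52, the check_mysql SLOW_SLAVE alert on db1008 at 09:55 with "Seconds Behind Master: 650") all boil down to reading Seconds_Behind_Master from SHOW SLAVE STATUS. A minimal sketch of such a check, assuming local credentials in a defaults file; the thresholds are arbitrary examples and the real checks are the Icinga check_mysql plugin and NRPE checks, not this script.

```
#!/usr/bin/env python
"""Minimal replication-lag check, in the spirit of the MySQL Slave Delay
and check_mysql alerts above (00:52, 09:55).

Sketch only: credentials file and thresholds are placeholders.
"""
import sys
import MySQLdb
import MySQLdb.cursors

WARN, CRIT = 60, 300  # seconds; arbitrary example thresholds

def main():
    conn = MySQLdb.connect(read_default_file='/root/.my.cnf')
    cur = conn.cursor(MySQLdb.cursors.DictCursor)
    cur.execute('SHOW SLAVE STATUS')
    status = cur.fetchone()
    conn.close()
    if not status:
        print('UNKNOWN: not a replica')
        return 3
    lag = status['Seconds_Behind_Master']
    if lag is None:
        print('CRITICAL: replication stopped')
        return 2
    lag = int(lag)
    if lag >= CRIT:
        print('CRITICAL: replication delay %d seconds' % lag)
        return 2
    if lag >= WARN:
        print('WARNING: replication delay %d seconds' % lag)
        return 1
    print('OK: replication delay %d seconds' % lag)
    return 0

if __name__ == '__main__':
    sys.exit(main())
```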
[10:14:05] RECOVERY - RAID on mw1014 is OK: OK: no RAID installed [10:14:14] RECOVERY - nutcracker process on mw1014 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [10:14:36] PROBLEM - puppet last run on mw1013 is CRITICAL: CRITICAL: Puppet has 4 failures [10:15:35] PROBLEM - puppet last run on mw1014 is CRITICAL: CRITICAL: Puppet has 6 failures [10:16:05] PROBLEM - NTP on mw1014 is CRITICAL: NTP CRITICAL: Offset unknown [10:16:25] PROBLEM - NTP on mw1013 is CRITICAL: NTP CRITICAL: Offset unknown [10:18:44] RECOVERY - NTP on mw1013 is OK: NTP OK: Offset 0.001754045486 secs [10:19:00] !log reboot ms-be2003, deleted LD should disappear [10:19:07] Logged the message, Master [10:19:35] RECOVERY - NTP on mw1014 is OK: NTP OK: Offset -0.006245732307 secs [10:22:54] RECOVERY - puppet last run on mw1013 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [10:31:45] RECOVERY - puppet last run on mw1014 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [10:47:58] (03PS1) 10Hashar: contint: install Java 8 on Trusty servers [puppet] - 10https://gerrit.wikimedia.org/r/183222 [10:49:08] <_joe_> hashar: why are you doing this? [10:49:58] <_joe_> I mean, we do have an experimental java 8 package, but it's thorougly untested [10:51:25] "wikidata/gremlin is going to need Java 8 soon, we have a Jenkins job that runs maven to build it and thus need Java 8 installed on the CI slaves." [10:52:45] <_joe_> yes I've read the commit message [10:53:20] <_joe_> I don't really see how running CI with a basically untested java env can be productive [10:54:00] (03PS1) 10Alexandros Kosiaris: Add the apertium.svc.eqiad.wmnet DNS record [dns] - 10https://gerrit.wikimedia.org/r/183223 [10:55:20] _joe_: that follow up a discussion with manybubble. We will need to run the maven job for wikidata/gremlin to use Java 8. [10:56:00] _joe_: the default java version is still java 7 (set via debian alternative). Each maven Jenkins job explicitly set the java version to use, so the existing jobs are kept on v 7 [10:58:08] (03PS1) 10Alexandros Kosiaris: Apply the role::apertium::production role to sca cluster [puppet] - 10https://gerrit.wikimedia.org/r/183224 [11:00:20] <_joe_> akosiaris: fancy trying to use the "role" keyword there? [11:00:48] <_joe_> (I'd probably create a super-role including mathoid, citoid and apertium as well) [11:02:22] _joe_: I am up for it. [11:02:22] I am planning btw on creating a *oid module [11:02:29] most of this services look pretty much alike [11:02:37] PROBLEM - puppet last run on mw1011 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:02:37] PROBLEM - salt-minion processes on mw1012 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:02:41] <_joe_> and use hiera to configure it? [11:02:48] :-D [11:02:50] (03CR) 10Filippo Giunchedi: "yeah that's true, we're going to have a carbon-c-relay class anyway for the relay listening on standard port 2003 as opposed to the local " [puppet] - 10https://gerrit.wikimedia.org/r/181080 (owner: 10Filippo Giunchedi) [11:02:57] PROBLEM - salt-minion processes on mw1011 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:03:07] PROBLEM - DPKG on mw1012 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:03:17] PROBLEM - DPKG on mw1011 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[11:03:17] PROBLEM - Disk space on mw1012 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:03:36] PROBLEM - Disk space on mw1011 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:03:37] PROBLEM - RAID on mw1012 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:03:56] PROBLEM - RAID on mw1011 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:04:07] PROBLEM - configured eth on mw1012 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:04:26] PROBLEM - configured eth on mw1011 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:04:26] PROBLEM - dhclient process on mw1012 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:04:36] PROBLEM - dhclient process on mw1011 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:04:46] PROBLEM - nutcracker port on mw1012 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:04:56] PROBLEM - nutcracker process on mw1012 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:04:56] PROBLEM - nutcracker port on mw1011 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:05:06] PROBLEM - puppet last run on mw1012 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:05:06] PROBLEM - nutcracker process on mw1011 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:05:46] RECOVERY - very high load average likely xfs on ms-be2003 is OK: OK - load average: 14.45, 3.94, 1.36 [11:07:07] RECOVERY - nutcracker port on mw1011 is OK: TCP OK - 0.000 second response time on port 11212 [11:07:18] RECOVERY - RAID on mw1011 is OK: OK: no RAID installed [11:07:26] RECOVERY - nutcracker process on mw1011 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [11:07:27] RECOVERY - salt-minion processes on mw1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:07:47] RECOVERY - configured eth on mw1011 is OK: NRPE: Unable to read output [11:07:57] RECOVERY - DPKG on mw1011 is OK: All packages OK [11:07:57] RECOVERY - dhclient process on mw1011 is OK: PROCS OK: 0 processes with command name dhclient [11:08:07] RECOVERY - Disk space on mw1011 is OK: DISK OK [11:08:07] RECOVERY - nutcracker port on mw1012 is OK: TCP OK - 0.000 second response time on port 11212 [11:08:16] RECOVERY - RAID on mw1012 is OK: OK: no RAID installed [11:08:17] RECOVERY - nutcracker process on mw1012 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [11:08:26] RECOVERY - salt-minion processes on mw1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:08:46] RECOVERY - configured eth on mw1012 is OK: NRPE: Unable to read output [11:08:47] RECOVERY - DPKG on mw1012 is OK: All packages OK [11:08:56] RECOVERY - dhclient process on mw1012 is OK: PROCS OK: 0 processes with command name dhclient [11:09:06] RECOVERY - Disk space on mw1012 is OK: DISK OK [11:09:37] PROBLEM - puppet last run on mw1012 is CRITICAL: CRITICAL: Puppet has 4 failures [11:09:53] akosiaris: oid module would be really nice [11:10:39] akosiaris: also on beta cluster we ended up with an instance per oid service, would be nice to have them to share a common instance (or maybe a couple of them) [11:10:57] yeah, it would mirror production better [11:12:16] and also have to get rid of the lame Jenkins based solution to deploy them [11:12:27] Bryan told me we could certainly use 
Trebuchet now [11:12:44] yeah, it works a lot better now [11:12:53] so a merge on one of the repo would trigger a thin Jenkins job that just invoke Trebuchet and refresh the code on the shared oid instances [11:13:11] I was about to ask exactly that [11:13:34] currently each oid instance has to be made a jenkins slave [11:13:35] ok, I 'll put that in my TODO for when creating that module [11:13:48] + we need a custom job for each oid service that essentially duplicate what is in trebuchet [11:13:55] (03PS2) 10Alexandros Kosiaris: Apply the role::apertium::production role to sca cluster [puppet] - 10https://gerrit.wikimedia.org/r/183224 [11:14:08] _joe_: ^ wanna have a look ? [11:14:17] we can probably pair it together when we both have some cycles :-] [11:14:23] meanwhile, it is lunch time! [11:14:28] hashar: same here [11:14:46] hashar: We can probably get around to do it while in SF [11:14:59] <_joe_> akosiaris: role::mediawiki::apertium_port ? [11:15:07] sigh... [11:15:12] fixing... [11:15:30] <_joe_> also, you need to define it as a class parameter [11:15:41] <_joe_> or hiera won't look it up automagically [11:15:42] akosiaris: maybe :-] [11:15:56] lunch [11:16:01] <_joe_> (or use $apertium_port = hiera('yourvariable')) [11:16:17] _joe_: gonna go for the later. I 'd rather not parameterize the role class [11:16:37] <_joe_> akosiaris: that's why I offered you the alternative :) [11:17:37] RECOVERY - puppet last run on mw1011 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [11:18:47] RECOVERY - puppet last run on mw1012 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [11:20:09] (03PS3) 10Alexandros Kosiaris: Apply the role::apertium::production role to sca cluster [puppet] - 10https://gerrit.wikimedia.org/r/183224 [11:22:33] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "minor comments, lgtm otherwise" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/183224 (owner: 10Alexandros Kosiaris) [11:25:48] (03Draft1) 10Filippo Giunchedi: add missing build-depends [debs/txstatsd] - 10https://gerrit.wikimedia.org/r/183066 [11:27:37] (03PS1) 10Giuseppe Lavagetto: mediawiki: create "canary" pools to allow testing on subclusters [puppet] - 10https://gerrit.wikimedia.org/r/183226 [11:27:39] (03PS1) 10Giuseppe Lavagetto: mediawiki: use the worker mpm on the canary clusters [puppet] - 10https://gerrit.wikimedia.org/r/183227 [11:31:55] (03CR) 10Alexandros Kosiaris: Apply the role::apertium::production role to sca cluster (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/183224 (owner: 10Alexandros Kosiaris) [11:33:41] (03PS4) 10Alexandros Kosiaris: Apply the role::apertium::production role to sca cluster [puppet] - 10https://gerrit.wikimedia.org/r/183224 [11:36:14] (03CR) 10Giuseppe Lavagetto: [C: 032] Apply the role::apertium::production role to sca cluster [puppet] - 10https://gerrit.wikimedia.org/r/183224 (owner: 10Alexandros Kosiaris) [11:42:54] <_joe_> !log reimaging mw1009-mw1012 [11:42:59] Logged the message, Master [11:50:58] (03PS3) 10Filippo Giunchedi: graphite: introduce local c-relay [puppet] - 10https://gerrit.wikimedia.org/r/181080 [11:51:43] (03CR) 10Filippo Giunchedi: [C: 04-1] "on hold until we have graphite hw in place with trusty (carbon-c-relay is trusty only)" [puppet] - 10https://gerrit.wikimedia.org/r/181080 (owner: 10Filippo Giunchedi) [11:55:25] (03PS5) 10Alexandros Kosiaris: Apply the role::apertium::production role to sca cluster [puppet] - 10https://gerrit.wikimedia.org/r/183224 [11:59:32] (03PS1) 
10Yuvipanda: tools: Redirect tools-static to tools.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/183230 [11:59:34] (03PS1) 10Yuvipanda: tools: Make tools-static serve from www/static [puppet] - 10https://gerrit.wikimedia.org/r/183231 [12:01:05] (03CR) 10Yuvipanda: [C: 032] tools: Redirect tools-static to tools.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/183230 (owner: 10Yuvipanda) [12:01:23] (03CR) 10Yuvipanda: [C: 032] tools: Make tools-static serve from www/static [puppet] - 10https://gerrit.wikimedia.org/r/183231 (owner: 10Yuvipanda) [12:10:14] (03PS1) 10Alexandros Kosiaris: LVS IP assignment for cxserver [dns] - 10https://gerrit.wikimedia.org/r/183232 [12:10:42] (03CR) 10Alexandros Kosiaris: [C: 032] Apply the role::apertium::production role to sca cluster [puppet] - 10https://gerrit.wikimedia.org/r/183224 (owner: 10Alexandros Kosiaris) [12:11:45] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 4 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [12:11:45] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 4 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [12:12:19] oh damn [12:12:20] that’s me [12:12:56] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [12:13:16] akosiaris: I merged yours too. hope that’s ok. [12:14:04] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [12:14:59] YuviPanda: thanks, I was about to [12:24:23] PROBLEM - puppet last run on mw1010 is CRITICAL: CRITICAL: Puppet has 8 failures [12:29:12] PROBLEM - puppet last run on mw1009 is CRITICAL: CRITICAL: Puppet has 2 failures [12:30:22] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [12:32:52] (03PS1) 10Yuvipanda: tools: Set default charset of tools-static to utf-8 [puppet] - 10https://gerrit.wikimedia.org/r/183235 [12:33:33] (03PS2) 10Yuvipanda: tools: Set default charset of tools-static to utf-8 [puppet] - 10https://gerrit.wikimedia.org/r/183235 [12:33:42] (03PS1) 10Reedy: Remove /home/wikipedia/conf config stuff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183236 [12:34:08] (03CR) 10Reedy: [C: 032] Remove /home/wikipedia/conf config stuff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183236 (owner: 10Reedy) [12:34:12] (03Merged) 10jenkins-bot: Remove /home/wikipedia/conf config stuff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183236 (owner: 10Reedy) [12:34:42] RECOVERY - puppet last run on mw1010 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [12:34:47] (03CR) 10Yuvipanda: [C: 032] tools: Set default charset of tools-static to utf-8 [puppet] - 10https://gerrit.wikimedia.org/r/183235 (owner: 10Yuvipanda) [13:26:28] (03PS1) 10Alexandros Kosiaris: Cleanup cxserver module/role [puppet] - 10https://gerrit.wikimedia.org/r/183240 [13:26:30] (03PS1) 10Alexandros Kosiaris: Use hiera for cxserver port [puppet] - 10https://gerrit.wikimedia.org/r/183241 [13:26:32] (03PS1) 10Alexandros Kosiaris: Apply cxserver role to sca [puppet] - 10https://gerrit.wikimedia.org/r/183242 [13:26:34] (03PS1) 10Alexandros Kosiaris: LVS for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/183243 [13:27:50] (03CR) 10jenkins-bot: [V: 04-1] LVS for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/183243 (owner: 10Alexandros Kosiaris) [13:29:41] (03PS2) 10Alexandros Kosiaris: LVS for cxserver [puppet] - 
10https://gerrit.wikimedia.org/r/183243 [14:03:07] (03CR) 10Ottomata: "Sure, I know one can override, but I think by default it should know how to talk to archiva and only archiva." [puppet] - 10https://gerrit.wikimedia.org/r/170668 (owner: 10QChris) [14:04:18] (03PS3) 10Ottomata: Use monitoring::graphite_threshold for varnishkafka delivery error check [puppet] - 10https://gerrit.wikimedia.org/r/182860 [14:06:00] (03CR) 10Ottomata: [C: 032] Use monitoring::graphite_threshold for varnishkafka delivery error check [puppet] - 10https://gerrit.wikimedia.org/r/182860 (owner: 10Ottomata) [14:09:53] (03PS1) 10Ottomata: Change name of varnishkafka drerr graphite_threshold check [puppet] - 10https://gerrit.wikimedia.org/r/183250 [14:10:01] PROBLEM - puppet last run on amssq40 is CRITICAL: CRITICAL: puppet fail [14:10:01] PROBLEM - puppet last run on cp3022 is CRITICAL: CRITICAL: puppet fail [14:10:01] PROBLEM - puppet last run on amssq36 is CRITICAL: CRITICAL: puppet fail [14:10:10] PROBLEM - puppet last run on cp1050 is CRITICAL: CRITICAL: puppet fail [14:10:11] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: puppet fail [14:10:11] PROBLEM - puppet last run on amssq51 is CRITICAL: CRITICAL: puppet fail [14:10:21] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: puppet fail [14:10:21] PROBLEM - puppet last run on cp1046 is CRITICAL: CRITICAL: puppet fail [14:10:31] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: puppet fail [14:10:51] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: puppet fail [14:10:54] that's me [14:10:56] am fixing [14:11:00] PROBLEM - puppet last run on amssq56 is CRITICAL: CRITICAL: puppet fail [14:11:12] (03CR) 10Ottomata: [C: 032] Change name of varnishkafka drerr graphite_threshold check [puppet] - 10https://gerrit.wikimedia.org/r/183250 (owner: 10Ottomata) [14:11:21] PROBLEM - puppet last run on cp1062 is CRITICAL: CRITICAL: puppet fail [14:11:40] PROBLEM - puppet last run on cp1048 is CRITICAL: CRITICAL: puppet fail [14:11:50] PROBLEM - puppet last run on amssq42 is CRITICAL: CRITICAL: puppet fail [14:12:00] PROBLEM - puppet last run on cp1063 is CRITICAL: CRITICAL: puppet fail [14:12:01] PROBLEM - puppet last run on cp1038 is CRITICAL: CRITICAL: puppet fail [14:12:21] PROBLEM - puppet last run on amssq41 is CRITICAL: CRITICAL: puppet fail [14:12:51] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: puppet fail [14:13:11] PROBLEM - puppet last run on amssq62 is CRITICAL: CRITICAL: puppet fail [14:13:21] PROBLEM - puppet last run on cp1060 is CRITICAL: CRITICAL: puppet fail [14:13:30] PROBLEM - puppet last run on amssq38 is CRITICAL: CRITICAL: puppet fail [14:14:01] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: puppet fail [14:14:22] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: puppet fail [14:14:40] RECOVERY - puppet last run on cp3022 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [14:16:47] !log xtrabackup clone db1027 to db1015 [14:16:52] Logged the message, Master [14:22:52] ah, running to a cafe, will be back shortly. I just merged an icinga check. i hope it works as it is supposed to and doesn't spam the channel here while I am gone... 
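The monitoring::graphite_threshold work above (14:04–14:11, the varnishkafka kafka_drerr check) alerts when some percentage of recent datapoints for a graphite metric sit above a threshold, which is the format the resulting alerts use ("N% of data above the critical threshold [30.0]"). A rough sketch of that idea; the graphite host and metric path are placeholders, and this illustrates the concept rather than reproducing the actual check_graphite plugin.

```
#!/usr/bin/env python
"""Rough sketch of a graphite-threshold style check, as discussed above
for the varnishkafka delivery-error alerts (14:04-14:15).

Assumptions: graphite host and metric path are placeholders; Python 2
urllib2 is used since that is what the 2015-era hosts ran.
"""
import json
import urllib2

GRAPHITE = 'http://graphite.example.org'          # placeholder host
METRIC = 'kafka.varnishkafka.cp3003.kafka_drerr'  # hypothetical metric path
CRITICAL_VALUE = 30.0   # threshold on the metric itself
CRITICAL_PCT = 10.0     # % of datapoints allowed above it

def check():
    url = '%s/render?target=%s&from=-30min&format=json' % (GRAPHITE, METRIC)
    data = json.load(urllib2.urlopen(url))
    if not data:
        return 'UNKNOWN: no such metric'
    series = data[0]['datapoints']          # list of [value, timestamp]
    values = [v for v, ts in series if v is not None]
    if not values:
        return 'UNKNOWN: no datapoints'
    over = [v for v in values if v > CRITICAL_VALUE]
    pct = 100.0 * len(over) / len(values)
    if pct > CRITICAL_PCT:
        return 'CRITICAL: %.2f%% of data above the critical threshold [%s]' % (pct, CRITICAL_VALUE)
    return 'OK: %.2f%% of data above the critical threshold [%s]' % (pct, CRITICAL_VALUE)

print(check())
```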
[14:24:57] (03CR) 10KartikMistry: [C: 031] Use hiera for cxserver port [puppet] - 10https://gerrit.wikimedia.org/r/183241 (owner: 10Alexandros Kosiaris) [14:25:45] (03PS5) 10Faidon Liambotis: redis: Get rid of unused (and unhelpful) module parameters [puppet] - 10https://gerrit.wikimedia.org/r/183060 (owner: 10Ori.livneh) [14:25:52] (03CR) 10Faidon Liambotis: [C: 032] redis: Get rid of unused (and unhelpful) module parameters [puppet] - 10https://gerrit.wikimedia.org/r/183060 (owner: 10Ori.livneh) [14:26:33] <_joe_> akosiaris: give me 10 minutes to review your patches [14:27:32] (03CR) 10KartikMistry: [C: 031] Cleanup cxserver module/role [puppet] - 10https://gerrit.wikimedia.org/r/183240 (owner: 10Alexandros Kosiaris) [14:28:30] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "LGTM but see the comment." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/183242 (owner: 10Alexandros Kosiaris) [14:29:25] RECOVERY - puppet last run on cp1050 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [14:29:26] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [14:29:35] RECOVERY - puppet last run on cp1062 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [14:29:36] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [14:29:46] RECOVERY - puppet last run on cp1046 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:29:56] RECOVERY - puppet last run on cp1063 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [14:30:06] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [14:30:07] (03CR) 10Giuseppe Lavagetto: [C: 031] LVS for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/183243 (owner: 10Alexandros Kosiaris) [14:30:15] RECOVERY - puppet last run on amssq56 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [14:30:26] RECOVERY - puppet last run on cp1048 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [14:30:26] RECOVERY - puppet last run on amssq36 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [14:30:26] RECOVERY - puppet last run on amssq41 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [14:30:45] RECOVERY - puppet last run on amssq51 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:31:06] RECOVERY - puppet last run on cp1038 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [14:31:15] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [14:31:16] RECOVERY - puppet last run on amssq42 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [14:31:35] RECOVERY - puppet last run on cp1060 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [14:31:36] RECOVERY - puppet last run on amssq40 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:32:27] RECOVERY - puppet last run on amssq62 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:32:47] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [14:32:55] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is 
currently enabled, last run 41 seconds ago with 0 failures [14:33:36] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [14:33:56] RECOVERY - puppet last run on amssq38 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [14:34:59] <_joe_> can we please move wikibugs back here? [14:35:10] I've submitted a patch already [14:35:27] https://gerrit.wikimedia.org/r/183247 [14:36:00] Getting reports of trouble with SUL in -en-help - ping legoktm [14:36:49] marktraceur: Troubles with SUL... oh I never even... :P [14:36:52] What kinds of troubles? [14:37:22] hoo: Workster says he gets a server error when he tries. Trying to get more information out of him. [14:37:34] Tries what? [14:37:35] 2015-01-07 - 08:33:24 It ends up just saying that the servers are experiencing a technical problem [14:37:53] <_joe_> so a 503 [14:37:55] on login? [14:37:58] 2015-01-07 - 08:37:53 Error: 503, Service Unavailable at Wed, 07 Jan 2015 14:37:38 GMT [14:38:14] On POST http://en.wikipedia.org/w/index.php?title=Special:MergeAccount&action=submit [14:38:17] (03CR) 10QChris: "> but I think by default it should know how to talk to" [puppet] - 10https://gerrit.wikimedia.org/r/170668 (owner: 10QChris) [14:38:21] _joe_: Remember you wanted to help me debug exactly that? [14:38:31] <_joe_> hoo: yes [14:38:39] Now's your chance! [14:38:55] <_joe_> but I thought o.ri and legoktm were able to find a solution [14:39:02] Where they? [14:39:07] * were [14:39:21] <_joe_> I'm not sure, tbh [14:39:30] <_joe_> but, is the user still online? [14:39:30] Apparently not [14:39:35] Yeah, he's in -en-help [14:39:41] I didn't see any patch regarding that [14:40:01] #wikipedia-en-help [14:40:05] Oh, good. [14:40:47] <_joe_> let's see if this is related [14:40:55] <_joe_> !log upgrading hhvm on testwiki [14:40:58] Logged the message, Master [14:45:07] (03PS2) 10Hashar: contint: install Java 8 on Trusty servers [puppet] - 10https://gerrit.wikimedia.org/r/183222 [14:45:47] (03CR) 10Hashar: [C: 031 V: 032] "Applied on contint puppetmaster and confirmed to work on Trusty as well as Precise instances. The agents are still running on java 7 as w" [puppet] - 10https://gerrit.wikimedia.org/r/183222 (owner: 10Hashar) [14:57:43] _joe_: I tried to track this down for quite some time now, but I'm out of ideas [14:58:07] <_joe_> hoo: I may have an idea [15:00:22] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00333333333333 [15:00:45] _joe_: Wow... that's an unexpected bug [15:01:49] <_joe_> hoo: https://phabricator.wikimedia.org/T85812 [15:03:04] Ah nice... any idea on how long that will take? [15:03:17] I guess the package is there yet and being tested right now [15:04:25] (03CR) 10Hashar: [C: 031] "As discussed during our weekly Releng checkin." [puppet] - 10https://gerrit.wikimedia.org/r/183062 (owner: 10John F. 
Lewis) [15:04:36] <_joe_> hoo: this week it should be everywhere [15:04:52] :) [15:10:32] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [15:13:59] (03CR) 10Alexandros Kosiaris: Apply cxserver role to sca (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/183242 (owner: 10Alexandros Kosiaris) [15:23:10] PROBLEM - puppet last run on db2042 is CRITICAL: CRITICAL: Puppet has 9 failures [15:25:30] RECOVERY - puppet last run on db2042 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [15:27:03] (03PS2) 10Giuseppe Lavagetto: Make upload.wikimedia.org set Timing-Allow-Origin [puppet] - 10https://gerrit.wikimedia.org/r/181405 (owner: 10Unicodesnowman) [15:28:54] (03CR) 10Giuseppe Lavagetto: [C: 032] Make upload.wikimedia.org set Timing-Allow-Origin [puppet] - 10https://gerrit.wikimedia.org/r/181405 (owner: 10Unicodesnowman) [15:31:05] (03PS1) 10Filippo Giunchedi: graphite: limit uwsgi workers memory [puppet] - 10https://gerrit.wikimedia.org/r/183256 [15:31:20] 3Wikimedia-General-or-Unknown, operations, WMF-Legal: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270#959524 (10Aklapper) [15:34:39] _joe_: paravoid ^ is back here [15:34:52] 3ops-core: Update the php5 luasandbox package for trusty - https://phabricator.wikimedia.org/T85925#959602 (10Joe) If this is the case, we may just need to ensure all production is at the same update level. [15:35:03] <_joe_> YuviPanda: thanks [15:36:55] !log xtrabackup clone codfw slaves db2034 db2035 db2036 db2037 db2038 db2039 db2040 from other codfw slaves [15:37:01] Logged the message, Master [15:37:41] springle: replicate all the things! [15:37:42] :) [15:38:06] :) [15:39:29] (03CR) 10Filippo Giunchedi: "how this will be handled on the deployment/scap side?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/183226 (owner: 10Giuseppe Lavagetto) [15:39:31] (03PS2) 10Alexandros Kosiaris: Use hiera for cxserver port [puppet] - 10https://gerrit.wikimedia.org/r/183241 [15:39:33] (03PS2) 10Alexandros Kosiaris: Cleanup cxserver module/role [puppet] - 10https://gerrit.wikimedia.org/r/183240 [15:39:35] (03PS3) 10Alexandros Kosiaris: LVS for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/183243 [15:39:37] (03PS2) 10Alexandros Kosiaris: Apply cxserver role to sca [puppet] - 10https://gerrit.wikimedia.org/r/183242 [15:41:13] 3ops-codfw: ms-be2003.codfw.wmnet: slot=3 dev=sdd failed - https://phabricator.wikimedia.org/T85591#959760 (10Papaul) yes working on it will update the ticket in a minute [15:42:58] (03PS1) 10Springle: assign a round of codfw slaves to shards [puppet] - 10https://gerrit.wikimedia.org/r/183258 [15:46:12] (03CR) 10Springle: [C: 04-2] "Wait until DB cloning is done lest icinga moan and whinge." [puppet] - 10https://gerrit.wikimedia.org/r/183258 (owner: 10Springle) [15:50:39] * anomie sees nothing for SWAT [15:52:06] 3Phabricator, operations: Create #site-incident tag and use it for incident reports - https://phabricator.wikimedia.org/T85889#959802 (10Aklapper) See previous discussion in T929. Might want to close this as a dup and argument there instead, to have discussion in one place? [15:53:39] <^d> anomie: I'll do it today! [15:57:01] (03CR) 10Hashar: "I am not sure why you removed the $common_settings variable, I guess you don't want the hhvm::init to be cluttered with too many options. 
" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/178806 (https://phabricator.wikimedia.org/T75356) (owner: 10Hashar) [16:00:04] manybubbles, anomie, ^d, marktraceur: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150107T1600). [16:01:48] manybubbles: ready? :) [16:30:56] 3ops-codfw: racktables output for codfw - https://phabricator.wikimedia.org/T86019#959904 (10Cmjohnson) 3NEW a:3RobH [16:32:30] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds [16:32:38] 3MediaWiki-Vagrant, operations: mwscript importDump.php fails to properly import some valid xml exports - https://phabricator.wikimedia.org/T73354#959914 (10Gilles) [16:32:40] 3ops-core: Update the php5 luasandbox package for trusty - https://phabricator.wikimedia.org/T85925#959911 (10Gilles) 5Open>3Invalid a:3Gilles Indeed @Anomie it seems like the vagrant role didn't upgrade the package on its own, I had 2.0-7 installed, not 2.0-7+wmf2.1. Once updated via apt, the issue I was... [16:33:22] 3MediaWiki-Vagrant, operations: mwscript importDump.php fails to properly import some valid xml exports - https://phabricator.wikimedia.org/T73354#959915 (10Gilles) 5Open>3Resolved a:3Gilles Updating to the latest php-luasandbox with apt solved the issue. [16:36:26] (03PS1) 10Jgreen: add neutral DMARC policy to wikipedia.* domains [dns] - 10https://gerrit.wikimedia.org/r/183262 [16:39:36] chasemp: Who's project lead of "assigned-setter" for things like https://phabricator.wikimedia.org/T36685. Resetting priority of those and others seem like a paradox. Someoone should be responsible for triaging it and assigning it for someone to work on, right? It doesn't happen by itself (volunteering is nice but that doesn't scale). [16:40:29] E.g. Who is the James_F of operations? [16:40:49] priorities are very large topic but in general ops has their own priority mandates and then there is a lot of things that come in from other channels [16:41:00] but essentially teh on duty person is the in-the-moment assigned setter [16:41:06] but for that issue there is no one to assign it to [16:41:38] no one in ops is actually going to work on that now so there is no point in assigning it to someone when we know they are not actually going to work on it [16:42:11] and afaik that task isn't anywhere on anyone in ops's list of things in the near future so needs volunteer means, someone could do this but it's not on any list to get done now [16:42:21] that is my understanding [16:42:50] if there was a wishlist priority I would have set that [16:43:25] oh hey, bug notices! [16:45:06] chasemp: That sounds like there's a problem in the workflow, or a need for more man power? There will always be higher priorities. So might as well be wontfix or "will be done by someone who should be doing higher priority assigned work but is choosing to do this instead". [16:45:17] Does someone want to look in the transcode error logs for me? 
[16:45:18] https://commons.wikimedia.org/wiki/File:Logaritmer_4_N%C3%A5gra_fler_exempel.webm [16:45:26] well yes, there is limited resources [16:45:31] Pages with that error in them seem to be causing some trouble for a user [16:45:35] and those resources are pretty well engaged [16:45:40] on top of that I really dislike priority inflation [16:45:42] I can't repro, but maybe the logs have some info [16:45:54] if it's not actually gonig to be worked on, if it's not actually on any road map anywhere for anyone [16:46:01] then it has no priority, and it needs someone to volunteer [16:46:05] yes essentially [16:46:48] feel free to escalate to mark to get it on ops priority list or appeal that it is more important than it has been treated [16:46:57] chasemp: offtopic, but, after reading your email, should I just remove the #ops project from https://phabricator.wikimedia.org/T85936 (and leave the #ops-access-requests one)? [16:47:07] but an open task since 2012 with no assignee and no outlook for actual resources is needs volunteer [16:47:29] greg-g: I feel like yes, but so far no feedback from other ops so I will cleanup there is people want that workflow [16:47:33] (for this specific issues) having a recent changes feed is an expected aspect of any production-run wiki. in the category of "you dont' set up a wiki without it". It's low priority, sure, but should fit into a workflow that doesn't depend on volunteering (which is hard anyway since there's practically no volunteer opsen) [16:47:45] chasemp: /me nods [16:48:07] anyway, #not-my-problem. Was curious about how it works, and thanks for letting me now. [16:48:08] Krinkle: I'm not arguing it's not something we want, I'm saying it's not something that right now has //any resources assigned ever// [16:48:23] needs volunteer means if it's important someone who knows why should get involved [16:48:24] imo [16:49:12] I really have no strong feelings on that issue :) but I'll cut down priority on anything where it doesn't reflect reality when i'm on ops duty as long as I understand that to be my role [16:49:29] sure [16:50:14] do I keep breaking graphite with complicated grafana queries? [16:50:19] greg-g: on that email did you grok it ok? does that process make sense to you? as a consumer of it [16:50:35] ottomata: maybe graphite will stall with a seriously complex (hits a lot of whisper files) query sometimes [16:50:41] chasemp: it does [16:51:13] chasemp: to me, as I understand it, I can either be smart and put the right project (eg: ops-access-requests) or be dumb and just put #ops and let the Phab Duty person triage it [16:51:46] (right?) [16:52:21] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.007 second response time [16:52:24] I think it hasn't been decided exactly [16:52:29] in general yes [16:52:30] yeah but uh totally bork ^ [16:53:31] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.023 second response time [16:54:23] 3ops-codfw: ms-be2003.codfw.wmnet: slot=3 dev=sdd failed - https://phabricator.wikimedia.org/T85591#959977 (10Papaul) 5Open>3Resolved a:3Papaul Bad drive replaced, bad drive in shipping for return. 
[16:55:49] btw https://gerrit.wikimedia.org/r/#/c/183256/ should help graphite-web not going crazy over big requests [16:55:58] which I suspect is what happened above [17:00:21] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.01 [17:05:30] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [17:05:31] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [17:09:38] 3Wikimedia-General-or-Unknown, operations, WMF-Legal: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270#960005 (10chasemp) @mark, do you know what license we should put in a LICENSE file in the puppet repo? [17:11:31] 3operations, Parsoid: Parsoid should use SO_REUSEADDR when it binds to its port - https://phabricator.wikimedia.org/T75395#960011 (10chasemp) >>! In T75395#956056, @GWicke wrote: >>>! In T75395#834249, @GWicke wrote: >> @yuvipanda: We don't have root on those boxes, so that would need to be somebody in ops. > >... [17:12:31] (03CR) 10Ori.livneh: varnish: Route requests with 'X-Wikimedia-Debug=1' to test_wikipedia backend (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/183171 (owner: 10Ori.livneh) [17:13:32] (03PS4) 10Ori.livneh: varnish: Route requests with 'X-Wikimedia-Debug=1' to test_wikipedia backend [puppet] - 10https://gerrit.wikimedia.org/r/183171 [17:15:49] !log reboot ms-be2003 [17:15:51] Logged the message, Master [17:18:50] 3ops-codfw: ms-be2003.codfw.wmnet: slot=3 dev=sdd failed - https://phabricator.wikimedia.org/T85591#960019 (10fgiunchedi) 5Resolved>3Open [17:20:51] !log ori Synchronized php-1.25wmf12/extensions/EventLogging/modules/ext.eventLogging.core.js: I5470424: Correct events to send schema name (duration: 00m 06s) [17:20:55] Logged the message, Master [17:20:56] !log ori Synchronized php-1.25wmf13/extensions/EventLogging/modules/ext.eventLogging.core.js: I5470424: Correct events to send schema name (duration: 00m 05s) [17:20:59] Logged the message, Master [17:21:00] ^ nuria [17:25:37] 3operations, Parsoid: Parsoid should use SO_REUSEADDR when it binds to its port - https://phabricator.wikimedia.org/T75395#960024 (10GWicke) @chasemp, yes. [17:26:09] 3operations, Parsoid: Parsoid restarts not completely reliable with upstart - https://phabricator.wikimedia.org/T75395#960025 (10GWicke) a:3GWicke [17:28:15] ori: right here sorry [17:34:03] ori: holaaaa [17:34:11] 3ops-codfw: ms-be2003.codfw.wmnet: slot=3 dev=sdd failed - https://phabricator.wikimedia.org/T85591#960058 (10fgiunchedi) 5Open>3Resolved disk replaced and back in service, currently refilling ``` /dev/sdd1 1.9T 3.3G 1.9T 1% /srv/swift-storage/sdd1 ``` [17:43:18] (03PS2) 10Reedy: Rebuild beta apache config ontop of production config [puppet] - 10https://gerrit.wikimedia.org/r/173492 [17:48:36] 3ops-codfw: racktables output for codfw - https://phabricator.wikimedia.org/T86019#960076 (10RobH) a:5RobH>3Cmjohnson Discussed this with Chris in IRC, I'll upload the output to db1001, and also give instructions on how to generate this data without my input in the future (since its just an SQL query.) Gran... 
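T75395 above (17:11–17:26, "Parsoid should use SO_REUSEADDR when it binds to its port") is about restarts failing while the old listening socket lingers in TIME_WAIT. Parsoid itself is Node.js; the following is only a minimal Python illustration of what the socket option does, and the port number is an arbitrary example.

```
#!/usr/bin/env python
"""Illustration of SO_REUSEADDR, the option discussed for Parsoid in
T75395 above (17:11-17:26).

Sketch only: Parsoid is Node.js; this just demonstrates the flag.
"""
import socket

def make_listener(port):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Without SO_REUSEADDR, a quick stop/start can fail with
    # "address already in use" while the old socket sits in TIME_WAIT.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(('0.0.0.0', port))
    sock.listen(128)
    return sock

if __name__ == '__main__':
    listener = make_listener(8142)  # example port only
    print('listening on %s' % (listener.getsockname(),))
    listener.close()
```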
[17:53:15] 3operations, ContentTranslation-cxserver: Deploy apertium in production - https://phabricator.wikimedia.org/T86026#960088 (10akosiaris) 3NEW a:3akosiaris [17:54:08] (03PS1) 10Alexandros Kosiaris: Adjust the check_http command for apertium monitoring [puppet] - 10https://gerrit.wikimedia.org/r/183278 [17:54:37] 3operations, ContentTranslation-cxserver: Deploy apertium in production - https://phabricator.wikimedia.org/T86026#960088 (10akosiaris) Assigning to host: https://gerrit.wikimedia.org/r/#/c/183224/ (merged) DNS: https://gerrit.wikimedia.org/r/#/c/183223/ Monitoring fix: https://gerrit.wikimedia.org/r/183278 [17:54:46] 3operations, MediaWiki-Core-Team, MediaWiki-ResourceLoader: Bad cache stuck due to race condition with scap between different web servers - https://phabricator.wikimedia.org/T47877#960111 (10chasemp) @greg since this involves scap is this a release engineering thing? Would you be willing to kind of shepherd thi... [17:57:52] 3operations, MediaWiki-Core-Team, MediaWiki-ResourceLoader: Bad cache stuck due to race condition with scap between different web servers - https://phabricator.wikimedia.org/T47877#960136 (10bd808) I think this is really a bug about the fact that we don't depool/repool servers during a the rolling update process... [18:05:00] Out of curiosity - why are we having an exim queued bounce messages and an exim_queud_bounce_messages here http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&c=Miscellaneous+eqiad&h=polonium.wikimedia.org&tab=m&vn=&hide-hf=false&m=cpu_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [18:05:25] both show different results though [18:05:33] (03PS1) 10Ottomata: Fix for varnishkafka kafka_drerr graphite query for icinga check [puppet] - 10https://gerrit.wikimedia.org/r/183279 [18:06:34] (03PS2) 10Ottomata: Fix for varnishkafka kafka_drerr graphite query for icinga check [puppet] - 10https://gerrit.wikimedia.org/r/183279 [18:08:03] The script is modules/exim4/files/ganglia/exim-to-gmetric [18:09:07] Nemo_bis: but different results in both are more interesting [18:15:30] (03CR) 10Ottomata: [C: 032] Fix for varnishkafka kafka_drerr graphite query for icinga check [puppet] - 10https://gerrit.wikimedia.org/r/183279 (owner: 10Ottomata) [18:18:36] Hi... Anyone have any idea about the failing test that is keeping this patch from merging? it seems unrelated to the patch itself https://gerrit.wikimedia.org/r/#/c/173452 [18:19:11] AndyRussG: If it's MWTidy on HHVM: Ignore it [18:19:35] oh it's not [18:19:36] hoo: thanks! it's AutoLoaderTest [18:19:47] seems like your autoloader actually isn't correct [18:19:49] fix that [18:20:02] Yeah I mean that wouldn't surprise me, but I think it's unrelated to the patch [18:20:17] sure :D [18:20:33] So I'm wondering if it makes sense to override jenkins and force the merge, since the patch is needed and it'd be nice to get it on the train... [18:21:44] AndyRussG: Or quickly fix the autoloader [18:21:48] will take you a minute or so [18:21:56] I can +2 if you need [18:22:15] hoo: do you know specifically what the autoloader issue is? 
[18:22:38] You have classes that aren't in there [18:22:54] probably some helper classes you only use in the files they're defined in [18:23:03] so nobody bothered to add them to the autoloader before [18:23:30] yetch [18:23:35] (03CR) 10Alexandros Kosiaris: [C: 032] Adjust the check_http command for apertium monitoring [puppet] - 10https://gerrit.wikimedia.org/r/183278 (owner: 10Alexandros Kosiaris) [18:32:34] 3operations: Multiple entries exists for each matrix with minor change in matrix naming - https://phabricator.wikimedia.org/T86034#960211 (1001tonythomas) 3NEW [18:32:52] Nemo_bis: reported https://phabricator.wikimedia.org/T86034 [18:34:38] (03PS1) 10Tpt: Enable "Other projects sidebar" by default on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183288 (https://phabricator.wikimedia.org/T85971) [18:41:31] 3operations: Multiple entries exists for each matrix with minor change in matrix naming - https://phabricator.wikimedia.org/T86034#960246 (10chasemp) @faidon, can you speak to this? You are probably the most qualified to debunk or outline the problem. [18:42:48] !log reboot ms-be2011, megacli in a funny state and unable to bring new drive in service [18:42:50] Logged the message, Master [18:45:06] PROBLEM - Varnishkafka Delivery Errors per minute on cp3003 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [30.0] [18:48:31] <^d> godog: I finished draining lsearchd pools yesterday, ticket's in your hands now for Total Recall :) [18:49:07] ^d: wohoo \o/ will update the ticket tomorrow with a more detailed plan [18:49:19] <^d> okie dokie [18:49:46] (03PS1) 10Reedy: Add symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183293 [18:49:48] (03PS1) 10Reedy: testwiki to 1.25wmf14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183294 [18:49:50] (03PS1) 10Reedy: wikipedias to 1.25wmf13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183295 [18:49:52] (03PS1) 10Reedy: group0 to 1.25wmf14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183296 [18:50:01] godog: step 1) buy a hammer [18:50:06] step 2) book transport to EQIAD [18:50:49] 3operations, MediaWiki-Core-Team, MediaWiki-ResourceLoader: Bad cache stuck due to race condition with scap between different web servers - https://phabricator.wikimedia.org/T47877#960322 (10chasemp) So we think this is inherent to the design of our deploy mechanisms? @Joe, is there a ticket somewhere for revamp... [18:51:28] (03CR) 10Reedy: [C: 032] Add symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183293 (owner: 10Reedy) [18:51:31] <^d> Reedy: Those servers actually are worth reusing :p [18:51:32] (03Merged) 10jenkins-bot: Add symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183293 (owner: 10Reedy) [18:51:36] <^d> They have a crapton of ram.
[18:52:06] RECOVERY - Varnishkafka Delivery Errors per minute on cp3003 is OK: OK: Less than 1.00% above the threshold [0.0] [18:52:17] I guess in some of the newer machines it's easily reusable [18:53:09] Reedy: heheh [18:53:11] (03Abandoned) 10Reedy: Apache config for foundationwiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147484 (owner: 10Reedy) [18:53:18] (03Abandoned) 10Reedy: Apache config for Wikimania wikis using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147483 (owner: 10Reedy) [18:53:20] (03Abandoned) 10Reedy: Apache config for metawiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147481 (owner: 10Reedy) [18:53:36] 3operations, Scrum-of-Scrums, Zero: HTTPS-to-HTTP downgrade option interstitial - https://phabricator.wikimedia.org/T76626 (10dr0ptp4kt) a:5dr0ptp4kt>3Yurik [18:54:06] (03Abandoned) 10Reedy: Apache config for sourceswiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147480 (owner: 10Reedy) [18:54:08] (03Abandoned) 10Reedy: Apache config for commonswiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147479 (owner: 10Reedy) [18:54:11] (03Abandoned) 10Reedy: Apache config for grantswiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147478 (owner: 10Reedy) [18:54:14] (03Abandoned) 10Reedy: Apache config for fdcwiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147477 (owner: 10Reedy) [18:55:48] (03Abandoned) 10Reedy: Apache config for internalwiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147476 (owner: 10Reedy) [18:55:53] (03Abandoned) 10Reedy: Apache config for boardwiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147475 (owner: 10Reedy) [18:55:54] 3operations: improve cisco boxes raid monitoring - https://phabricator.wikimedia.org/T85529#960372 (10fgiunchedi) p:5Triage>3Normal [18:55:57] (03Abandoned) 10Reedy: Apache config for boardgovcomwiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147474 (owner: 10Reedy) [18:56:01] (03Abandoned) 10Reedy: Apache config for spcomwiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147473 (owner: 10Reedy) [18:57:12] (03Abandoned) 10Reedy: Apache config for chapcomwiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147472 (owner: 10Reedy) [18:57:15] (03Abandoned) 10Reedy: Apache config for incubatorwiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147471 (owner: 10Reedy) [18:57:19] (03Abandoned) 10Reedy: Apache config for specieswiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147470 (owner: 10Reedy) [18:57:22] (03Abandoned) 10Reedy: Apache config for searchcomwiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147469 (owner: 10Reedy) [18:57:25] (03Abandoned) 10Reedy: Apache config for usabilitywiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147467 (owner: 10Reedy) [18:58:11] (03Abandoned) 10Reedy: Apache config for strategywiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147466 (owner: 10Reedy) [18:58:16] (03Abandoned) 10Reedy: Apache config for officewiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147465 (owner: 10Reedy) [18:58:18] (03Abandoned) 10Reedy: Apache config for chairwiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147464 (owner: 10Reedy) [18:58:22] (03Abandoned) 10Reedy: Apache config for advisorywiki using mod_proxy_fcgi [puppet] - 
10https://gerrit.wikimedia.org/r/147463 (owner: 10Reedy) [18:58:25] (03Abandoned) 10Reedy: Apache config for auditcomwiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147462 (owner: 10Reedy) [18:58:28] (03Abandoned) 10Reedy: Apache config for qualitywiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147461 (owner: 10Reedy) [18:59:04] (03CR) 10Ori.livneh: [C: 031] "Settings look sane." [puppet] - 10https://gerrit.wikimedia.org/r/178806 (https://phabricator.wikimedia.org/T75356) (owner: 10Hashar) [18:59:57] _joe_: will you have a chance to look at https://gerrit.wikimedia.org/r/#/c/183171/ again? [19:00:04] Reedy, greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150107T1900). Please do the needful. [19:00:18] (03Abandoned) 10Reedy: Apache config for otrswiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147460 (owner: 10Reedy) [19:00:20] (03Abandoned) 10Reedy: Apache config for collabwiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147459 (owner: 10Reedy) [19:00:23] (03Abandoned) 10Reedy: Apache config for outreachwiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147458 (owner: 10Reedy) [19:00:28] (03Abandoned) 10Reedy: Apache config for movementroleswiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147457 (owner: 10Reedy) [19:00:31] (03Abandoned) 10Reedy: Apache config for checkuserwiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147456 (owner: 10Reedy) [19:01:25] i never knew there were so many wiki [19:01:42] 888 all.dblist [19:02:28] (03Abandoned) 10Reedy: Apache config for stewardwiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147455 (owner: 10Reedy) [19:03:04] (03Abandoned) 10Reedy: Apache config for ombudsmenwiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147454 (owner: 10Reedy) [19:03:07] (03Abandoned) 10Reedy: Apache config for wikimedia chapters using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147453 (owner: 10Reedy) [19:03:10] (03Abandoned) 10Reedy: Apache config for loginwiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147451 (owner: 10Reedy) [19:03:14] (03Abandoned) 10Reedy: Apache config for legalteamwiki sing mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147450 (owner: 10Reedy) [19:03:15] (03Abandoned) 10Reedy: Apache config for zerowiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147449 (owner: 10Reedy) [19:03:18] (03Abandoned) 10Reedy: Apache config for transitionteamwiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147448 (owner: 10Reedy) [19:04:13] (03Abandoned) 10Reedy: Apache config for iegcomwiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147446 (owner: 10Reedy) [19:04:16] (03Abandoned) 10Reedy: Apache config for Wikiversity using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147445 (owner: 10Reedy) [19:04:18] (03Abandoned) 10Reedy: Apache config for Wikinews using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147444 (owner: 10Reedy) [19:04:21] (03Abandoned) 10Reedy: Apache config for Wikisource using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147443 (owner: 10Reedy) [19:04:24] (03Abandoned) 10Reedy: Apache config for Wikibooks using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147442 (owner: 10Reedy) [19:04:26] (03Abandoned) 10Reedy: Apache config for 
Wikiquote using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147440 (owner: 10Reedy) [19:04:56] (03Abandoned) 10Reedy: Apache config for Wiktionary using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147439 (owner: 10Reedy) [19:05:17] (03Abandoned) 10Reedy: Apache config for mediawikiwiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147438 (owner: 10Reedy) [19:05:20] (03Abandoned) 10Reedy: Apache config for wikidatawiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147436 (owner: 10Reedy) [19:05:23] (03Abandoned) 10Reedy: Apache config for donatewiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147435 (owner: 10Reedy) [19:05:27] (03Abandoned) 10Reedy: Apache config for votewiki using mod_proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/147428 (owner: 10Reedy) [19:06:15] (03CR) 10Reedy: [C: 032] testwiki to 1.25wmf14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183294 (owner: 10Reedy) [19:06:19] (03Merged) 10jenkins-bot: testwiki to 1.25wmf14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183294 (owner: 10Reedy) [19:07:43] jgage: it's wikis all the way down [19:07:58] !log reedy Started scap: testwiki to 1.25wmf14... [19:08:01] Logged the message, Master [19:08:35] weird, my irc client doesn't believe jgage is in here [19:10:34] (03CR) 10Smalyshev: [C: 031] "Sounds good!" [puppet] - 10https://gerrit.wikimedia.org/r/181612 (owner: 10Manybubbles) [19:13:52] 3operations, Labs-Team: Rename specific account in LDAP, Wikitech, Gerrit and Phabricator - https://phabricator.wikimedia.org/T85913#960479 (10chasemp) >>! In T85913#958900, @adrianheine wrote: > Thanks folks. Unfortunately, the user name is quite visible in gerrit (where I spend most of my time), and I can't ch... [19:18:28] PROBLEM - Varnishkafka Delivery Errors per minute on cp3004 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [30.0] [19:20:38] PROBLEM - Varnishkafka Delivery Errors per minute on cp3015 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [30.0] [19:21:34] hoo: following up on the issue AndyRussG was asking you about, I've made another patch. How can I make testextension-zend run again on the new patch? [19:21:41] https://gerrit.wikimedia.org/r/#/c/173452/ [19:22:35] ragesoss: just put "recheck" as a comment (no quotes) [19:22:56] thanks much Reedy [19:24:08] PROBLEM - Varnishkafka Delivery Errors per minute on cp3022 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [30.0] [19:24:18] PROBLEM - Varnishkafka Delivery Errors per minute on cp3008 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [30.0] [19:26:57] RECOVERY - Varnishkafka Delivery Errors per minute on cp3004 is OK: OK: Less than 1.00% above the threshold [0.0] [19:27:47] RECOVERY - Varnishkafka Delivery Errors per minute on cp3015 is OK: OK: Less than 1.00% above the threshold [0.0] [19:31:28] RECOVERY - Varnishkafka Delivery Errors per minute on cp3008 is OK: OK: Less than 1.00% above the threshold [0.0] [19:36:31] !log reedy Finished scap: testwiki to 1.25wmf14...
(duration: 28m 32s) [19:36:34] Logged the message, Master [19:39:41] (03CR) 10Reedy: [C: 032] wikipedias to 1.25wmf13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183295 (owner: 10Reedy) [19:39:45] (03Merged) 10jenkins-bot: wikipedias to 1.25wmf13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183295 (owner: 10Reedy) [19:40:48] RECOVERY - Varnishkafka Delivery Errors per minute on cp3022 is OK: OK: Less than 1.00% above the threshold [0.0] [19:42:21] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.25wmf13 [19:42:26] Logged the message, Master [19:43:49] (03CR) 10Reedy: [C: 032] group0 to 1.25wmf14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183296 (owner: 10Reedy) [19:43:54] (03Merged) 10jenkins-bot: group0 to 1.25wmf14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183296 (owner: 10Reedy) [19:44:26] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.25wmf14 [19:44:29] Logged the message, Master [19:49:14] !log reedy Synchronized wmf-config/InitialiseSettings.php: Enabling error log for a few minutes (duration: 00m 15s) [19:49:25] Logged the message, Master [19:51:17] !log reedy Synchronized wmf-config/InitialiseSettings.php: And disable error log again (duration: 00m 06s) [19:51:21] Logged the message, Master [19:52:30] Reedy: that "recheck" doesn't seem to have worked: https://gerrit.wikimedia.org/r/#/c/173452/ [19:53:15] Rebased it.. [19:53:15] https://integration.wikimedia.org/ci/job/mwext-EducationProgram-testextension-zend/8/ [19:53:20] seems to be going now [19:56:26] 3Graphite, ops-core, Project-Creators, operations, ops-requests, Wikimedia-SSL-related: Project Proposal: Label style projects for common operations tools - https://phabricator.wikimedia.org/T1147#960608 (10Aklapper) +1 to what Chase wrote. Projects with Yellow color and the tag symbol imply that a second compo... [19:56:54] thanks Reedy. Finally passed! I take it the train has departed, though? [19:57:53] Yup [19:58:00] Still my window though [19:58:05] So.. [20:02:50] Reedy: getting that through would be super helpful; it'll unblock a nice set of features for my new project. [20:03:38] PROBLEM - Varnishkafka Delivery Errors per minute on cp3020 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [30.0] [20:14:08] RECOVERY - Varnishkafka Delivery Errors per minute on cp3020 is OK: OK: Less than 1.00% above the threshold [0.0] [20:14:38] PROBLEM - Varnishkafka Delivery Errors per minute on cp3019 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [30.0] [20:19:28] PROBLEM - puppet last run on db2010 is CRITICAL: CRITICAL: puppet fail [20:22:38] PROBLEM - Varnishkafka Delivery Errors per minute on cp3020 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [30.0] [20:23:08] RECOVERY - Varnishkafka Delivery Errors per minute on cp3019 is OK: OK: Less than 1.00% above the threshold [0.0] [20:27:07] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1033.266724 [20:30:17] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [20:36:36] bblack: think you might have a chance to review https://gerrit.wikimedia.org/r/#/c/183171/ ? _joe_'s review was "I like the concept and the implementation, just a couple of syntax errors and one minor concern to address for me. " (I have since addressed them). Tim thought it was a good idea. 
[20:37:08] PROBLEM - Varnishkafka Delivery Errors per minute on cp3021 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [30.0] [20:37:17] RECOVERY - puppet last run on db2010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:38:32] cool! graphite checks work [20:38:35] Coren: are you there? [20:38:52] What's up? [20:39:10] I've heard that you can restart Magnus's autolist tool? [20:40:07] I can forcibly restart a tool if you know which one; though I don't think I can powercycle Magnus himself. :-) [20:41:00] http://tools.wmflabs.org/autolist/ looks pretty down for me. Magnus is very hard to reach, normally I ask YuviPanda|zzz but he's away as you can see. [20:41:18] PROBLEM - Varnishkafka Delivery Errors per minute on cp3016 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [30.0] [20:41:53] sjoerddebruin: It was indeed down. I've kicked it. [20:42:16] Thank you very much. Will remember your name for the next time this happens. :) [20:42:32] Risks of being reactive. [20:43:08] RECOVERY - Varnishkafka Delivery Errors per minute on cp3021 is OK: OK: Less than 1.00% above the threshold [0.0] [20:43:14] 3Analytics, operations: Fix Varnishkafka delivery error icinga warning - https://phabricator.wikimedia.org/T76342#960714 (10Ottomata) I turned on a check_graphite check for this today. I'm going to let this be for a few days, and then hopefully turn off the check_ganglia ones next week. [20:44:36] ori: I'm looking, I need to sort through the hit_for_pass thing a bit... [20:45:11] what do you mean? [20:46:04] well, hit_for_pass caches so that similar requests hit as passes, which is a perf/scaling win if the request is always going to be a pass [20:46:28] but if the cache doesn't vary on the X-WM-Debug thing, then you'd be caching that hit-for-pass for everyone, I think [20:46:43] I'm still digging through docs and working my way through the VCL logic, I could be wrong [20:48:25] (but if I'm right, I think in current form some X-WM-Debug requests would cause all future non-debug requests for the same URL to bypass cache for the prod backends as well for some period of time) [20:48:38] PROBLEM - Varnishkafka Delivery Errors per minute on cp3022 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [30.0] [20:48:48] RECOVERY - Varnishkafka Delivery Errors per minute on cp3020 is OK: OK: Less than 1.00% above the threshold [0.0] [20:48:53] right, that'd be bad. maybe (per ) we should return (deliver) instead? [20:49:37] RECOVERY - Varnishkafka Delivery Errors per minute on cp3016 is OK: OK: Less than 1.00% above the threshold [0.0] [20:50:08] PROBLEM - Varnishkafka Delivery Errors per minute on cp3007 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [30.0] [20:50:48] probably we should return pass everywhere that hit-for-pass is being used for test. / X-WM-Debug, and set req.hash_ignore_busy as well so that non-debug reqs in parallel for the same URL aren't held up waiting to see that pass happen or whatever. [20:51:04] let me read up some more and I'll get some more definitive answers on all of this and put it in gerrit [20:51:36] bblack: so, i *shouldn't* spam you with a half-dozen untested, speculative commits with minor variations?
:) [20:52:05] anyhow, thanks very much [20:54:48] <_joe_> ori: not tonight sorry :) [20:55:04] i roped brandon into it [20:55:12] so it's ok [20:59:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp3022 is OK: OK: Less than 1.00% above the threshold [0.0] [21:00:05] gwicke, cscott, arlolra, subbu: Respected human, time to deploy Parsoid/OCG (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150107T2100). Please do the needful. [21:00:36] 3Scrum-of-Scrums, RESTBase, operations: RESTBase production hardware - https://phabricator.wikimedia.org/T76986#960768 (10RobH) The order for the hardware has been placed today via RT#9049. The procurement tickets will be the last thing we migrate from RT to Phab, so there was some disconnect in keeping this pr... [21:01:12] I'll got on it jouncebot. [21:01:14] *get [21:01:58] RECOVERY - Varnishkafka Delivery Errors per minute on cp3007 is OK: OK: Less than 1.00% above the threshold [0.0] [21:03:34] 3Project-Creators, operations: HTTPS phabricator project(s) - https://phabricator.wikimedia.org/T86063#960774 (10faidon) 3NEW [21:04:09] PROBLEM - Varnishkafka Delivery Errors per minute on cp3020 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [30.0] [21:05:17] 3Graphite, Project-Creators, ops-requests, ops-core, Wikimedia-SSL-related, operations: Project Proposal: Label style projects for common operations tools - https://phabricator.wikimedia.org/T1147#19908 (10faidon) Yes, please make Graphite a yellow-style tag. I'm fine with your suggestion regarding mail as well... [21:07:31] ori: actually, for this odd case (that it's due to a debug header, that we're not varying on and don't want to), it may be best to return vcl_pipe in recv? that will just open a channel to the backend (through all the layers) and ignore all cache-like things. [21:07:53] otherwise at one layer or another of the VCL, we run into problems with hit-for-pass, or caching the pass, etc, vs non-debug reqs that aren't differentiated there [21:08:09] it was ok for the test hostname because hostnames are already part of the URL and varied on [21:09:21] but the downside is that vcl_pipe is going to bypass a lot of other VCL logic, and thus your test requests won't get the same varnish behaviors as normal ones wrt to various header/url/hostname -munging, etc [21:11:03] maybe there's a better answer if we put that header-check for debug into recv, fetch, and deliver, though... [21:11:31] !log deployed parsoid version 904fab9e [21:11:36] Logged the message, Master [21:13:24] 3Scrum-of-Scrums, RESTBase, operations: RESTBase production hardware - https://phabricator.wikimedia.org/T76986#960818 (10GWicke) @RobH, thank you! [21:14:38] RECOVERY - Varnishkafka Delivery Errors per minute on cp3020 is OK: OK: Less than 1.00% above the threshold [0.0] [21:15:11] bblack: so, couple of things: one, i think you are right about hit_for_pass. https://www.varnish-cache.org/docs/3.0/tutorial/vcl.html is pretty explicit: "Unlike pass, hit_for_pass will create a hitforpass object in the cache. This has the side-effect of caching the decision not to cache." 
[21:15:54] two, it appears you can return (pipe) from vcl_fetch [21:16:24] so that seems like it would be the best of both worlds: requests would get filtered through the usual vcl logic, but bypass the cache [21:17:17] yeah maybe [21:17:53] i hate you wikibugs, but yuvi won that argument so i eat crow ;D [21:17:58] parts of the docs make it seem like return (pass) from vcl_recv would do it too, but other parts seem to indicate that this will still result in hit_for_pass behavior down in vcl_fetch (which again is a problem because we're not varying on the header) [21:18:25] we could vary on the header as well, to be safe [21:18:41] I don't see anywhere in the flowchart for returning pipe from vcl_fetch, only from vcl_recv [21:19:21] errrr, yes, vcl_recv [21:19:52] the problem with that is we do have other logic in later vcl_* that the pipe behavior would bypass, which would lead to your tests not resembling non-test queries in important ways [21:20:12] (I think) [21:20:40] just varying on the debug header and using hit_for_pass is probably the cleanest and safest option [21:20:50] so maybe vary + hit_for_pass is the best option? it was part of the HHVM pool patch that i based my patch on -- i just took it out because i didn't think it was relevant [21:20:50] yeah [21:20:54] (and using it consistently for both the test header and hostname) [21:21:18] should i update the patch, then? [21:21:33] since the debug reqs are presumably low volume, the extra hit-for-pass objects shouldn't affect the overall cache [21:21:38] PROBLEM - Varnishkafka Delivery Errors per minute on cp3022 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [30.0] [21:21:38] * ori nods [21:21:41] yeah go for it [21:22:02] cool [21:25:28] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.137:9200/_cluster/health error while fetching: Request timed out. [21:26:28] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.138:9200/_cluster/health error while fetching: Request timed out. [21:26:28] PROBLEM - Varnishkafka Delivery Errors per minute on cp3020 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [30.0] [21:27:08] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.136:9200/_cluster/health error while fetching: Request timed out.
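To make the outcome of the hit_for_pass exchange above concrete, here is a minimal, illustrative VCL 3.x sketch of the "vary + hit_for_pass" approach that was agreed on (return (pipe) was set aside because it skips the later VCL logic, so test requests would stop resembling normal ones). The X-Wikimedia-Debug header and the test_wikipedia backend name come from the patch under discussion; the backend declaration, the 120s TTL, and the surrounding subroutines are assumptions for the example, not the merged change.

    sub vcl_recv {
        if (req.http.X-Wikimedia-Debug == "1") {
            # Send debug traffic to the test backend (assumed to be declared
            # elsewhere in the VCL templates).
            set req.backend = test_wikipedia;
        }
    }

    sub vcl_fetch {
        if (req.http.X-Wikimedia-Debug == "1") {
            # Without this Vary, the hit_for_pass marker created below would sit
            # under the normal cache key and make non-debug requests for the same
            # URL bypass the cache too, for the length of the marker's TTL.
            if (beresp.http.Vary) {
                set beresp.http.Vary = beresp.http.Vary + ", X-Wikimedia-Debug";
            } else {
                set beresp.http.Vary = "X-Wikimedia-Debug";
            }
            # Cache only the "pass next time" decision for debug lookups.
            set beresp.ttl = 120s;
            return (hit_for_pass);
        }
    }

Since debug requests should be low volume, the extra hit-for-pass objects this creates cost little compared with accidentally marking hot URLs as uncacheable for all traffic.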
[21:44:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp3020 is OK: OK: Less than 1.00% above the threshold [0.0] [21:46:39] 3MediaWiki-General-or-Unknown, wikidata-query-service, Services, operations, Wikidata: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#960996 (10GWicke) [21:49:30] (03PS1) 10Ottomata: Include base puppetization on new nodes analytics1001 and analytics1002 [puppet] - 10https://gerrit.wikimedia.org/r/183373 [21:51:07] (03PS2) 10Ottomata: Include base puppetization on new nodes analytics1001 and analytics1002 [puppet] - 10https://gerrit.wikimedia.org/r/183373 [21:53:28] PROBLEM - Varnishkafka Delivery Errors per minute on cp3022 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [30.0] [21:54:30] (03CR) 10Ottomata: [C: 032] Include base puppetization on new nodes analytics1001 and analytics1002 [puppet] - 10https://gerrit.wikimedia.org/r/183373 (owner: 10Ottomata) [21:57:19] PROBLEM - very high load average likely xfs on ms-be1011 is CRITICAL: CRITICAL - load average: 283.31, 158.84, 76.96 [21:58:32] 3Phabricator, operations: Create #site-incident tag and use it for incident reports - https://phabricator.wikimedia.org/T85889#956687 (10GWicke) @aklapper, lets use this issue to discuss the new proposal. I added a pointer in T929. [22:02:58] RECOVERY - Varnishkafka Delivery Errors per minute on cp3022 is OK: OK: Less than 1.00% above the threshold [0.0] [22:05:18] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [22:05:28] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [22:07:17] 3operations, Wikidata, MediaWiki-General-or-Unknown, Services, wikidata-query-service: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#961051 (10GWicke) [22:07:41] (03PS1) 10Hashar: Duplicate -qa notifcations to -releng [puppet] - 10https://gerrit.wikimedia.org/r/183382 (https://phabricator.wikimedia.org/T86053) [22:13:47] PROBLEM - Varnishkafka Delivery Errors per minute on cp3009 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [30.0] [22:14:40] (03PS1) 10RobH: decom zinc for reclaim [puppet] - 10https://gerrit.wikimedia.org/r/183383 [22:16:37] (03PS1) 10RobH: reclaim zinc to spares [dns] - 10https://gerrit.wikimedia.org/r/183384 [22:16:59] (03CR) 10RobH: [C: 032] decom zinc for reclaim [puppet] - 10https://gerrit.wikimedia.org/r/183383 (owner: 10RobH) [22:17:28] (03CR) 10RobH: [C: 032] reclaim zinc to spares [dns] - 10https://gerrit.wikimedia.org/r/183384 (owner: 10RobH) [22:17:53] ottomata: i just merged yer stuff [22:17:58] cuz i was merging mine [22:18:08] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [22:18:17] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [22:18:22] it appeared to all be additional info, not changing live info [22:18:24] so seemed ok [22:19:15] oh [22:19:24] which stuff? [22:19:25] wait [22:19:36] robh? [22:19:39] which commit didn't I merge? [22:19:59] the analytics1001 1002 oine? [22:20:00] one? [22:20:03] base puppetization? [22:20:10] RECOVERY - Varnishkafka Delivery Errors per minute on cp3009 is OK: OK: Less than 1.00% above the threshold [0.0] [22:20:13] oh! 
[22:20:15] thought i merged that [22:20:17] whoops, cool [22:23:39] ottomata: yep, sorry [22:23:45] i said that then went to phone call, heh [22:26:51] PROBLEM - Varnishkafka Delivery Errors per minute on cp3021 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [30.0] [22:31:10] PROBLEM - Varnishkafka Delivery Errors per minute on cp3020 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [30.0] [22:32:40] RECOVERY - Varnishkafka Delivery Errors per minute on cp3021 is OK: OK: Less than 1.00% above the threshold [0.0] [22:34:14] (03CR) 10Hashar: "Bryan could you confirm the logstash config sounds sane?" [puppet] - 10https://gerrit.wikimedia.org/r/183382 (https://phabricator.wikimedia.org/T86053) (owner: 10Hashar) [22:36:09] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 712659 msg (=400000 warning): ocg_render_job_queue 3941 msg (=3000 critical) [22:36:10] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 712664 msg (=400000 warning): ocg_render_job_queue 3924 msg (=3000 critical) [22:36:10] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 712669 msg (=400000 warning): ocg_render_job_queue 3888 msg (=3000 critical) [22:37:40] (03CR) 10BryanDavis: Duplicate -qa notifcations to -releng (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/183382 (https://phabricator.wikimedia.org/T86053) (owner: 10Hashar) [22:38:10] RECOVERY - Varnishkafka Delivery Errors per minute on cp3020 is OK: OK: Less than 1.00% above the threshold [0.0] [22:47:42] 3operations, Project-Creators: HTTPS phabricator project(s) - https://phabricator.wikimedia.org/T86063#961106 (10chasemp) p:5Triage>3Normal [22:49:38] 3operations, Wikimedia-General-or-Unknown, Graphite: Easy way to define alerts for ganglia data - https://phabricator.wikimedia.org/T59882#961108 (10chasemp) a:5chasemp>3None [22:57:57] 3operations, Wikimedia-SSL-related, Wikimedia-Git-or-Gerrit: Chrome warns about insecure certificate on gerrit.wikimedia.org - https://phabricator.wikimedia.org/T76562#961124 (10RobH) So we have two options (that I see): - Push gerrit behind misc-web-lb, so it uses the wildcard certifiate (globalsign) and doesn... [22:59:32] 3operations, MediaWiki-Core-Team: Come up with key performance indicators (KPIs) - https://phabricator.wikimedia.org/T784#961126 (10chasemp) p:5High>3Normal [23:02:24] (03PS5) 10Ori.livneh: varnish: Route requests with 'X-Wikimedia-Debug=1' to test_wikipedia backend [puppet] - 10https://gerrit.wikimedia.org/r/183171 [23:03:16] 3operations, Wikimedia-General-or-Unknown: Wikimedia sites frequently switching to read-only - https://phabricator.wikimedia.org/T85342#961127 (10chasemp) How can operations help here / is this really blocked on ops? [23:04:38] bblack: updated, and applied successfully in labs [23:05:54] 3operations, Wikimedia-General-or-Unknown: Wikimedia sites frequently switching to read-only - https://phabricator.wikimedia.org/T85342#961131 (10Se4598) [23:13:16] 3operations, Project-Creators: HTTPS phabricator project(s) - https://phabricator.wikimedia.org/T86063#961151 (10chasemp) I think my vote is to collapse the bugzilla project into one #Https label project since SSL does not assume any particular team this seems analogous to LDAP or Mail to me. 
[23:41:26] 3operations: Complete the use of HHVM over Zend PHP on the Wikimedia cluster - https://phabricator.wikimedia.org/T86081#961194 (10Jdforrester-WMF) 3NEW [23:44:00] 3operations: Complete the use of HHVM over Zend PHP on the Wikimedia cluster - https://phabricator.wikimedia.org/T86081#961207 (10Jdforrester-WMF) [23:44:56] 3operations: Complete the use of HHVM over Zend PHP on the Wikimedia cluster - https://phabricator.wikimedia.org/T86081#961194 (10Jdforrester-WMF) >>! In T75901#961097, @Legoktm wrote: > We're not at 100% HHVM yet. imagescalers, job runners, and other servers like terbium and tin. Do we need tickets for terbium... [23:46:19] 3operations, Wikimedia-SSL-related, Wikimedia-Git-or-Gerrit: Chrome warns about insecure certificate on gerrit.wikimedia.org - https://phabricator.wikimedia.org/T76562#961213 (10Chad) We tried putting it behind misc-web-lb but we had problems with the git protocol behaving through the proxy if memory serves. Th... [23:48:01] 3ops-eqiad, ops-codfw: ship blanking panels from eqiad to codfw - https://phabricator.wikimedia.org/T86082#961216 (10RobH) 3NEW a:3Christopher