[00:10:31] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000000.0]
[00:19:00] Operations, Traffic, Patch-For-Review: openssl-1.0.2f introduced minor bug with nginx - https://phabricator.wikimedia.org/T126616#2030023 (ori) p:Triage>Unbreak! /var/log/nginx/unified.error.log has not been rotated since July 1. Since Feb 8, it has been growing at a rate of ~550M a day. On sever...
[00:24:31] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[00:32:41] Operations, Patch-For-Review, Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2030059 (Peachey88) >>! In T123525#2026727, @Danny_B wrote: > NOTE: This [[ https://www.mediawiki.org/wiki/Phabricator/Project_management/Tracking_tasks | trackin...
[00:42:54] Operations: Broken logrotation for nginx and varnishkafka on cpXXXX - https://phabricator.wikimedia.org/T127025#2030067 (Volans) NEW
[00:47:00] Operations: Broken log rotation for nginx and varnishkafka on cpXXXX - https://phabricator.wikimedia.org/T127025#2030075 (Volans)
[00:49:00] Operations: Broken log rotation for nginx and varnishkafka on cpXXXX - https://phabricator.wikimedia.org/T127025#2030067 (Volans)
[01:13:38] Operations: Broken log rotation for nginx and varnishkafka on cpXXXX - https://phabricator.wikimedia.org/T127025#2030108 (Volans) Actually digging a bit we have a LOT of logrotate broken around, because by default seems that puppet create the files with owner:group = 998:998 if we don't force it in the puppet...
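The ownership problem Volans describes (logrotate snippets created as 998:998 instead of root:root) can be audited per host. The following is only a minimal sketch, assuming the snippets live in /etc/logrotate.d and that the expected owner is 0:0; the function name `find_misowned` is hypothetical and not part of puppet or logrotate.

```python
import os
from typing import List, Tuple


def find_misowned(directory: str,
                  expected_uid: int = 0,
                  expected_gid: int = 0) -> List[Tuple[str, int, int]]:
    """Return (path, uid, gid) for each regular file in `directory`
    whose owner or group differs from the expected pair."""
    mismatches = []
    for entry in sorted(os.scandir(directory), key=lambda e: e.name):
        # Only regular files are checked; symlinks and subdirectories are skipped.
        if not entry.is_file(follow_symlinks=False):
            continue
        st = entry.stat(follow_symlinks=False)
        if (st.st_uid, st.st_gid) != (expected_uid, expected_gid):
            mismatches.append((entry.path, st.st_uid, st.st_gid))
    return mismatches


if __name__ == "__main__":
    # Assumed location of the per-service snippets; adjust if the host differs.
    target = "/etc/logrotate.d"
    if os.path.isdir(target):
        for path, uid, gid in find_misowned(target):
            print(f"{path}: owner {uid}:{gid}, expected 0:0")
```

Run on a single host, this prints one line per snippet that is not owned by root:root. Whether logrotate actually skips such a file depends on the logrotate version installed, so a hit is a reason to inspect the host, not proof that rotation is broken.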
[01:19:38] Operations: Broken log rotation for many services (was nginx and varnishkafka on cpXXXX) - https://phabricator.wikimedia.org/T127025#2030113 (Volans)
[01:27:01] (CR) Anomie: Undeploy ApiSandbox extension [mediawiki-config] - https://gerrit.wikimedia.org/r/263000 (owner: Anomie)
[01:27:08] (PS3) Anomie: Undeploy ApiSandbox extension [mediawiki-config] - https://gerrit.wikimedia.org/r/263000
[01:27:50] (PS4) Anomie: Remove $wgMWOAuthGrantPermissions [mediawiki-config] - https://gerrit.wikimedia.org/r/264438
[01:29:40] Operations: Broken log rotation for many services (was nginx and varnishkafka on cpXXXX) - https://phabricator.wikimedia.org/T127025#2030126 (Volans) They need to be checked in each server groups to verify the issue, but from a quick ``` grep -rns -A10 "logrotate.d" * | view - ``` I think that this is the li...
[01:31:42] (CR) Anomie: New group/right/protection level for the English Wikipedia: establishededitor (?) (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/270660 (https://phabricator.wikimedia.org/T126607) (owner: Alex Monk)
[01:41:32] RECOVERY - Disk space on cp3040 is OK: DISK OK
[01:46:55] Operations, Traffic, Patch-For-Review: openssl-1.0.2f introduced minor bug with nginx - https://phabricator.wikimedia.org/T126616#2030137 (BBlack) I've fixed the ones that were close (above 79% full on rootfs) for now. I'll keep an eye and we can fix it right tomorrow morning...
[01:51:08] Operations: Broken log rotation for many services (was nginx and varnishkafka on cpXXXX) - https://phabricator.wikimedia.org/T127025#2030139 (Volans) Reference to https://phabricator.wikimedia.org/T126616 for the SSL logging issue that is filling cpXXXX quickly. In general we need to ensure that there is eno...
[02:17:02] PROBLEM - puppet last run on mw2109 is CRITICAL: CRITICAL: puppet fail
[02:25:56] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.13) (duration: 12m 28s)
[02:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:34:31] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Feb 16 02:34:30 UTC 2016 (duration 8m 34s)
[02:34:32] Blocked-on-Operations, Operations, RESTBase, RESTBase-Cassandra, Patch-For-Review: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#2030187 (GWicke) With recent hardware expansions in eqiad and codfw, we need to expand the RAID-0 on e...
[02:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:39:21] PROBLEM - puppet last run on elastic1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:45:20] RECOVERY - puppet last run on mw2109 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[03:04:02] RECOVERY - puppet last run on elastic1007 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[03:15:32] (PS1) Andrew Bogott: Added config files for Openstack Liberty [puppet] - https://gerrit.wikimedia.org/r/270891
[03:22:05] Operations, RESTBase-Cassandra: Efficacy of DateTieredCompactionStrategy - https://phabricator.wikimedia.org/T126221#2030207 (GWicke) > No one should be finding themselves surprised (for example) about the size of SSTables, or wondering how big those are going to get. As a basic rule, both DTCS and STCS...
[03:53:48] Operations, Labs: Manual creation of labs account - https://phabricator.wikimedia.org/T125172#2030209 (Cobi) Any update on this?
[04:00:21] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [24.0]
[04:18:31] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: puppet fail
[04:32:41] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: puppet fail
[04:45:10] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[04:53:11] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[04:59:02] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[05:07:31] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0]
[05:25:02] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0]
[05:28:40] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[05:40:35] Operations, audits-data-retention: Broken log rotation for many services (was nginx and varnishkafka on cpXXXX) - https://phabricator.wikimedia.org/T127025#2030241 (ArielGlenn)
[06:31:30] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:12] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:21] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:35:11] PROBLEM - puppet last run on mw2198 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:55:03] Operations, audits-data-retention: Broken log rotation for many services (was nginx and varnishkafka on cpXXXX) - https://phabricator.wikimedia.org/T127025#2030313 (Joe) I just checked pybal which was the most immediate risk: all files are small enough we're not at risk. I think this is a classic case whe...
[06:57:42] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:52] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [06:58:31] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:42] RECOVERY - puppet last run on mw2198 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:21:37] (03PS1) 10Giuseppe Lavagetto: icinga: Add Riccardo to the appropriate contact groups [puppet] - 10https://gerrit.wikimedia.org/r/270894 (https://phabricator.wikimedia.org/T126431) [07:23:07] (03PS2) 10Giuseppe Lavagetto: icinga: Add Riccardo to the appropriate contact groups [puppet] - 10https://gerrit.wikimedia.org/r/270894 (https://phabricator.wikimedia.org/T126431) [07:23:38] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] icinga: Add Riccardo to the appropriate contact groups [puppet] - 10https://gerrit.wikimedia.org/r/270894 (https://phabricator.wikimedia.org/T126431) (owner: 10Giuseppe Lavagetto) [07:36:56] 6Operations, 7audits-data-retention: Broken log rotation for many services (was nginx and varnishkafka on cpXXXX) - https://phabricator.wikimedia.org/T127025#2030375 (10Joe) p:5Triage>3High a:3Joe [07:39:06] (03PS1) 10Giuseppe Lavagetto: logrotate: add centralized define, apply to pybal [puppet] - 10https://gerrit.wikimedia.org/r/270895 (https://phabricator.wikimedia.org/T127025) [08:20:52] (03PS1) 10Pmlineditor: Enable assignment of 'accountcreator' for maiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270897 (https://phabricator.wikimedia.org/T126950) [08:21:35] (03CR) 10Giuseppe Lavagetto: [C: 032] logrotate: add centralized define, apply to pybal [puppet] - 10https://gerrit.wikimedia.org/r/270895 (https://phabricator.wikimedia.org/T127025) (owner: 10Giuseppe Lavagetto) [08:26:43] 6Operations, 6Discovery, 10Wikimedia-Logstash, 
3Discovery-Search-Sprint, 7Elasticsearch: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697#2030410 (10Gehel) a:3Gehel [08:35:25] (03PS1) 10Giuseppe Lavagetto: logrotate: convert nginx to logrotate::conf [puppet] - 10https://gerrit.wikimedia.org/r/270898 (https://phabricator.wikimedia.org/T127025) [08:46:20] 6Operations, 6Discovery, 10Wikimedia-Logstash, 3Discovery-Search-Sprint, 7Elasticsearch: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697#2030447 (10Gehel) After talking with Ops, seems that we want to do some testing before uploading new elasticsearch version to our apt repo. L... [08:50:00] I'm looking into encrypting traffic to elasticsearch (https://phabricator.wikimedia.org/T124444). There was a discussion that a local nginx proxy doing SSL termination is NOT a good idea. Can anyone explain why ? [08:50:10] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2030460 (10ori) It may not be related to the re-imaging, but it is definitely related to memcached. There is a log bucket on fluorine, `/a/mw-log/memcached-keys`, w... [09:05:42] (03CR) 10Ema: [C: 031] logrotate: convert nginx to logrotate::conf [puppet] - 10https://gerrit.wikimedia.org/r/270898 (https://phabricator.wikimedia.org/T127025) (owner: 10Giuseppe Lavagetto) [09:08:57] <_joe_> gehel: I have no idea tbh [09:09:46] _joe_: might just be a urban legend and in fact everyone agrees that local nginx is fine? [09:10:09] should I also check with mark? 
[09:10:14] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 696
[09:10:28] <_joe_> I have no strong opinion; in general nginx is well able to manage our production traffic (all of it)
[09:10:48] <_joe_> so unless there are specific reasons for it to create problems with elasticsearch
[09:11:11] <_joe_> gehel: you should probably do some research and just propose a patch in case
[09:11:22] Operations, Traffic, Patch-For-Review: openssl-1.0.2f introduced minor bug with nginx - https://phabricator.wikimedia.org/T126616#2030486 (ema)
[09:11:23] <_joe_> also, open a phab task so that we can discuss the merits :)
[09:12:10] nginx seems to me the simple / obvious solution. Phab task already exist (https://phabricator.wikimedia.org/T124444) unless you want a subtask specific to nginx
[09:13:00] <_joe_> no that is enough for sure
[09:15:14] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 621
[09:18:40] Operations, CirrusSearch, Discovery, Elasticsearch, and 2 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2030489 (Joe) My personal take is that nginx is perfectly fine in general, given it is well able to handle our huge production traffic. It might...
[09:19:41] (03PS1) 10Giuseppe Lavagetto: logrotate: convert swift-proxy to logrotate::conf [puppet] - 10https://gerrit.wikimedia.org/r/270900 (https://phabricator.wikimedia.org/T127025) [09:19:43] (03PS1) 10Giuseppe Lavagetto: logrotate: convert phd to logrotate::conf [puppet] - 10https://gerrit.wikimedia.org/r/270901 (https://phabricator.wikimedia.org/T127025) [09:25:14] RECOVERY - check_mysql on lutetium is OK: Uptime: 3083906 Threads: 1 Questions: 23931236 Slow queries: 58015 Opens: 135055 Flush tables: 3 Open tables: 64 Queries per second avg: 7.760 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [09:27:35] (03PS2) 10Hoo man: Exclude Sauce Labs IP ranges from rate limits on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269944 (https://phabricator.wikimedia.org/T126585) (owner: 10Aude) [09:29:22] (03CR) 10Hoo man: [C: 032] Exclude Sauce Labs IP ranges from rate limits on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269944 (https://phabricator.wikimedia.org/T126585) (owner: 10Aude) [09:29:52] (03Merged) 10jenkins-bot: Exclude Sauce Labs IP ranges from rate limits on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269944 (https://phabricator.wikimedia.org/T126585) (owner: 10Aude) [09:30:00] (03PS1) 10Hashar: beta: update db script strip output on error [puppet] - 10https://gerrit.wikimedia.org/r/270902 (https://phabricator.wikimedia.org/T110407) [09:31:17] !log hoo@tin Synchronized wmf-config/InitialiseSettings-labs.php: (no message) (duration: 00m 58s) [09:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:40:56] (03CR) 10Volans: logrotate: convert swift-proxy to logrotate::conf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/270900 (https://phabricator.wikimedia.org/T127025) (owner: 10Giuseppe Lavagetto) [09:49:34] (03CR) 10Giuseppe Lavagetto: [C: 032] logrotate: convert nginx to logrotate::conf [puppet] - 10https://gerrit.wikimedia.org/r/270898 
(https://phabricator.wikimedia.org/T127025) (owner: 10Giuseppe Lavagetto) [09:52:37] !log will cut the wmf branches this afternoon starting around 14:00 CET [09:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:53:39] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2030555 (10elukey) Interesting addendum to what Ori posted: elukey@fluorine:/a/mw-log/archive$ zcat memcached-keys.log-20160209* | awk '{print $4}' | sort | uniq -c... [09:54:04] PROBLEM - puppet last run on db2004 is CRITICAL: CRITICAL: Puppet has 1 failures [09:56:21] (03CR) 10Giuseppe Lavagetto: logrotate: convert swift-proxy to logrotate::conf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/270900 (https://phabricator.wikimedia.org/T127025) (owner: 10Giuseppe Lavagetto) [10:00:04] kart_ akosiaris: Dear anthropoid, the time has come. Please deploy CXserver to Jessie/SCB (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160216T1000). [10:00:35] akosiaris: here :) [10:00:59] kart_: OK, I am uploading changes then [10:01:31] (03PS5) 10Ema: Maps VCL initial forward-port to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/269466 (https://phabricator.wikimedia.org/T124279) [10:02:07] akosiaris: we use cxserver/deploy and some changes are not deployed. Is that fine? [10:02:35] kart_: so, a) we need to have cxserver/deploy working on jessie/node 4.2 [10:02:43] 7Blocked-on-Operations, 10Beta-Cluster-Infrastructure, 6Discovery, 6Release-Engineering-Team, and 2 others: Beta: submodule update reverts new portals commits - https://phabricator.wikimedia.org/T126061#2030588 (10hashar) Thanks @ksmith and @debt. Looks that is a pragmatic compromise. In the first place I... [10:02:50] b) we will need to deploy only on scb100X [10:02:54] b) I will take care of [10:02:58] a) you need to make sure of [10:03:18] akosiaris: aure. 
[10:03:44] akosiaris: do I have access to scb100x?
[10:03:48] Let me check.
[10:04:56] I am uploading changes about this as we speak
[10:05:30] yes. I can login. Thanks.
[10:05:37] (CR) Giuseppe Lavagetto: [C: 2] "actually it seems that precise didn't refuse to use non-root-owned logrotate files, so this is actually a noop." [puppet] - https://gerrit.wikimedia.org/r/270900 (https://phabricator.wikimedia.org/T127025) (owner: Giuseppe Lavagetto)
[10:05:55] PROBLEM - puppet last run on mw1212 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:08:32] Operations, Editing-Department, Performance-Team, Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2030607 (hashar)
[10:09:40] Operations, Editing-Department, Performance-Team, Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2021279 (hashar) From 20160209 00:00 until 20160216 23:59: {F3363570 size=full} Note that @ori has asked a few hours ago about delaying the deployment train f...
[10:09:57] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2030612 (10hashar) [10:13:41] (03PS1) 10Alexandros Kosiaris: cxserver: deploy on SCB [puppet] - 10https://gerrit.wikimedia.org/r/270908 [10:14:17] !log disable puppet, stop salt-minion on sca100{1,2} [10:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:17:46] PROBLEM - salt-minion processes on sca1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [10:18:05] PROBLEM - salt-minion processes on sca1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [10:18:16] PROBLEM - salt-minion processes on scandium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [10:19:03] (03PS1) 10Hoo man: Add Capiunto to the extension list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270909 (https://phabricator.wikimedia.org/T126399) [10:19:05] (03PS1) 10Hoo man: Enable Capiunto on test, test2 and testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270910 (https://phabricator.wikimedia.org/T126399) [10:20:44] RECOVERY - puppet last run on db2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:20:45] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet has 1 failures [10:23:21] (03PS1) 10Alexandros Kosiaris: cxserver: Remove from SCA nodes in conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/270911 [10:23:23] (03PS1) 10Alexandros Kosiaris: cxserver: Remove from services conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/270912 [10:23:25] (03PS1) 10Alexandros Kosiaris: cxserver: Remove LVS IP from SCA [puppet] - 10https://gerrit.wikimedia.org/r/270913 [10:26:46] (03PS2) 10Alexandros Kosiaris: cxserver: deploy on SCB [puppet] - 10https://gerrit.wikimedia.org/r/270908 [10:26:54] 
(03CR) 10Alexandros Kosiaris: [C: 032 V: 032] cxserver: deploy on SCB [puppet] - 10https://gerrit.wikimedia.org/r/270908 (owner: 10Alexandros Kosiaris) [10:32:36] RECOVERY - puppet last run on mw1212 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:33:14] kart_: ok, can you deploy the cxserver/deploy repo ? [10:33:27] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2030671 (10hashar) [10:33:27] you should get an ok from 2/4 targets, that's expected [10:33:50] that is, sca1* will NOT be getting the new version, whereas, scb10* will be [10:34:18] akosiaris: ie using tin. Right? [10:34:27] usual deployment? [10:35:14] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 671 [10:35:24] kart_: yup [10:37:05] syncing [10:37:47] (03PS1) 10Volans: logrotate: convert icinga to logrotate::conf [puppet] - 10https://gerrit.wikimedia.org/r/270914 (https://phabricator.wikimedia.org/T127025) [10:38:14] akosiaris: sca1002.eqiad.wmnet: fetch status: 0 [started: 308 mins ago, last-return: 308 mins ago] [10:38:17] OK? [10:38:20] same for 1001 [10:38:40] Should I go ahead? [10:39:31] akosiaris: ^ [10:40:14] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 972 [10:40:47] yeah [10:40:51] cool [10:41:10] (03PS1) 10Muehlenhoff: Add ferm rule for eventlogging zmq forwarder service [puppet] - 10https://gerrit.wikimedia.org/r/270915 [10:41:21] akosiaris: finished sync. [10:42:04] ok [10:42:17] so scb1001 and 2 are now at Update cxserver to 5b9d909 [10:42:20] 6Operations, 6Discovery, 10Wikimedia-Logstash, 3Discovery-Search-Sprint, 7Elasticsearch: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697#2030681 (10Reedy) >>! 
In T122697#2030447, @Gehel wrote: > Which means that I'm not sure of how we want to upgrade Vagrant. We could wget the... [10:42:49] yes. cool. [10:43:04] let's see what icinga says [10:44:09] [10:44:09] cxserver endpoints health [10:44:10] OK 2016-02-16 10:42:37 0d 0h 1m 7s 1/3 All endpoints are healthy [10:44:11] cool [10:44:25] ok, then I am pooling scb100* and depooling sca100* gradually [10:45:10] RECOVERY - check_mysql on db1008 is OK: Uptime: 2401615 Threads: 1 Questions: 16635282 Slow queries: 16096 Opens: 5287 Flush tables: 2 Open tables: 401 Queries per second avg: 6.926 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:45:20] akosiaris: nice. [10:46:02] (03PS1) 10Filippo Giunchedi: swiftrepl: finish distutils setup [software] - 10https://gerrit.wikimedia.org/r/270916 [10:46:04] (03PS1) 10Filippo Giunchedi: swiftrepl: new debian version [software] - 10https://gerrit.wikimedia.org/r/270917 [10:46:42] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [10:46:45] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swiftrepl: finish distutils setup [software] - 10https://gerrit.wikimedia.org/r/270916 (owner: 10Filippo Giunchedi) [10:46:52] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swiftrepl: new debian version [software] - 10https://gerrit.wikimedia.org/r/270917 (owner: 10Filippo Giunchedi) [10:49:08] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2030703 (10hashar) Since I am lazy tweaking Graphite URL, I created a dashboard in Grafana. It shows the last 7 days of 75 percentiles with marks at 550 ms and 750 m... 
[10:50:25] !log start swiftrepl commons thumbs for top50 popular size T125791 [10:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:50:29] 6Operations, 7Availability, 5Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: swiftrepl replication pass for thumbnails eqiad -> codfw - https://phabricator.wikimedia.org/T125791#2030704 (10Stashbot) {nav icon=file, name=Mentioned in SAL, href=https://tools.wmflabs.org/sal/log/AVLps-BGhQaf1... [10:50:32] (03PS1) 10Volans: logrotate: convert l10nupdate to logrotate::conf [puppet] - 10https://gerrit.wikimedia.org/r/270918 (https://phabricator.wikimedia.org/T127025) [10:51:09] jynus: _joe_ : elukey: ori asked to delay the MediaWiki train today because of the 75p saving latency regression that occurred last week. https://phabricator.wikimedia.org/T126700 [10:51:18] doesnt seem to be db related [10:51:34] <_joe_> hashar: based on what? [10:51:40] I am myself fine to delay the train, might need some other thoughts as a reply to ori mail [10:51:46] (03PS2) 10Alexandros Kosiaris: cxserver: Remove from services conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/270912 [10:51:48] (03PS2) 10Alexandros Kosiaris: cxserver: Remove LVS IP from SCA [puppet] - 10https://gerrit.wikimedia.org/r/270913 [10:51:50] (03PS2) 10Alexandros Kosiaris: cxserver: Remove from SCA nodes in conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/270911 [10:51:52] (03PS1) 10Alexandros Kosiaris: LVS: move cxserver over to scb [puppet] - 10https://gerrit.wikimedia.org/r/270919 [10:51:56] (03CR) 10Tulsi Bhagat: [C: 031] Enable assignment of 'accountcreator' for maiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270897 (https://phabricator.wikimedia.org/T126950) (owner: 10Pmlineditor) [10:52:38] (03PS1) 10Muehlenhoff: Add ferm rule for eventlogging mediawiki exception/fatal relay [puppet] - 10https://gerrit.wikimedia.org/r/270920 (https://phabricator.wikimedia.org/T113343) [10:52:40] 
_joe_: I guess ori point is that the root cause is not known and pushing new code might further delay the investigation or even regress more [10:52:45] (though that can also fix it :D ) [10:52:55] <_joe_> hashar: I agree with ori's point [10:53:11] PROBLEM - RAID on labstore1001 is CRITICAL: CRITICAL: Active: 73, Working: 73, Failed: 1, Spare: 0 [10:54:08] (03PS2) 10Alexandros Kosiaris: LVS: move cxserver over to scb [puppet] - 10https://gerrit.wikimedia.org/r/270919 [10:54:25] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] LVS: move cxserver over to scb [puppet] - 10https://gerrit.wikimedia.org/r/270919 (owner: 10Alexandros Kosiaris) [10:55:01] hashar, everithing I have been commented about doesn't have anything to do with performance issues [10:55:07] *commenting [10:55:22] I have not touched that issue [10:56:03] godog: nice numbers, linked from https://www.mediawiki.org/wiki/Requests_for_comment/Standardized_thumbnails_sizes#Preferred_numbers :) [10:57:35] jynus: oh I only poked you because you did a comment on the task :-} feel free to ignore [10:57:47] _joe_: so to clarify, you agree we should delay the train? [10:58:11] Nemo_bis: hehe nice, thanks! you probably want to look at https://phabricator.wikimedia.org/T125791#2028898 as that's more realistic, I'm amending the ticket to point that out more explicitly [10:58:30] <_joe_> hashar: yes [10:59:00] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0] [11:00:05] hoo: Respected human, time to deploy Capiunto (testwiki test2wiki and testwikidata) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160216T1100). Please do the needful. [11:00:17] thanks, jouncebot :D [11:00:19] akosiaris: how are we doing? [11:00:54] hoo: me and akosiaris are finising cxserver deployment. but, it won't affect at all. 
[11:01:08] good to know [11:01:11] thanks [11:01:29] kart_: all LVS have been converted [11:01:32] we are done I think [11:01:39] it's just removing old cruft now [11:01:42] akosiaris: cool. [11:01:51] kart_: do some testing, but I think we are ok [11:01:59] !log installing nettle security updates [11:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:02:13] https://cxserver.wikimedia.org/v1 is OK [11:02:41] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [11:06:56] (03PS1) 10Volans: logrotate: Convert eventlogging-files to logrotate::conf [puppet] - 10https://gerrit.wikimedia.org/r/270923 (https://phabricator.wikimedia.org/T127025) [11:08:10] !log repool restbase on restbase1007 [11:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:11:10] (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/269980 (https://phabricator.wikimedia.org/T126574) (owner: 10Filippo Giunchedi) [11:11:22] (03CR) 10Aude: [C: 031] Add Capiunto to the extension list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270909 (https://phabricator.wikimedia.org/T126399) (owner: 10Hoo man) [11:11:43] (03CR) 10Aude: [C: 031] Enable Capiunto on test, test2 and testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270910 (https://phabricator.wikimedia.org/T126399) (owner: 10Hoo man) [11:17:49] kart_: I am gonna remove cxserver cruft from sca in 2-3 days. Just leaving it for now as a fallback [11:19:18] (03CR) 10Hoo man: [C: 032] "Will scap with the change that adds the submodule." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/270909 (https://phabricator.wikimedia.org/T126399) (owner: 10Hoo man) [11:19:45] (03Merged) 10jenkins-bot: Add Capiunto to the extension list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270909 (https://phabricator.wikimedia.org/T126399) (owner: 10Hoo man) [11:20:24] (03CR) 10Aude: Basic "Identifiers" statement section config for Wikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263046 (https://phabricator.wikimedia.org/T123112) (owner: 10Thiemo Mättig (WMDE)) [11:22:21] (03CR) 10Giuseppe Lavagetto: [C: 031] logrotate: Convert eventlogging-files to logrotate::conf [puppet] - 10https://gerrit.wikimedia.org/r/270923 (https://phabricator.wikimedia.org/T127025) (owner: 10Volans) [11:22:32] (03PS1) 10KartikMistry: CX: Remove the option to override the certificate of Yandex MT client [puppet] - 10https://gerrit.wikimedia.org/r/270927 [11:22:55] (03CR) 10Giuseppe Lavagetto: [C: 031] logrotate: convert l10nupdate to logrotate::conf [puppet] - 10https://gerrit.wikimedia.org/r/270918 (https://phabricator.wikimedia.org/T127025) (owner: 10Volans) [11:24:27] (03CR) 10KartikMistry: "As per, https://gerrit.wikimedia.org/r/#/c/270925/" [puppet] - 10https://gerrit.wikimedia.org/r/270927 (owner: 10KartikMistry) [11:25:11] !log hoo@tin Started scap: Deploy Capiunto (master) to testwiki, test2wiki and testwikidata - T126399 [11:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:25:23] (03CR) 10Giuseppe Lavagetto: [C: 031] "This is a noop give neon is still on precise AFAIR" [puppet] - 10https://gerrit.wikimedia.org/r/270914 (https://phabricator.wikimedia.org/T127025) (owner: 10Volans) [11:25:56] doh... 
that log message is wrong, I don't actually enable it with that scap [11:26:41] (03PS1) 10Volans: logrotate: Convert eventlogging to logrotate::conf [puppet] - 10https://gerrit.wikimedia.org/r/270928 (https://phabricator.wikimedia.org/T127025) [11:27:50] <_joe_> volans: uhm I guess you just did something wrong [11:28:01] <_joe_> this results as a new changeset [11:28:03] (03CR) 10Volans: "Yes, checked already on neon and the rotation is happening normally although the file's owner/group are wrong" [puppet] - 10https://gerrit.wikimedia.org/r/270914 (https://phabricator.wikimedia.org/T127025) (owner: 10Volans) [11:28:38] <_joe_> volans: I guess you didn't git review -s your puppet checkout [11:28:55] <_joe_> so the change-id is not added by a hook locally [11:29:07] <_joe_> and you re-reviewed the same change generating a new one [11:29:22] akosiaris: sorry, network issues. We're good. I will ask to deploy one more cxserver/puppet patch later today. [11:29:30] _joe_: are 2 different ones [11:29:35] eventlogging and eventlogging-files [11:29:56] <_joe_> oh ok, [11:29:57] are 2 different logrotate directive [11:30:00] <_joe_> sigh [11:30:06] :) [11:30:16] * mark hands over coffee [11:30:30] (03CR) 10Giuseppe Lavagetto: [C: 031] logrotate: Convert eventlogging to logrotate::conf [puppet] - 10https://gerrit.wikimedia.org/r/270928 (https://phabricator.wikimedia.org/T127025) (owner: 10Volans) [11:30:51] <_joe_> mark: hand me a new pair of glasses :P [11:31:07] lol [11:36:07] (03PS1) 10Volans: logrotate: Convert geoipupdate to logrotate::conf [puppet] - 10https://gerrit.wikimedia.org/r/270930 (https://phabricator.wikimedia.org/T127025) [11:36:45] (03CR) 10Aklapper: "GWicke / Ariel: "not ready for merge yet, a bug to be worked out" - if that is still true, feel encouraged to add a [WIP] prefix to the pa" [dumps/html/deploy] - 10https://gerrit.wikimedia.org/r/204964 (https://phabricator.wikimedia.org/T94457) (owner: 10GWicke) [11:38:49] 6Operations, 6Editing-Department, 
6Performance-Team, 7Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2030788 (10elukey) elukey@mc1009:~$ echo stats | nc 127.0.0.1 11211 ========== DEBIAN============ STAT pid 678 STAT uptime 427654 STAT time 1455621273 STAT version... [11:42:38] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2030793 (10Joe) @elukey since those stats are from the startup of the daemon, what is more interesing is the delta of values taken at a 1 hour distance [11:43:41] hashar: thanks for the dashboard! [11:46:55] !log upgrading pfw-codfw to newer JunOS [11:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:49:11] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2030800 (10elukey) @joe: definitely, I'll modify my post with deltas in one hour. [11:49:58] !log hoo@tin Finished scap: Deploy Capiunto (master) to testwiki, test2wiki and testwikidata - T126399 (duration: 24m 46s) [11:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:50:18] (03CR) 10Hoo man: [C: 032] Enable Capiunto on test, test2 and testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270910 (https://phabricator.wikimedia.org/T126399) (owner: 10Hoo man) [11:50:32] 6Operations, 5Patch-For-Review, 7audits-data-retention: Broken log rotation for many services (was nginx and varnishkafka on cpXXXX) - https://phabricator.wikimedia.org/T127025#2030802 (10Volans) Varnish ones to be fixed: ``` modules/varnishkafka/manifests/init.pp:36: file { '/etc/logrotate.d/varnishkaf... 
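On Joe's point in T126700 that memcached `stats` counters are cumulative since daemon start, so the useful figure is the delta between two snapshots taken about an hour apart: a minimal sketch of that subtraction, assuming the usual `STAT <name> <value>` output lines. The function names and the sample `evictions` and `version` values are hypothetical; only `pid 678` and `uptime 427654` come from the paste above.

```python
from typing import Dict


def parse_stats(raw: str) -> Dict[str, float]:
    """Parse `STAT <name> <value>` lines, keeping only numeric values."""
    stats: Dict[str, float] = {}
    for line in raw.splitlines():
        parts = line.split()
        if len(parts) != 3 or parts[0] != "STAT":
            continue  # skips the trailing END line and anything unexpected
        try:
            stats[parts[1]] = float(parts[2])
        except ValueError:
            continue  # non-numeric values such as the version string
    return stats


def stats_delta(before: Dict[str, float],
                after: Dict[str, float]) -> Dict[str, float]:
    """Return after - before for every counter present in both snapshots."""
    return {name: after[name] - before[name] for name in after if name in before}


if __name__ == "__main__":
    # Hypothetical snapshots one hour apart; only pid and the first uptime
    # are taken from the log, the remaining values are illustrative.
    first = "STAT pid 678\nSTAT uptime 427654\nSTAT version 1.4.21\nSTAT evictions 100\nEND\n"
    second = "STAT pid 678\nSTAT uptime 431254\nSTAT version 1.4.21\nSTAT evictions 250\nEND\n"
    for name, value in sorted(stats_delta(parse_stats(first), parse_stats(second)).items()):
        print(f"{name}: {value:+.0f}")
```

Gauges such as `pid` subtract to zero and can be ignored; only the monotonically increasing counters carry meaning as deltas.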
[11:50:48] (Merged) jenkins-bot: Enable Capiunto on test, test2 and testwikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/270910 (https://phabricator.wikimedia.org/T126399) (owner: Hoo man) [11:52:01] (PS2) Giuseppe Lavagetto: logrotate: convert phd to logrotate::conf [puppet] - https://gerrit.wikimedia.org/r/270901 (https://phabricator.wikimedia.org/T127025) [11:53:05] (CR) Giuseppe Lavagetto: [C: 2 V: 2] logrotate: convert phd to logrotate::conf [puppet] - https://gerrit.wikimedia.org/r/270901 (https://phabricator.wikimedia.org/T127025) (owner: Giuseppe Lavagetto) [11:54:04] !log hoo@tin Synchronized wmf-config/: rv (duration: 00m 58s) [11:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:55:07] (PS2) Hoo man: Replace the sidebar link to commons with the commons category [mediawiki-config] - https://gerrit.wikimedia.org/r/270701 (https://phabricator.wikimedia.org/T126960) [11:56:18] (PS1) Hoo man: Fix broken require_once for Capiunto [mediawiki-config] - https://gerrit.wikimedia.org/r/270933 [11:57:11] (CR) Hoo man: [C: 2] Fix broken require_once for Capiunto [mediawiki-config] - https://gerrit.wikimedia.org/r/270933 (owner: Hoo man) [11:57:41] (PS2) Volans: logrotate: convert icinga to logrotate::conf [puppet] - https://gerrit.wikimedia.org/r/270914 (https://phabricator.wikimedia.org/T127025) [11:57:43] (Merged) jenkins-bot: Fix broken require_once for Capiunto [mediawiki-config] - https://gerrit.wikimedia.org/r/270933 (owner: Hoo man) [11:59:12] !log hoo@tin Synchronized wmf-config/: Enable Capiunto on testwiki, test2wiki and testwikidata (T126399) (2nd try) (duration: 00m 58s) [11:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:01:02] !log rebooting pfw-codfw for upgrade [12:01:06] ^^^^^^^^^^^ [12:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:01:11] 
ignore alerts and possible pages [12:01:21] (CR) Volans: [C: 2] logrotate: convert icinga to logrotate::conf [puppet] - https://gerrit.wikimedia.org/r/270914 (https://phabricator.wikimedia.org/T127025) (owner: Volans) [12:01:22] ^^^^^^^^^^^ [12:05:01] PROBLEM - Host mintaka is DOWN: PING CRITICAL - Packet loss = 100% [12:05:41] RECOVERY - Disk space on ms-be1008 is OK: DISK OK [12:06:20] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0 xe-5/0/3: down - Core: pfw-codfw:xe-15/0/0 {#10901} [10Gbps DF] [12:07:12] PROBLEM - Host payments2002 is DOWN: PING CRITICAL - Packet loss = 100% [12:07:18] PROBLEM - Host alnitak is DOWN: PING CRITICAL - Packet loss = 100% [12:08:04] PROBLEM - Host pay-lvs2002 is DOWN: PING CRITICAL - Packet loss = 100% [12:09:05] PROBLEM - Host bellatrix is DOWN: PING CRITICAL - Packet loss = 100% [12:09:10] PROBLEM - Host betelgeuse is DOWN: PING CRITICAL - Packet loss = 100% [12:10:17] PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [12:15:08] PROBLEM - check_mysql on fdb2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [12:15:11] !log Restarted hhvm on mw1237 [12:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:15:17] PROBLEM - check_mysql on payments2001 is CRITICAL: Slave IO: Connecting Slave SQL: Yes Seconds Behind Master: (null) [12:17:40] looks like emails to phabricator smtp are not making it, looking [12:20:36] (PS1) Aude: Take "in other projects" sidebar out of beta [mediawiki-config] - https://gerrit.wikimedia.org/r/270939 (https://phabricator.wikimedia.org/T103102) [12:20:47] PROBLEM - Host pay-lvs2001 is DOWN: PING CRITICAL - Packet loss = 100% [12:21:38] PROBLEM - Host heka is DOWN: PING CRITICAL - Packet loss = 100% [12:21:59] PROBLEM - Router interfaces on cr1-codfw is 
CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0 xe-5/0/3: down - Core: pfw-codfw:xe-6/0/0 {#10900} [10Gbps DF] [12:23:08] PROBLEM - Host fdb2001 is DOWN: PING CRITICAL - Packet loss = 100% [12:23:14] PROBLEM - Host payments2003 is DOWN: PING CRITICAL - Packet loss = 100% [12:23:31] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [12:24:02] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [12:24:42] (PS2) Aude: Take "in other projects" sidebar out of beta [mediawiki-config] - https://gerrit.wikimedia.org/r/270939 (https://phabricator.wikimedia.org/T103102) [12:25:10] RECOVERY - check_mysql on fdb2001 is OK: Uptime: 1199383 Threads: 1 Questions: 10158672 Slow queries: 5334 Opens: 1618 Flush tables: 2 Open tables: 417 Queries per second avg: 8.469 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [12:25:10] RECOVERY - check_mysql on payments2001 is OK: Uptime: 1250889 Threads: 3 Questions: 795756 Slow queries: 11 Opens: 146 Flush tables: 1 Open tables: 64 Queries per second avg: 0.636 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [12:25:20] RECOVERY - Host heka is UP: PING OK - Packet loss = 0%, RTA = 36.82 ms [12:25:27] RECOVERY - Host pay-lvs2001 is UP: PING OK - Packet loss = 0%, RTA = 37.32 ms [12:25:33] RECOVERY - Host betelgeuse is UP: PING OK - Packet loss = 0%, RTA = 36.46 ms [12:25:39] RECOVERY - Host bellatrix is UP: PING OK - Packet loss = 0%, RTA = 36.47 ms [12:25:45] RECOVERY - Host payments2003 is UP: PING OK - Packet loss = 0%, RTA = 42.87 ms [12:25:52] RECOVERY - Host fdb2001 is UP: PING OK - Packet loss = 0%, RTA = 36.54 ms [12:25:58] RECOVERY - Host payments2002 is UP: PING OK - Packet loss = 0%, RTA = 36.45 ms [12:26:16] RECOVERY - Host mintaka is UP: PING OK - Packet loss = 0%, RTA = 36.76 ms 
[12:26:23] RECOVERY - Host alnitak is UP: PING OK - Packet loss = 0%, RTA = 38.07 ms [12:27:18] RECOVERY - Host pay-lvs2002 is UP: PING OK - Packet loss = 0%, RTA = 37.93 ms [12:28:39] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: touch (duration: 00m 58s) [12:28:39] Operations, Phabricator: iridium / phabricator not accepting email via smtp - https://phabricator.wikimedia.org/T127053#2030881 (fgiunchedi) NEW [12:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:29:09] twentyafterfour: ^ [12:30:10] PROBLEM - check_puppetrun on alnitak is CRITICAL: CRITICAL: Puppet has 2 failures [12:31:08] godog: puppet is still disabled because of the phab deployment problem [12:32:01] I enabled it briefly yesterday, ran it and disabled it again, so I'll look at the exim issue [12:33:28] apergos: ack, thanks! [12:34:10] Touching the file actually solved the problem... that's kinda scary [12:35:08] PROBLEM - check_puppetrun on betelgeuse is CRITICAL: CRITICAL: Puppet has 40 failures [12:35:08] PROBLEM - check_puppetrun on alnitak is CRITICAL: CRITICAL: Puppet has 2 failures [12:37:24] (PS2) Filippo Giunchedi: swift: let mount_filesystem fail on unmountable fs [puppet] - https://gerrit.wikimedia.org/r/269980 (https://phabricator.wikimedia.org/T126574) [12:40:08] RECOVERY - check_puppetrun on betelgeuse is OK: OK: Puppet is currently enabled, last run 214 seconds ago with 0 failures [12:40:09] PROBLEM - check_puppetrun on alnitak is CRITICAL: CRITICAL: Puppet has 2 failures [12:40:16] 2016-02-16 12:39:31 exim 4.82 daemon started: pid=19364, -q10m, listening for SMTP on port 25 (IPv6 and IPv4) [12:40:25] Operations, Phabricator: iridium / phabricator not accepting email via smtp - https://phabricator.wikimedia.org/T127053#2030906 (Aklapper) p:Triage>High [12:40:25] I restarted it [12:43:23] apergos: sigh, odd [12:43:37] Operations, Phabricator: iridium / phabricator not accepting email via smtp - 
https://phabricator.wikimedia.org/T127053#2030908 (ArielGlenn) From yesterday's log I see: 2016-02-15 20:30:57 exim 4.82 daemon started: pid=4631, -q10m, not listening for SMTP This may have occurred during the phab maintenance w... [12:45:08] PROBLEM - check_puppetrun on alnitak is CRITICAL: CRITICAL: Puppet has 2 failures [12:47:44] Operations, Phabricator: iridium / phabricator not accepting email via smtp - https://phabricator.wikimedia.org/T127053#2030910 (ArielGlenn) Yes, checking the syslog confirms that this is due to the one puppet run done at that time. But why restarting the service 'just worked'... no idea. [12:50:08] PROBLEM - check_puppetrun on alnitak is CRITICAL: CRITICAL: Puppet has 2 failures [12:53:29] Operations: pinentry-gtk2 pulls in a lot of unneeded Gnome/GTK libs - https://phabricator.wikimedia.org/T127054#2030914 (MoritzMuehlenhoff) NEW [12:54:12] apergos: doesn't look like we're out of the woods yet [12:54:21] 2016-02-16 12:52:24 unexpected disconnection while reading SMTP command from iridium.eqiad.wmnet (localhost.localdomain) [2620:0:861:103:10:64:32:150]:53328 I=[2620:0:861:3:208:80:154:76]:25 [12:54:47] boo [12:55:09] RECOVERY - check_puppetrun on alnitak is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [12:55:45] Operations, Phabricator: iridium / phabricator not accepting email via smtp - https://phabricator.wikimedia.org/T127053#2030921 (ArielGlenn) Feb 15 20:31:13 iridium puppet-agent[2616]: (/Stage[main]/Exim4/File[/etc/default/exim4]/content) +++ /tmp/puppet file20160215-2616-1b6m86a#0112016-02-15 20:31:13.66... [13:00:20] godog: where are you seeing that? [13:00:35] Operations, Phabricator: iridium / phabricator not accepting email via smtp - https://phabricator.wikimedia.org/T127053#2030935 (fgiunchedi) likely not related but each mail from iridium also generates an exim mainlog entry for `localhost.localdomain` not found ``` 2016-02-16 12:36:04 no IP address found... 
[13:01:07] apergos: mx1001 [13:01:21] I assume they are still coming in [13:01:23] ? [13:02:41] tcp 0 0 0.0.0.0:25 0.0.0.0:* LISTEN 19364/exim4 [13:02:44] definitely there [13:08:45] so mx1001 is trying to talk to iridium via ipv6, but afaics port 25 is firewalled for ipv6 but not ipv4 on iridium [13:08:47] godog: I just saw a 250 Accepted go out to mx1001 [13:08:49] and a [13:08:51] oohhhhh [13:09:38] and I was gonna say and a QUIT back but that would explain it [13:09:47] it could have been like that for a long time, I wonder if we would have noticed [13:10:22] ACCEPT tcp -- mx1001.wikimedia.org anywhere tcp dpt:smtp [13:10:30] but I have no idea how to check ipv6 vs v4 [13:11:07] ip6tables vs iptables [13:12:55] Operations, ops-eqiad: ms-be1008.eqiad.wmnet: slot=9 dev=sdj failed - https://phabricator.wikimedia.org/T127060#2030977 (fgiunchedi) NEW [13:13:53] yeah it times out after 5m and tries ipv4 instead [13:15:55] Operations, Phabricator: iridium / phabricator not accepting email via smtp - https://phabricator.wikimedia.org/T127053#2030990 (fgiunchedi) looks like it is back now, additionally mx1001 / mx2001 ipv6 addresses are firewalled off iridium, in this case exim falls back to ipv4 after 5 minutes ``` root@mx10... [13:19:10] Operations, Phabricator: iridium / phabricator not accepting email via smtp - https://phabricator.wikimedia.org/T127053#2031003 (fgiunchedi) p:High>Normal [13:27:49] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0] [13:28:00] Operations, Editing-Department, Performance-Team, Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2031042 (elukey) Info about how Ori's script gathers data from memcached instances: ``` #!/bin/bash # Log 10 seconds of memcached activity to a file. # Meant to... 
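Note on T127053: besides comparing `iptables` and `ip6tables` output on the host itself, the "ipv6 vs v4" question raised above (13:08 to 13:15) can also be probed from the sending side by attempting a TCP connect per address family. The following is a minimal sketch under assumptions, not what was actually run here; `check_port` and `check_dual_stack` are hypothetical helper names, and any host name passed in (for example `iridium.eqiad.wmnet`) is simply taken from the log.

```python
import socket


def check_port(host, port, family, timeout=5.0):
    """Return True if a TCP connect to host:port succeeds over the given address family."""
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        # The host has no address of this family (or does not resolve at all).
        return False
    for fam, socktype, proto, _canonname, addr in infos:
        try:
            with socket.socket(fam, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(addr)
                return True
        except OSError:
            # Refused, unreachable or timed out (e.g. packets dropped by a firewall).
            continue
    return False


def check_dual_stack(host, port=25, timeout=5.0):
    """Probe the same port separately over IPv4 and IPv6."""
    return {
        "ipv4": check_port(host, port, socket.AF_INET, timeout),
        "ipv6": check_port(host, port, socket.AF_INET6, timeout),
    }
```

A result where `ipv4` is true and `ipv6` is false would be consistent with the situation described above, where the sender only gets through after falling back to IPv4.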
[13:30:20] Operations, Phabricator: iridium / phabricator not accepting email via smtp - https://phabricator.wikimedia.org/T127053#2031048 (ArielGlenn) here's what's on iridium, from manifests/role/mail.pp role::mail::mx ferm::service { 'exim-smtp': proto => 'tcp', port => '25', } iptable... [13:32:55] Operations, Phabricator: iridium / phabricator not accepting email via smtp - https://phabricator.wikimedia.org/T127053#2031058 (ArielGlenn) ip6tables shows no such similar rule so I guess the ferm rule is wrong. if that's the case other hosts could be impacted too. [13:37:13] apergos: going to lunch, happy to code review patches for T127053 tho [13:37:13] T127053: iridium / phabricator not accepting email via smtp - https://phabricator.wikimedia.org/T127053 [13:37:32] nice stashbot. gooood stashbot [13:37:35] :-) [13:38:41] (PS1) BBlack: tlsproxy logrotate: reduce from 52 to 28 days [puppet] - https://gerrit.wikimedia.org/r/270947 [13:41:28] (PS2) BBlack: tlsproxy logrotate: reduce from 52 to 28 days [puppet] - https://gerrit.wikimedia.org/r/270947 [13:41:50] (CR) BBlack: [C: 2 V: 2] tlsproxy logrotate: reduce from 52 to 28 days [puppet] - https://gerrit.wikimedia.org/r/270947 (owner: BBlack) [13:42:58] (PS2) BBlack: All DYNA standardized to TTL=600 [dns] - https://gerrit.wikimedia.org/r/270281 [13:48:33] (CR) BBlack: [C: 2] All DYNA standardized to TTL=600 [dns] - https://gerrit.wikimedia.org/r/270281 (owner: BBlack) [13:53:20] (PS2) BBlack: VCL: drop default ttl_cap to 21 days [puppet] - https://gerrit.wikimedia.org/r/269968 (https://phabricator.wikimedia.org/T124954) [13:55:24] (CR) BBlack: [C: 2] VCL: drop default ttl_cap to 21 days [puppet] - https://gerrit.wikimedia.org/r/269968 (https://phabricator.wikimedia.org/T124954) (owner: BBlack) [13:56:08] PROBLEM - swift-account-replicator on ms-be1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python 
/usr/bin/swift-account-replicator [13:56:09] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [24.0] [13:56:29] PROBLEM - swift-object-updater on ms-be1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [13:56:29] PROBLEM - swift-object-server on ms-be1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [13:56:29] PROBLEM - swift-container-auditor on ms-be1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [13:56:39] PROBLEM - swift-container-updater on ms-be1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [13:56:58] PROBLEM - swift-container-replicator on ms-be1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [13:57:00] PROBLEM - puppet last run on ms-be1008 is CRITICAL: CRITICAL: Puppet has 1 failures [13:57:09] PROBLEM - swift-object-replicator on ms-be1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [13:57:18] PROBLEM - swift-account-auditor on ms-be1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [13:57:19] PROBLEM - swift-account-server on ms-be1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [13:57:38] PROBLEM - swift-account-reaper on ms-be1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [13:57:49] PROBLEM - swift-container-server on ms-be1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [13:57:49] PROBLEM - swift-object-auditor on ms-be1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args 
^/usr/bin/python /usr/bin/swift-object-auditor [13:58:53] db errors? [13:59:37] ? [13:59:57] I'm checking, unrelated to switches [14:00:24] Operations, Phabricator: iridium / phabricator not accepting email via smtp - https://phabricator.wikimedia.org/T127053#2031132 (ArielGlenn) nope, looking at wrong tab. here's the class on iridium. ferm/conf.d/10_phabmain-smtp # Autogenerated by puppet. DO NOT EDIT BY HAND! # # &R_SERVICE(tcp, 25, (@re... [14:00:29] switches? [14:00:37] whatever you are rebooting [14:00:39] :-) [14:00:45] I'm not rebooting anything [14:00:51] ah, sorry [14:03:29] I think it is only mediawiki depooling some old servers that could not keep up [14:03:50] there seems to be some imports going on [14:05:00] yeah, some heavy api queries [14:10:39] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [14:19:22] (PS6) Ema: Maps VCL initial forward-port to Varnish 4 [puppet] - https://gerrit.wikimedia.org/r/269466 (https://phabricator.wikimedia.org/T124279) [14:20:13] (PS1) BBlack: ttl_fixed: limit to tier-1 backends [puppet] - https://gerrit.wikimedia.org/r/270957 [14:22:23] (CR) Ottomata: [C: 1] "Wow, somehow, I had no idea this was a thing! UmmMmmm, yeah sure, +1, but we should probably move this to another host than eventlog1001." 
[puppet] - https://gerrit.wikimedia.org/r/270920 (https://phabricator.wikimedia.org/T113343) (owner: Muehlenhoff) [14:23:26] (PS1) ArielGlenn: exim should listen to smtp on ipv4 and ipv6 for phabricator [puppet] - https://gerrit.wikimedia.org/r/270960 (https://phabricator.wikimedia.org/T127053) [14:24:22] (CR) ArielGlenn: "no idea if this is right syntax or even the right way to do it, just copy-paste from somewhere else for the moment" [puppet] - https://gerrit.wikimedia.org/r/270960 (https://phabricator.wikimedia.org/T127053) (owner: ArielGlenn) [14:28:20] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [24.0] [14:29:56] Blocked-on-Operations, Operations, RESTBase, RESTBase-Cassandra, Patch-For-Review: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#2031262 (Eevans) [14:34:00] RECOVERY - swift-object-updater on ms-be1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [14:34:05] that's me ^ [14:34:08] RECOVERY - swift-object-server on ms-be1008 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [14:34:08] RECOVERY - swift-container-auditor on ms-be1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:34:10] RECOVERY - swift-container-updater on ms-be1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [14:34:30] RECOVERY - swift-container-replicator on ms-be1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [14:34:49] RECOVERY - swift-object-replicator on ms-be1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [14:34:58] RECOVERY - swift-account-auditor on ms-be1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python 
/usr/bin/swift-account-auditor [14:34:59] RECOVERY - swift-account-server on ms-be1008 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [14:35:18] RECOVERY - swift-account-reaper on ms-be1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [14:35:29] RECOVERY - swift-container-server on ms-be1008 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [14:35:29] RECOVERY - swift-object-auditor on ms-be1008 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [14:35:38] RECOVERY - swift-account-replicator on ms-be1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [14:35:49] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [14:35:49] PROBLEM - puppet last run on elastic1005 is CRITICAL: CRITICAL: Puppet has 1 failures [14:38:10] (CR) Ottomata: Add ferm rule for eventlogging zmq forwarder service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/270915 (owner: Muehlenhoff) [14:41:05] Puppet, Ruby: Fix easy problems reported by RuboCop in operations/puppet - https://phabricator.wikimedia.org/T112651#2031315 (hashar) Open>Resolved Bulk of rubocop issues have been fixed. A lot of changes have been abandoned but rubocop is running nonetheless. When there is interest in being rubocop... [14:41:16] Operations, Phabricator: iridium / phabricator not accepting email via smtp - https://phabricator.wikimedia.org/T127053#2031318 (ArielGlenn) https://gerrit.wikimedia.org/r/#/c/270960/ why did this not show up on the ticket when the task number is in the commit? Anyways.. [14:42:03] (CR) Cenarium: ">Oh, you mean having either of those groups should prevent you from being autopromoted?" 
[mediawiki-config] - https://gerrit.wikimedia.org/r/270660 (https://phabricator.wikimedia.org/T126607) (owner: Alex Monk) [14:45:48] (CR) Ema: [C: 1] "LGTM and to pcc:" [puppet] - https://gerrit.wikimedia.org/r/270957 (owner: BBlack) [14:46:19] (CR) Filippo Giunchedi: [C: -1] exim should listen to smtp on ipv4 and ipv6 for phabricator (2 comments) [puppet] - https://gerrit.wikimedia.org/r/270960 (https://phabricator.wikimedia.org/T127053) (owner: ArielGlenn) [14:48:51] Operations, Beta-Cluster-Infrastructure, MediaWiki-extensions-GettingStarted: GettingStarted on Beta Cluster periodically loses its Redis index - https://phabricator.wikimedia.org/T100515#2031357 (Aklapper) Any news here? Still happening, still "high priority"? [14:49:20] (PS1) Ottomata: Remove {templates,files}/hadoop files [puppet] - https://gerrit.wikimedia.org/r/270963 [14:49:35] (PS2) ArielGlenn: phabricator: allow port 25 for ipv6 too [puppet] - https://gerrit.wikimedia.org/r/270960 (https://phabricator.wikimedia.org/T127053) [14:50:11] (CR) BBlack: [C: 2] ttl_fixed: limit to tier-1 backends [puppet] - https://gerrit.wikimedia.org/r/270957 (owner: BBlack) [14:51:03] (CR) Filippo Giunchedi: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/270960 (https://phabricator.wikimedia.org/T127053) (owner: ArielGlenn) [14:51:15] (PS2) Ottomata: Remove {templates,files}/hadoop files [puppet] - https://gerrit.wikimedia.org/r/270963 [14:51:21] (CR) Ottomata: [C: 2 V: 2] Remove {templates,files}/hadoop files [puppet] - https://gerrit.wikimedia.org/r/270963 (owner: Ottomata) [14:52:01] (CR) ArielGlenn: phabricator: allow port 25 for ipv6 too (2 comments) [puppet] - https://gerrit.wikimedia.org/r/270960 (https://phabricator.wikimedia.org/T127053) (owner: ArielGlenn) [14:53:09] PROBLEM - OTRS SMTP on mendelevium is CRITICAL: Connection refused [14:53:11] (CR) ArielGlenn: "for where this comes from, see" 
[puppet] - https://gerrit.wikimedia.org/r/270960 (https://phabricator.wikimedia.org/T127053) (owner: ArielGlenn) [14:53:29] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [24.0] [14:54:58] RECOVERY - OTRS SMTP on mendelevium is OK: SMTP OK - 0.012 sec. response time [14:58:53] !log rebooting nescio [14:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:59:39] ACKNOWLEDGEMENT - puppet last run on ms-be1008 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi sdj broken [15:00:04] hoo aude: Dear anthropoid, the time has come. Please deploy Wikidata configuration changes (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160216T1500). [15:00:35] Operations, Editing-Department, Performance-Team, Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2031416 (elukey) I tried to re-run my calculations for mc1007 and mc1005 (one hour timespan) as above but I didn't see the huge discrepancy that ori spotted with m... 
[15:00:49] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [15:01:25] * aude hides :P [15:02:30] RECOVERY - puppet last run on elastic1005 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [15:04:05] (PS1) Aude: Enable external identifier data type on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/270965 (https://phabricator.wikimedia.org/T125633) [15:09:13] (CR) Aude: [C: 2] Replace the sidebar link to commons with the commons category [mediawiki-config] - https://gerrit.wikimedia.org/r/270701 (https://phabricator.wikimedia.org/T126960) (owner: Hoo man) [15:09:29] !log upgrades OTRS to 5.0.7 [15:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:10:33] (Merged) jenkins-bot: Replace the sidebar link to commons with the commons category [mediawiki-config] - https://gerrit.wikimedia.org/r/270701 (https://phabricator.wikimedia.org/T126960) (owner: Hoo man) [15:13:02] (CR) JanZerebecki: [C: 1] Enable external identifier data type on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/270965 (https://phabricator.wikimedia.org/T125633) (owner: Aude) [15:13:32] !log aude@tin Synchronized wmf-config/Wikibase-production.php: Link commons sidebar link to commons category (duration: 00m 59s) [15:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:14:20] looks good [15:15:12] (CR) Aude: [C: 2] Basic "Identifiers" statement section config for Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/263046 (https://phabricator.wikimedia.org/T123112) (owner: Thiemo Mättig (WMDE)) [15:16:01] (Merged) jenkins-bot: Basic "Identifiers" statement section config for Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/263046 (https://phabricator.wikimedia.org/T123112) (owner: Thiemo Mättig (WMDE)) [15:17:55] !log aude@tin Synchronized 
wmf-config/Wikibase.php: Enable identifiers section on Wikidata (duration: 01m 00s) [15:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:18:36] https://test.wikidata.org/wiki/Q22 looks good :) [15:19:57] (PS1) Giuseppe Lavagetto: nutcracker: re-organize redis servers list [puppet] - https://gerrit.wikimedia.org/r/270969 [15:20:56] (PS1) Filippo Giunchedi: swift: run swift-drive-audit staggered once a day [puppet] - https://gerrit.wikimedia.org/r/270970 (https://phabricator.wikimedia.org/T126574) [15:21:58] (CR) Aude: [C: 2] Enable external identifier data type on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/270965 (https://phabricator.wikimedia.org/T125633) (owner: Aude) [15:22:39] (Merged) jenkins-bot: Enable external identifier data type on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/270965 (https://phabricator.wikimedia.org/T125633) (owner: Aude) [15:24:18] (PS3) MarcoAurelio: Adding WP and WT as namespace aliases for tawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/269970 (https://phabricator.wikimedia.org/T126604) [15:24:35] !log aude@tin Synchronized wmf-config/Wikibase-production.php: Enable external identifier data type on Wikidata (duration: 00m 57s) [15:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:28:10] PROBLEM - Disk space on ms-be2017 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdm1 is not accessible: Input/output error [15:28:29] (CR) JanZerebecki: [C: 1] Take "in other projects" sidebar out of beta [mediawiki-config] - https://gerrit.wikimedia.org/r/270939 (https://phabricator.wikimedia.org/T103102) (owner: Aude) [15:28:37] ok :) [15:29:16] Operations, Analytics-Cluster, hardware-requests: eqiad: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#2031518 (Ottomata) Bump! 
[15:29:40] (CR) Aude: [C: 2] Take "in other projects" sidebar out of beta [mediawiki-config] - https://gerrit.wikimedia.org/r/270939 (https://phabricator.wikimedia.org/T103102) (owner: Aude) [15:30:27] (Merged) jenkins-bot: Take "in other projects" sidebar out of beta [mediawiki-config] - https://gerrit.wikimedia.org/r/270939 (https://phabricator.wikimedia.org/T103102) (owner: Aude) [15:30:49] paravoid: I'm going to merge https://gerrit.wikimedia.org/r/#/c/267262/2 unless there are objections? [15:32:19] !log aude@tin Synchronized wmf-config/Wikibase.php: Enable in other projects sidebar feature by default, out of beta (duration: 00m 58s) [15:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:32:50] godog: well jynus raised the valid point (on task) that we don't really need it [15:32:54] (PS2) MarcoAurelio: Termporary lift of IP cap for an Edit-a-thon [mediawiki-config] - https://gerrit.wikimedia.org/r/270646 (https://phabricator.wikimedia.org/T126939) [15:33:12] and that cciss_vol_status does everything we need nowadays [15:33:38] I think it's up to whoever is going to write that RAID check :) [15:33:43] I would -1 that, but I have not implemented the check to do one or the other [15:33:46] Operations, ops-eqiad: Failed drive in labstore1001 array - https://phabricator.wikimedia.org/T127076#2031522 (chasemp) NEW a:Cmjohnson [15:34:03] so I cannot vote :-) [15:34:13] !log aude@tin Synchronized wmf-config/InitialiseSettings.php: Remove other projects sidebar beta feature and per wiki settings (duration: 01m 09s) [15:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:34:27] paravoid: ah, can it also manipulate drives? 
I'd need that for swift to offline disks for example cc jynus [15:34:49] ACKNOWLEDGEMENT - RAID on labstore1001 is CRITICAL: CRITICAL: Active: 73, Working: 73, Failed: 1, Spare: 0 cpettet https://phabricator.wikimedia.org/T127076 [15:35:13] godog, I cannot say for sure, only investigated for monitoring a bit [15:35:24] doubt it [15:36:17] * aude done [15:36:52] ok, looks like we need it to change controller status but monitoring is fine with gpl tools [15:37:25] (PS2) Giuseppe Lavagetto: nutcracker: re-organize redis servers list [puppet] - https://gerrit.wikimedia.org/r/270969 [15:41:45] (PS3) MarcoAurelio: Cleanup: removing expired event [mediawiki-config] - https://gerrit.wikimedia.org/r/270648 [15:42:24] Operations, ops-eqiad, Patch-For-Review: decom caesium - https://phabricator.wikimedia.org/T125165#2031555 (Papaul) a:Papaul>Cmjohnson This server is in Eqiad so assigning the ticket to Chris [15:45:26] Operations, Discovery, Wikimedia-Logstash, Discovery-Search-Sprint, Elasticsearch: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697#2031560 (Gehel) [15:46:51] Operations, ops-codfw: ms-be2016.codfw.wmnet: slot=0 dev=sdi failed - https://phabricator.wikimedia.org/T126630#2031565 (Papaul) @fgiunchedi you mentioned slot 1 on the task, but I also see a failed drive on slot 6. Can you please verify and confirm. 
Thanks [15:47:42] (PS1) Volans: Repool of db1022 after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/270975 (https://phabricator.wikimedia.org/T120122) [15:48:41] (PS2) Volans: Repool of db1022 after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/270975 (https://phabricator.wikimedia.org/T120122) [15:50:30] RECOVERY - cassandra-b CQL 10.64.0.231:9042 on restbase1007 is OK: TCP OK - 0.001 second response time on port 9042 [15:52:23] (PS2) Giuseppe Lavagetto: logrotate: convert l10nupdate to logrotate::conf [puppet] - https://gerrit.wikimedia.org/r/270918 (https://phabricator.wikimedia.org/T127025) (owner: Volans) [15:52:41] (CR) Giuseppe Lavagetto: [C: 2 V: 2] logrotate: convert l10nupdate to logrotate::conf [puppet] - https://gerrit.wikimedia.org/r/270918 (https://phabricator.wikimedia.org/T127025) (owner: Volans) [15:54:13] !log running `nodetool cleanup' on restbase{1,2,7-a}.eqiad.wmnet (bootstrap of 1007-b now complete) [15:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:54:24] (CR) Jcrespo: [C: -1] Repool of db1022 after maintenance (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/270975 (https://phabricator.wikimedia.org/T120122) (owner: Volans) [15:55:02] ^this is more of a lol than anything. I am not that a jerk. [15:55:39] PROBLEM - puppet last run on ms-be2017 is CRITICAL: CRITICAL: Puppet has 1 failures [15:55:45] lol [15:57:23] (PS3) Volans: Repool of db1022 after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/270975 (https://phabricator.wikimedia.org/T120122) [15:57:47] there you go :) [16:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160216T1600). 
[16:00:04] Dereckson Krenair bmansurov kart_ mafk andrewbogott: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [16:00:21] what happened to ‘please do the needful’? [16:00:23] (03CR) 10Jcrespo: [C: 031] Repool of db1022 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270975 (https://phabricator.wikimedia.org/T120122) (owner: 10Volans) [16:00:34] * mafk will be avalaible during the process [16:01:30] andrewbogott, I think it is randomized [16:01:43] I can SWAT. Dereckson ping for SWAT [16:01:43] ah, in that case I approve :) [16:02:24] thcipriani: so my weird thing is that I rolled out a bunch of change on mira but didn’t do a full scap. When the deployment host switched from mira to tin, all those patches were dropped... [16:02:41] thcipriani: I'm here too. [16:02:49] so a simple rebase of the branch and project update should get me back where I want to be [16:02:58] RECOVERY - Disk space on ms-be2017 is OK: DISK OK [16:03:18] anomie: ping - that session error sh-t happening again :( [16:03:28] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270646 (https://phabricator.wikimedia.org/T126939) (owner: 10MarcoAurelio) [16:03:59] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270648 (owner: 10MarcoAurelio) [16:04:01] (03Merged) 10jenkins-bot: Termporary lift of IP cap for an Edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270646 (https://phabricator.wikimedia.org/T126939) (owner: 10MarcoAurelio) [16:04:31] (03Merged) 10jenkins-bot: Cleanup: removing expired event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270648 (owner: 10MarcoAurelio) [16:04:40] andrewbogott: huh, that is strange, sync-masters should have ran even if you just sync'd a file... [16:05:01] thcipriani: I think I missed a step. 
I did sync-common on silver which got the changes where I wanted them [16:05:27] andrewbogott: ah, yeah, that wouldn't trigger sync-masters. Okie doke. [16:05:27] right, that's a corner-ish case [16:06:38] it wouldn’t have mattered at all, but for the mira->tin switch [16:07:27] here [16:08:56] !log thcipriani@tin Synchronized wmf-config/throttle.php: Termporary lift of IP cap for an Edit-a-thon [[gerrit:270646]], Cleanup: removing expired event [[gerrit:270648]] (duration: 00m 58s) [16:08:59] ^ mafk throttles sync'd [16:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:09:16] thcipriani: thanks, hope it works [16:09:31] mafk: yw, me too :) [16:09:36] andrewbogott: so what do you need me to do? [16:10:06] andrewbogott: ah, I see, Openstack manager submodule is behind on wmf.13, correct? [16:10:19] thcipriani: right [16:10:33] that should be it, no need to merge patches [16:10:38] since they’re all merged already [16:11:09] andrewbogott: yup, needs a sync-dir, too? [16:11:29] yeah, I’d expect so [16:12:19] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270344 (https://phabricator.wikimedia.org/T125946) (owner: 10Jhobs) [16:12:43] !log thcipriani@tin Synchronized php-1.27.0-wmf.13/extensions/OpenStackManager: SWAT: rebase openstack submodule (duration: 00m 58s) [16:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:12:51] (03PS7) 10Ema: Maps VCL initial forward-port to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/269466 (https://phabricator.wikimedia.org/T124279) [16:12:52] ^ andrewbogott should be up-t-date now [16:13:10] thcipriani: yep, looks better, thank you [16:13:23] (03Merged) 10jenkins-bot: Enable survey at reduced sample rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270344 (https://phabricator.wikimedia.org/T125946) (owner: 10Jhobs) [16:13:30] andrewbogott: awesome, thanks for checking. 
[16:13:57] yeah, now logins take me 5 seconds instead of 5 minutes. Much better! [16:14:29] blerg error: insufficient permission for adding an object to repository database .git/objects [16:14:58] it looks like some of the /srv/mediawiki-staging/.git/objects subdirectories are owned by root:root [16:15:17] ^ can I get an opsen to intervene so I can continue SWAT? [16:16:14] wait...looks like I can fetch now for some reason. still see a handful of subdirs with root:root though... [16:16:37] 6Operations, 10ops-codfw: ms-be2016.codfw.wmnet: slot=0 dev=sdi failed - https://phabricator.wikimedia.org/T126630#2031710 (10fgiunchedi) thanks @papaul, indeed the reporting is wrong, I've blinked the led for the logical drive 9 ``` array I Logical Drive: 9 Size: 3.6 TB Fault Toler... [16:17:58] bmansurov: syncing your change now, FYI. [16:18:03] ok [16:18:16] thcipriani: I wonder if it'd be possible to swat too https://gerrit.wikimedia.org/r/#/c/269970/ ? [16:18:27] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable survey at reduced sample rate [[gerrit:270344]] (duration: 01m 00s) [16:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:18:32] if not, I've scheduled for tomorrow [16:19:32] bmansurov: lots of fatals reverting [16:19:41] (03PS1) 10Volans: database: Reimage db1021 with Jessie [puppet] - 10https://gerrit.wikimedia.org/r/270983 (https://phabricator.wikimedia.org/T126996) [16:19:46] thcipriani: ok [16:19:49] Argument 1 passed to QuickSurveys\SurveyFactory::factory() must be an instance of array, bool given [16:20:09] PROBLEM - HHVM rendering on mw2041 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50631 bytes in 0.219 second response time [16:20:10] PROBLEM - HHVM rendering on mw1193 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50631 bytes in 0.082 second response time [16:20:10] PROBLEM - HHVM rendering on mw2169 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 
Internal Server Error - 50631 bytes in 0.211 second response time [16:20:10] PROBLEM - HHVM rendering on mw1242 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50631 bytes in 0.053 second response time [16:20:10] PROBLEM - HHVM rendering on mw1115 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50632 bytes in 0.107 second response time [16:20:10] PROBLEM - HHVM rendering on mw2178 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50631 bytes in 0.213 second response time [16:20:10] PROBLEM - HHVM rendering on mw2123 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50631 bytes in 0.218 second response time [16:20:11] PROBLEM - HHVM rendering on mw2126 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50631 bytes in 0.220 second response time [16:20:18] PROBLEM - HHVM rendering on mw1103 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50631 bytes in 0.076 second response time [16:20:18] PROBLEM - HHVM rendering on mw1048 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50631 bytes in 0.076 second response time [16:20:18] PROBLEM - HHVM rendering on mw1112 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50631 bytes in 0.087 second response time [16:20:18] PROBLEM - HHVM rendering on mw1225 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50631 bytes in 0.057 second response time [16:20:18] PROBLEM - HHVM rendering on mw1071 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50631 bytes in 0.077 second response time [16:20:19] PROBLEM - HHVM rendering on mw1228 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50631 bytes in 0.064 second response time [16:20:19] PROBLEM - HHVM rendering on mw1155 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50608 bytes in 0.082 second response time [16:20:20] PROBLEM - HHVM rendering on mw1054 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal 
Server Error - 50631 bytes in 0.078 second response time [16:20:20] PROBLEM - HHVM rendering on mw1049 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50631 bytes in 0.075 second response time [16:20:21] PROBLEM - HHVM rendering on mw2088 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50608 bytes in 0.228 second response time [16:20:21] PROBLEM - HHVM rendering on mw2051 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50631 bytes in 0.213 second response time [16:20:22] PROBLEM - HHVM rendering on mw1234 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50632 bytes in 0.101 second response time [16:20:25] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Revert "Enable survey at reduced sample rate" (duration: 01m 01s) [16:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:20:50] uh [16:20:52] ^ that should fix the HTTP Critical [16:21:08] thcipriani: are permissions still messed up on tin? I can help if you know what you need [16:21:08] PROBLEM - HHVM rendering on mw1195 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50631 bytes in 0.083 second response time [16:21:09] PROBLEM - HHVM rendering on mw1147 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50631 bytes in 0.090 second response time [16:21:16] <_joe_> this looks pretty bad [16:21:18] PROBLEM - HHVM rendering on mw1230 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50631 bytes in 0.067 second response time [16:21:20] what's up? 
[16:21:22] andrewbogott: nope, should be fixed [16:21:27] <_joe_> paravoid: bad deploy [16:21:29] ok [16:21:31] <_joe_> bad code I mean [16:21:39] PROBLEM - HHVM rendering on mw1218 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50631 bytes in 0.082 second response time [16:21:48] PROBLEM - HHVM rendering on mw1251 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50631 bytes in 0.065 second response time [16:21:58] PROBLEM - HHVM rendering on mw1123 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50631 bytes in 0.073 second response time [16:21:58] PROBLEM - HHVM rendering on mw1146 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50631 bytes in 0.073 second response time [16:21:59] RECOVERY - HHVM rendering on mw2041 is OK: HTTP OK: HTTP/1.1 200 OK - 68764 bytes in 0.264 second response time [16:21:59] RECOVERY - HHVM rendering on mw1193 is OK: HTTP OK: HTTP/1.1 200 OK - 68756 bytes in 0.103 second response time [16:21:59] PROBLEM - HHVM rendering on mw1081 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50631 bytes in 0.075 second response time [16:22:00] RECOVERY - HHVM rendering on mw2169 is OK: HTTP OK: HTTP/1.1 200 OK - 68763 bytes in 0.253 second response time [16:22:00] RECOVERY - HHVM rendering on mw1242 is OK: HTTP OK: HTTP/1.1 200 OK - 68756 bytes in 0.089 second response time [16:22:00] RECOVERY - HHVM rendering on mw1115 is OK: HTTP OK: HTTP/1.1 200 OK - 68758 bytes in 0.136 second response time [16:22:00] RECOVERY - HHVM rendering on mw2123 is OK: HTTP OK: HTTP/1.1 200 OK - 68765 bytes in 0.281 second response time [16:22:01] RECOVERY - HHVM rendering on mw2126 is OK: HTTP OK: HTTP/1.1 200 OK - 68764 bytes in 0.259 second response time [16:22:01] RECOVERY - HHVM rendering on mw2178 is OK: HTTP OK: HTTP/1.1 200 OK - 68765 bytes in 0.337 second response time [16:22:02] <_joe_> thcipriani: what's the situation? 
[16:22:02] RECOVERY - HHVM rendering on mw1103 is OK: HTTP OK: HTTP/1.1 200 OK - 68757 bytes in 0.121 second response time [16:22:05] quicksurvey deploy Argument 1 passed to QuickSurveys\SurveyFactory::factory() must be an instance of array, bool given merged, deployed, now reverted [16:22:14] oh [16:22:15] _joe_: should now be fixed. [16:22:19] PROBLEM - HHVM rendering on mw1194 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50631 bytes in 0.052 second response time [16:22:24] why wasn't it caught in QA? [16:22:25] <_joe_> yeah let me check some hosts [16:22:49] PROBLEM - HHVM rendering on mw1241 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50631 bytes in 0.080 second response time [16:23:02] :( [16:23:04] <_joe_> thcipriani: you need to probably deploy again [16:23:20] _joe_: same deploy? [16:23:22] <_joe_> I have several machines still spitting 500s [16:23:23] (03CR) 10Ottomata: [C: 031] Add ferm rule for eventlogging zmq forwarder service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/270915 (owner: 10Muehlenhoff) [16:23:30] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [16:23:35] <_joe_> thcipriani: yeah or I just restart those [16:23:35] _joe_: kk, going. [16:23:50] how bad was it? only a percentage of requests? 
[16:24:11] <_joe_> jynus: 100% AFAICS [16:24:15] oh [16:24:31] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Revert "Enable survey at reduced sample rate" (duration: 00m 58s) [16:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:24:39] RECOVERY - HHVM rendering on mw1241 is OK: HTTP OK: HTTP/1.1 200 OK - 68756 bytes in 0.083 second response time [16:24:48] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [16:24:49] RECOVERY - HHVM rendering on mw1195 is OK: HTTP OK: HTTP/1.1 200 OK - 68766 bytes in 0.178 second response time [16:24:50] RECOVERY - HHVM rendering on mw1147 is OK: HTTP OK: HTTP/1.1 200 OK - 68758 bytes in 0.208 second response time [16:24:50] RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 68756 bytes in 0.100 second response time [16:24:53] <_joe_> jynus: actually no, thcipriani has been very fast [16:24:58] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [1000.0] [16:24:59] _joe_: 2nd sync done. 
[16:25:04] <_joe_> thcipriani: thanks [16:25:13] so far in reqstats for text, it looks like 0.5% (rate of 503 spike vs rate of normal requests), but stats still coming in [16:25:18] RECOVERY - HHVM rendering on mw1218 is OK: HTTP OK: HTTP/1.1 200 OK - 68757 bytes in 0.110 second response time [16:25:28] RECOVERY - HHVM rendering on mw1251 is OK: HTTP OK: HTTP/1.1 200 OK - 68756 bytes in 0.081 second response time [16:25:30] RECOVERY - HHVM rendering on mw1123 is OK: HTTP OK: HTTP/1.1 200 OK - 68758 bytes in 0.121 second response time [16:25:36] <_joe_> bblack: yeah exactly [16:25:38] RECOVERY - HHVM rendering on mw1146 is OK: HTTP OK: HTTP/1.1 200 OK - 68758 bytes in 0.140 second response time [16:25:39] RECOVERY - HHVM rendering on mw1081 is OK: HTTP OK: HTTP/1.1 200 OK - 68758 bytes in 0.131 second response time [16:25:50] (03PS1) 10Bmansurov: Enable survey at reduced sample rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270985 (https://phabricator.wikimedia.org/T125946) [16:25:59] RECOVERY - HHVM rendering on mw1194 is OK: HTTP OK: HTTP/1.1 200 OK - 68756 bytes in 0.102 second response time [16:26:04] thcipriani: did you revert, or could you merge a follow up? https://gerrit.wikimedia.org/r/#/c/270985/ [16:26:12] the patch was supposedly for 0.05% if I read it correctly though [16:26:29] bmansurov: I reverted, I haven't pushed up the patch yet. 
[16:26:45] but anyways, the stats sample rate is too slow for how fast that was reverted, so we won't see the true rate there [16:26:47] thcipriani: ok, i'll rebase my patch once you're done [16:26:51] <_joe_> bblack: well the wikipedia main page and barack obama both thailed [16:26:53] (looks more like 1% now) [16:26:54] <_joe_> *failed [16:26:57] (03CR) 10Bmansurov: "We missed some required keys: https://gerrit.wikimedia.org/r/#/c/270985/1/wmf-config/InitialiseSettings.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270344 (https://phabricator.wikimedia.org/T125946) (owner: 10Jhobs) [16:27:40] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0] [16:27:42] it's 2016 and we still get outages for pushing broken code -- this happened again fairly recently [16:27:51] 16:19-16:25 ? [16:27:52] I'm puzzled on why this wasn't caught during QA [16:28:06] so, can we get a postmortem in any case? :) [16:28:43] I've got :19 -> :22 in varnish response stats for the total curve of the spike (all non-normal values inside that time window) [16:28:58] <_joe_> paravoid: I guess it was a wrong change to mediawiki-config [16:29:06] <_joe_> for which almost no one writes tests [16:29:07] 6Operations, 10ops-codfw: ms-be2016.codfw.wmnet: slot=0 dev=sdi failed - https://phabricator.wikimedia.org/T126630#2031766 (10Papaul) @fgiunchedi Thanks I will go ahead and request a replacement driver. [16:29:14] <_joe_> and yes, that is a problem [16:29:22] say :18 -> :23 since we don't know when during the minute it ramped in and out [16:30:34] 6Operations, 6Project-Admins: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#2031774 (10faidon) >>! In T119944#2007678, @Aklapper wrote: > Proposal looks good to me so feel free to go ahead. (Not sure "what the rest" means here, if there are specific points pleas... 
[16:30:59] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [16:31:05] https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?from=1455636632214&to=1455640232214&var-site=All&var-cache_type=text&var-status_type=5 [16:31:18] ^ is the data showing the 18-23-ish window of 503 response [16:31:30] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:32:18] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:32:24] (keep in mind cache hits would be unaffected, so maybe the peak %-of-all-requests is roughly accurate there at 1-2% ish) [16:32:28] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:32:29] (03PS1) 10Thcipriani: Revert "Enable survey at reduced sample rate" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270986 [16:33:14] 6Operations, 6Security-Team: Use user-specific passwords for accessing EventLogging database - https://phabricator.wikimedia.org/T120532#2031789 (10csteipp) @Ottomata, what would need to happen to trial either the ldap solution, or creating user accounts in the db itself? @jcrespo, do you know if the db serve... [16:33:23] from the varnish-http-errors dashboard looks more 19-27 [16:33:44] thcipriani: fast fingers, thanks [16:33:51] (03CR) 10Jcrespo: [C: 031] "Good." 
[puppet] - 10https://gerrit.wikimedia.org/r/270983 (https://phabricator.wikimedia.org/T126996) (owner: 10Volans) [16:34:39] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:34:45] fwiw, sync'd the initial patch at 8:18:27, sync'd the revert at 8:20:25 [16:34:48] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:34:55] indeed, thanks for the prompt response [16:35:26] 6Operations, 6Security-Team: Use user-specific passwords for accessing EventLogging database - https://phabricator.wikimedia.org/T120532#2031797 (10Ottomata) > what would need to happen to trial either the ldap solution, don't know much about it... > creating user accounts in the db itself? This could be done... [16:35:58] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270986 (owner: 10Thcipriani) [16:36:10] volans: yeah but that dashboard's queries are using some movingaverage type stuff... [16:36:18] (03PS2) 10Giuseppe Lavagetto: logrotate: Convert eventlogging-files to logrotate::conf [puppet] - 10https://gerrit.wikimedia.org/r/270923 (https://phabricator.wikimedia.org/T127025) (owner: 10Volans) [16:36:24] 7Puppet, 10Beta-Cluster-Infrastructure, 5Patch-For-Review, 7Tracking: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#2031799 (10greg) Please do not change to a goal. The reverse dependencies, as @aklapper pointed out, are very important. 
[16:36:38] (03Merged) 10jenkins-bot: Revert "Enable survey at reduced sample rate" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270986 (owner: 10Thcipriani) [16:36:43] whereas the varnish-aggregate-client-status-codes one doesn't try to average on the graphite queries, it's just raw data [16:36:51] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] logrotate: Convert eventlogging-files to logrotate::conf [puppet] - 10https://gerrit.wikimedia.org/r/270923 (https://phabricator.wikimedia.org/T127025) (owner: 10Volans) [16:36:57] ok, good to know [16:37:30] (03PS2) 10Giuseppe Lavagetto: logrotate: Convert eventlogging to logrotate::conf [puppet] - 10https://gerrit.wikimedia.org/r/270928 (https://phabricator.wikimedia.org/T127025) (owner: 10Volans) [16:37:47] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] logrotate: Convert eventlogging to logrotate::conf [puppet] - 10https://gerrit.wikimedia.org/r/270928 (https://phabricator.wikimedia.org/T127025) (owner: 10Volans) [16:40:12] (03PS2) 10Volans: database: Reimage db1021 with Jessie [puppet] - 10https://gerrit.wikimedia.org/r/270983 (https://phabricator.wikimedia.org/T126996) [16:42:14] 6Operations, 10ops-codfw: ms-be2017.codfw.wmnet: dev=sdm failed - https://phabricator.wikimedia.org/T127089#2031834 (10fgiunchedi) 3NEW [16:42:24] papaul: ^ [16:42:31] (03CR) 10Volans: [C: 032] database: Reimage db1021 with Jessie [puppet] - 10https://gerrit.wikimedia.org/r/270983 (https://phabricator.wikimedia.org/T126996) (owner: 10Volans) [16:42:50] godog: ok [16:45:21] thcipriani: how are we doing? [16:45:40] kart__: oh, didn't realize you were around :) [16:45:57] I can get your patch out the door. [16:46:21] Thanks :) [16:46:43] I was waiting and then 5xx apears, so didn't ping. [16:47:58] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267236 (owner: 10KartikMistry) [16:48:15] Hi. 
[16:48:44] Dereckson: hiya [16:49:08] (03Merged) 10jenkins-bot: CX: Remove ContentTranslationCorpora setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267236 (owner: 10KartikMistry) [16:49:34] thcipriani: sorry to be late, irssi didn't notify me as I had my window scrolled up and I didn't realize until now it were SWAT time. [16:51:45] (03PS2) 10Bmansurov: Enable survey at reduced sample rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270985 (https://phabricator.wikimedia.org/T125946) [16:52:07] thcipriani: this is the fixed patch ^ [16:52:51] 6Operations, 6Security-Team: Use user-specific passwords for accessing EventLogging database - https://phabricator.wikimedia.org/T120532#2031903 (10jcrespo) The idea would be to manage the grants on the server, use the LDAP passwords. That is possible. In theory, all eventlogging-related hosts have SSL deploy... [16:52:58] Dereckson: I had my hands full anyway :) [16:52:59] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: CX: Remove ContentTranslationCorpora setting PART I [[gerrit:267236]] (duration: 00m 58s) [16:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:54:12] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: CX: Remove ContentTranslationCorpora setting PART II [[gerrit:267236]] (duration: 00m 59s) [16:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:54:19] ^ kart__ sync'd [16:54:44] Checking. 
[16:56:42] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270679 (https://phabricator.wikimedia.org/T126914) (owner: 10Dereckson) [16:56:58] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270456 (https://phabricator.wikimedia.org/T112500) (owner: 10Dereckson) [16:57:27] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270556 (https://phabricator.wikimedia.org/T125068) (owner: 10Dereckson) [16:57:43] godog: apergos: is puppet swat happening? [16:57:45] (03Merged) 10jenkins-bot: Namespace configuration on ja.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270679 (https://phabricator.wikimedia.org/T126914) (owner: 10Dereckson) [16:57:56] soon as regular swat is done, yep, mobrovac [16:57:59] mobrovac: it is [16:58:04] kk cool [16:58:05] get on the calendar, you're first up [16:58:06] Dereckson: probably push all your changes at the same time if that's ok with you. [16:58:10] Ok. [16:58:15] apergos: :) [16:58:34] (03Merged) 10jenkins-bot: Remove *.ggpht.com from Wikimedia Commons upload whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270456 (https://phabricator.wikimedia.org/T112500) (owner: 10Dereckson) [16:59:21] (03Merged) 10jenkins-bot: Don't index NS_USER on cs.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270556 (https://phabricator.wikimedia.org/T125068) (owner: 10Dereckson) [17:00:04] godog apergos: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160216T1700). [17:00:20] ACKNOWLEDGEMENT - puppet last run on ms-be2017 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi sdm broken [17:00:33] thcipriani: let us know when SWAT is over [17:00:41] apergos: will do [17:01:05] last sync happening now [17:01:30] thcipriani: we're good. I can see entries logged in cx_corpora. 
[17:01:42] kart__: cool, thank you for checking. [17:01:55] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:270556]] [[gerrit:270679]] [[gerrit:270556]] (duration: 00m 59s) [17:01:57] ^ Dereckson check please [17:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:02:01] Testing. [17:02:46] thcipriani, want me to bump my one to the evening swat? [17:02:52] oh, wait [17:03:02] I think we want to get it done before the wmf.14 rollout, probably [17:03:10] wmf.14 prolly won't roll out today. [17:03:16] ok then [17:03:20] thcipriani: the three works [17:03:24] Dereckson: thank you. [17:03:26] Thanks for the deploy. [17:03:31] apergos: SWAT done [17:03:35] thanks [17:03:54] mobrovac: you add yourself to the deployments page yet? [17:04:13] apergos: not yet, writing the commit msg of the patch as we speak [17:04:18] :-) [17:04:20] will get there in 2 mins [17:04:23] Krenair: if you could move it that'd be great, thank you. Sorry I missed yours :( [17:05:57] (03PS1) 10Mobrovac: Mathoid: Add the render_no_check config option [puppet] - 10https://gerrit.wikimedia.org/r/270995 [17:07:01] !log Disabled puppet on db1021 for reimaging: T126996 [17:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:07:21] 6Operations, 10hardware-requests, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Log host for codfw (fluorine's equivalent) - https://phabricator.wikimedia.org/T126988#2031988 (10RobH) [17:08:39] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1001 is OK: OK - nfs-exports is active [17:08:52] apergos: i'm there - https://wikitech.wikimedia.org/wiki/Deployments#Tuesday.2C.C2.A0February.C2.A016 :) [17:09:05] mobrovac: what's the 'speech_on' thing? 
[17:09:08] that got changed [17:09:21] apergos: yes, config variable rename [17:09:26] same thing, different name [17:10:23] 6Operations, 10hardware-requests, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Log host for codfw (fluorine's equivalent) - https://phabricator.wikimedia.org/T126988#2032005 (10RobH) a:3RobH Floruine's Stats: * Dell PowerEdge R310 * Single Intel X3450 * 16GB RAM : 4 * 2GB DIMM DDR3 Synchronous 1333 MHz *... [17:10:49] mobrovac: LGTM cc apergos [17:12:26] mobrovac: it will auto restart mathoid by itself iirc? [17:12:31] where does that input check and normalization happen now, mobrovac, at restbase? [17:12:43] godog: yes [17:12:46] at time of insertion I mean? [17:13:25] apergos: no, still in mathoid, but there are two requests that restbase makes - one for check and normalisation, the other for the render, but by default mathoid redoes the check and norm, but it's not needed [17:13:38] ok initiated by rb [17:13:39] got it [17:13:42] yes [17:13:54] 6Operations, 10hardware-requests, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Log host for codfw (fluorine's equivalent) - https://phabricator.wikimedia.org/T126988#2029020 (10RobH) [17:14:06] yeah I'll merge this then [17:14:10] cool [17:14:18] (03CR) 10ArielGlenn: [C: 032] Mathoid: Add the render_no_check config option [puppet] - 10https://gerrit.wikimedia.org/r/270995 (owner: 10Mobrovac) [17:14:27] apergos: thanks! [17:14:35] not done uet [17:15:53] give me a mathoid host please? [17:16:08] scb1001 should have mathoid [17:16:11] 6Operations, 5Patch-For-Review, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2032039 (10Dzahn) >>! In T123525#2026727, @Danny_B wrote: > any objections? Yes, i would like to keep using this tracking task and not a workboard for this one. 
[17:16:12] scb1001 [17:16:16] apergos: ^ [17:17:38] now live on that host [17:17:51] kk, checkinf [17:18:13] apergos: lgtm on scb1001 [17:18:19] where does it log anyways? [17:18:42] apergos: /srv/log/mathoid/main.log [17:18:46] locally [17:18:53] when I looked there, there was nothing (no file0 [17:19:25] euh the file's there [17:19:27] 55M [17:20:00] oh /srv/log [17:20:01] heh [17:20:04] ok great [17:20:06] thanks [17:20:34] cheers! [17:21:34] (live on sbc1002 now, that's all of em) [17:21:54] no one signed up next, guess we lurk and see, godog [17:22:39] apergos: hehe ok! [17:22:42] 6Operations, 10ops-codfw: ms-be2017.codfw.wmnet: dev=sdm failed - https://phabricator.wikimedia.org/T127089#2032075 (10Papaul) a:3Papaul [17:23:38] 6Operations, 10ops-codfw: ms-be2016.codfw.wmnet: slot=0 dev=sdi failed - https://phabricator.wikimedia.org/T126630#2032080 (10Papaul) a:3Papaul [17:24:19] thnx godog and apergos! [17:24:46] 👍 [17:24:52] 6Operations, 10ops-codfw: es2011-es2019 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2032090 (10Papaul) [17:25:29] (03PS3) 10Krinkle: Make $wgLocalStylePath the same as $wgStylePath [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270264 [17:25:35] (03CR) 10Krinkle: [C: 032] Make $wgLocalStylePath the same as $wgStylePath [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270264 (owner: 10Krinkle) [17:26:11] (03Merged) 10jenkins-bot: Make $wgLocalStylePath the same as $wgStylePath [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270264 (owner: 10Krinkle) [17:26:30] !log Reboot+reimage of db1021 (T126996) [17:26:34] > error: insufficient permission for adding an object to repository database .git/objects [17:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:26:44] Looks like tin staging area is damaged or someone was naughty with sudo [17:27:51] Yeah, lots and lots of root/root references in .git/objects/* [17:27:52] Krinkle: I had the same problem during 
SWAT. it's strange, it cleared up after I tried to fetch 2 or 3 times. [17:28:02] Krenair: did you ever file a bug about the .git/objects permissions issues? [17:28:06] hm.. indeed [17:28:09] it's possibel that _joe_ or andrewbogott might know something about it (*might*) [17:28:13] Running it again fixed it [17:28:14] weird [17:28:18] oh huh [17:28:21] that is weird [17:28:31] Maybe it purges one level at a time [17:28:34] 6Operations, 6Phabricator, 5Patch-For-Review: iridium / phabricator not accepting email via smtp - https://phabricator.wikimedia.org/T127053#2032101 (10Aklapper) [17:28:35] and re-fetches from origin [17:28:36] bd808, don't think so [17:28:41] (if it knows they exist in origin) [17:30:15] Krinkle, thcipriani: this .git perms thing has been talked about on irc several times now. Somebody should open a bug to have it investigated. My guess is that there is some puppet something that is messing about but that's just a guess. [17:30:23] 6Operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#2032106 (10RobH) So we don't have any in warranty systems that match these specifications, as they are quite low and codfw only has the new high performance misc sys... [17:30:32] 6Operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#2032110 (10RobH) a:5RobH>3Joe [17:30:51] 6Operations, 10Traffic, 5Patch-For-Review: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2032116 (10BBlack) I took a look at another small sample of data today, over on the cache_upload clusters, which we'd expect to behave very differently. This was a single 10-minute run... 
[17:30:59] * thcipriani files task [17:31:44] !log krinkle@tin Synchronized wmf-config/CommonSettings.php: wgLocalStylePath cleanup (use /static) (duration: 00m 59s) [17:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:31:48] 6Operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#1890778 (10RobH) a:5Joe>3Ottomata Actually, I should have assigned to @ottomata as he was the initial requester. [17:33:20] 6Operations, 10Deployment-Systems, 6Performance-Team, 10Traffic, 5Patch-For-Review: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#2032126 (10Krinkle) [17:33:28] 6Operations, 10Deployment-Systems: error on tin:/srv/mediawiki-staging: insufficient permission for adding an object to repository database .git/objects - https://phabricator.wikimedia.org/T127093#2032127 (10thcipriani) 3NEW [17:35:28] 6Operations, 10ops-codfw: ms-be2016.codfw.wmnet: slot=0 dev=sdi failed - https://phabricator.wikimedia.org/T126630#2032142 (10Papaul) Hi Papaul, This is regarding Case Number 4768231130 for Proliant Server HP DL380 GEN9 12LFF CTO SERVER Issue : 1 Failed HDD **********PART DETAILS********** 693720-001 4TB H... [17:35:35] 6Operations, 10Deployment-Systems, 6Performance-Team, 10Traffic, 5Patch-For-Review: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#2032143 (10Krinkle) `$wgLocalStylePath` was using `/w/static/{wmfbranch}` instead o... [17:36:59] (03PS9) 10Krinkle: Set $wgResourceBasePath to "/w" for group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268715 (https://phabricator.wikimedia.org/T99096) [17:38:18] (03CR) 10Krinkle: [C: 032] "Scheduled for this week. To be riding (slightly ahead) along with the train. 
Not related to the MediaWiki branch, but using the same group" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268715 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [17:38:49] Testing the above on mw1017 now [17:38:55] (03Merged) 10jenkins-bot: Set $wgResourceBasePath to "/w" for group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268715 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [17:39:34] 6Operations, 10ops-codfw: ms-be2017.codfw.wmnet: dev=sdm failed - https://phabricator.wikimedia.org/T127089#2032159 (10Papaul) Hi Papaul, This is regarding Case Number 4768231443 for Proliant Server HP DL380 GEN9 12LFF CTO SERVER Serial Number: MXQ54205QZ Issue : 1 Failed HDD in Bay 13 **********... [17:41:25] 6Operations, 10ops-codfw: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2032165 (10RobH) 3NEW a:3RobH [17:46:21] 6Operations, 10Deployment-Systems: error on tin:/srv/mediawiki-staging: insufficient permission for adding an object to repository database .git/objects - https://phabricator.wikimedia.org/T127093#2032183 (10thcipriani) [17:48:16] !log krinkle@tin Synchronized wmf-config/CommonSettings.php: Enable wmfstatic (wgResourceBasePath) for group0 wikis (duration: 01m 00s) [17:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:49:41] 6Operations, 10ops-codfw: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2032190 (10BBlack) For ulsfo user-facing traffic, the lowest point of the day is approximately a symmetrical dip centered on 20:00 UTC (Noon Pacific). So if they need a 4-hour window, ask... 
[17:52:51] 6Operations, 10ops-codfw, 10Traffic: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2032198 (10BBlack) And to be clear, what I expect we'll do on our end is depool ulsfo from users in DNS in our `config-geo` ahead of the window (let's say 2 hours ahead? the T... [17:54:00] 6Operations, 10ops-codfw, 10Traffic: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2032212 (10emailbot) **`Rob Halsell`** replied via email on `Tue, 16 Feb 2016 09:53:06 -0800` `Re: [UnitedLayer #118704] SF8 - Wikimedia: PDU nic failure` > Phu/Support, >... [17:55:17] 6Operations, 10ops-codfw: ms-be2016.codfw.wmnet: slot=0 dev=sdi failed - https://phabricator.wikimedia.org/T126630#2032220 (10Papaul) p:5Triage>3Normal [17:55:33] 6Operations, 10ops-codfw: ms-be2017.codfw.wmnet: dev=sdm failed - https://phabricator.wikimedia.org/T127089#2032221 (10Papaul) p:5Triage>3Normal [17:57:21] 6Operations, 6Security-Team: Use user-specific passwords for accessing EventLogging database - https://phabricator.wikimedia.org/T120532#2032234 (10jcrespo) [17:57:23] 6Operations, 10DBA, 5Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2032235 (10jcrespo) [17:57:33] 6Operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#2032240 (10Joe) @robh the specifications seem neat, we will probably need to refresh those hosts in two years right? That seems reasonable anyways. [17:57:52] 6Operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#2032242 (10Ottomata) Those are beefier than the conf100xs, so they will certainly do just fine. However, since they are out of warrantee, it might be better to just... 
[18:00:02] 6Operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#2032258 (10Joe) @ottomata given these are consistent distributed systems it's ok to use servers that are out of warranty on the premise that we'll replace them in th... [18:00:04] yurik gwicke cscott arlolra subbu: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160216T1800). [18:00:58] 6Operations, 10Traffic, 5Patch-For-Review: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2032264 (10BBlack) I re-ran the parsing script over the exact same input data as the last results, with finer-grained detail on the 4-12h range (I had captured the output at an intermed... [18:01:20] 6Operations, 10Deployment-Systems, 6Performance-Team, 10Traffic, 5Patch-For-Review: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#2032268 (10Krinkle) Rollout to group1 and group2 is blocked on Apache config being... [18:01:26] puppet swat over, only one patch in the queue, so that was that [18:02:45] greg-g: would it be ok if i quickly touch and sync a particular js file in wikibase? [18:02:53] doit [18:02:57] to fix https://phabricator.wikimedia.org/T127095 [18:02:58] k [18:03:57] !log krinkle@tin Synchronized php-1.27.0-wmf.13/includes/mime.types: Fix .htc static (T99096) (duration: 00m 58s) [18:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:04:01] 6Operations, 10Deployment-Systems, 6Performance-Team, 10Traffic, 5Patch-For-Review: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#2032287 (10Stashbot) {nav icon=file, name=Mentioned in SAL, href=https://tools.wmfl... 
[18:06:10] !log aude@tin Synchronized php-1.27.0-wmf.13/extensions/Wikidata/extensions/Wikibase/repo/resources/dataTypes/wikibase.dataTypeStore.js: Try to fix T127095 (duration: 00m 57s) [18:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:07:10] 6Operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#2032295 (10RobH) a:5Ottomata>3RobH [18:10:12] either i need to wait for caching or need to touch other files [18:11:44] mutante, do you have opinions about how to move ahead with https://gerrit.wikimedia.org/r/#/c/270026/ .. looks like it is a puppet patch either way. [18:12:54] (03CR) 10BBlack: [C: 031] Don't serve HiDPI thumbs on mobile web [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270793 (https://phabricator.wikimedia.org/T119797) (owner: 10Ori.livneh) [18:13:16] (03PS3) 10Ori.livneh: Don't serve HiDPI thumbs on mobile web [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270793 (https://phabricator.wikimedia.org/T119797) [18:13:27] (03CR) 10Ori.livneh: [C: 032] Don't serve HiDPI thumbs on mobile web [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270793 (https://phabricator.wikimedia.org/T119797) (owner: 10Ori.livneh) [18:13:40] think i found the problem and shall have a patch at swat [18:13:56] (03Merged) 10jenkins-bot: Don't serve HiDPI thumbs on mobile web [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270793 (https://phabricator.wikimedia.org/T119797) (owner: 10Ori.livneh) [18:14:42] (03PS1) 10Krinkle: mediawiki: Apply public-wiki-rewrites.incl to www.mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/271013 [18:15:25] !log ori@tin Synchronized wmf-config/mobile.php: I5818f0350925: Don't serve HiDPI thumbs on mobile web (duration: 00m 58s) [18:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:15:46] thcipriani: did my new patch miss the train? 
Should I add it to the evening window? [18:15:59] (03CR) 10Krinkle: "@Giuseppe: Any particular reason 0c901c8ebe didn't use the include for www.mediawiki.org, and others mentioned at https://phabricator.wiki" [puppet] - 10https://gerrit.wikimedia.org/r/271013 (owner: 10Krinkle) [18:16:25] bmansurov: yes, SWAT's done, next SWAT will be evening. [18:16:42] thcipriani: ok thanks [18:16:50] 6Operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#2032338 (10RobH) a:5RobH>3mark After the above discussion on task, and an IRC discussion with both @ottomata and @joe, we have the following summary: The kafka... [18:17:21] (03PS1) 10Ottomata: Add eventbus topic config for revision_create [puppet] - 10https://gerrit.wikimedia.org/r/271014 [18:22:47] subbu: is there actually a problem with running that script since i changed the permissions? (if you do _not_ run as root/with sudo)? [18:23:32] mutante, that script also restarts services which arlo and scott reuqire sudo for. [18:23:33] re: Tim's comment about needing an account in wikidev, the users already were in wikidev afaict [18:23:46] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [18:24:10] so, i cannot remove sudo from running the script update_parsoid.sh => the git pull also runs as a sudo root [18:24:23] reqerrors are not spiking [18:24:27] subbu: but the users alreayd have sudo rules for restarting the service? [18:25:07] (03CR) 10Ottomata: [C: 032] Add eventbus topic config for revision_create [puppet] - 10https://gerrit.wikimedia.org/r/271014 (owner: 10Ottomata) [18:25:27] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [18:25:44] subbu: so they could deploy as wikidev and then restart the service as root, just that it won't work in a single script. right? [18:26:15] right. 
so, you are suggesting that i update the script .. https://github.com/wikimedia/operations-puppet/blob/production/modules/parsoid/files/parsoid_testing.update_parsoid.sh [18:26:28] or rather you are suggesting i delete that script and have them run different commands. [18:26:38] git pull without sudo and restart services with sudo. [18:27:03] i wasn't at the suggestion phase yet, but i think so, yea [18:27:07] but, that feels a bit more cumbersome and potentially error prone since if parsoid is not restarted before the clients are restarted, the test results till that error is caught will be incorrect. [18:27:22] i prefer having a single script that gets all the steps in the right order. [18:27:30] (03PS2) 10Andrew Bogott: Added config files for Openstack Liberty [puppet] - 10https://gerrit.wikimedia.org/r/270891 [18:28:15] (03PS1) 10Andrew Bogott: Don't configure a secondary labs salt master on new labs instances. [puppet] - 10https://gerrit.wikimedia.org/r/271019 (https://phabricator.wikimedia.org/T126580) [18:28:43] subbu: how about changing the script to just put "sudo" in front of the service restart command [18:29:09] subbu: and then running the script as regular user [18:29:50] sure .. i can do that. i'll have to submit a different puppet patch then. [18:30:31] 6Operations, 10Traffic, 5Patch-For-Review: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2032419 (10BBlack) next upload datapoint is this. This is an esams backend instance (pulls from eqiad, gets requests from esams frontends): ``` Total: 704980 1s+: 100.00% 1m+: 99.95... [18:30:35] is there a reason why you prefer not running the existing script with sudo root? [18:30:43] (03CR) 10Andrew Bogott: [C: 032] Don't configure a secondary labs salt master on new labs instances. 
[puppet] - 10https://gerrit.wikimedia.org/r/271019 (https://phabricator.wikimedia.org/T126580) (owner: 10Andrew Bogott) [18:35:23] moritzm: fyi https://gerrit.wikimedia.org/r/#/c/270960/ we got bit again [18:35:29] the 'resolve' function in ferm [18:36:26] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:37:57] apergos: there hasn't been any activity on https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=794565 so far... [18:38:16] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:38:26] I saw the report earlier when I was hunting around [18:38:29] disappointing [18:38:32] but when did that happen, was mx* only resolvable via ipv6 for a while? [18:39:14] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=762726 is also really sad [18:39:41] no only via ipv4 right now [18:39:46] on phab [18:40:06] until a) that patch gets merged and b) we have puppet running again on that box [18:40:09] (03PS1) 10Volans: icinga: Authorize myself (Volans) [puppet] - 10https://gerrit.wikimedia.org/r/271022 (https://phabricator.wikimedia.org/T126431) [18:40:18] ah, ok [18:40:25] only resolvable with ipv4 since forever, for phab I guess [18:41:03] oh sadness [18:41:19] I'll review later on, off for dinner # [18:41:23] yeah that's a depressing response all right on that report [18:41:29] enjoy! [18:42:58] (03CR) 10BryanDavis: "Does this fix the problem caused by I311f42ff0785bfdd92676b40cfdba8caa195a8fc ?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270985 (https://phabricator.wikimedia.org/T125946) (owner: 10Bmansurov) [18:44:15] (03CR) 10Bmansurov: "I'm not sure. Where can I read the errors that the patch caused?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/270985 (https://phabricator.wikimedia.org/T125946) (owner: 10Bmansurov) [18:44:56] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2032467 (10ori) Traffic on the mc10* boxes is not balanced: {F3364548 size=full} [18:46:50] (03CR) 10Dzahn: [C: 031] icinga: Authorize myself (Volans) [puppet] - 10https://gerrit.wikimedia.org/r/271022 (https://phabricator.wikimedia.org/T126431) (owner: 10Volans) [18:47:57] (03CR) 10Subramanya Sastry: "Looks like Dzahn prefers:" [puppet] - 10https://gerrit.wikimedia.org/r/270026 (owner: 10Subramanya Sastry) [18:52:00] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2032487 (10ori) Some config differences: ```lang=diff --- mc1014.conf 2016-02-16 10:50:19.000000000 -0800 +++ mc1005.conf 2016-02-16 10:50:34.000000000 -0800 @@ -1,... [18:52:11] 6Operations, 10Security-Reviews, 7Surveys: Re-evaluate Limesurvey - https://phabricator.wikimedia.org/T109606#2032489 (10egalvezwmf) Thanks @chasemp for pasting that - I missed that somehow in the previous conversation. I want to keep this task open for now and in the backlog on the #surveys board assigned... [18:52:22] 6Operations, 10Security-Reviews, 7Surveys: Re-evaluate Limesurvey - https://phabricator.wikimedia.org/T109606#2032490 (10egalvezwmf) a:3egalvezwmf [18:52:49] (03CR) 10Dzahn: [C: 032] "matches "sn" in LDAP for the wikitech user" [puppet] - 10https://gerrit.wikimedia.org/r/271022 (https://phabricator.wikimedia.org/T126431) (owner: 10Volans) [18:57:10] grrrit-wm seems likely to have died in response to upload->merge of https://gerrit.wikimedia.org/r/#/c/271024/ [18:57:34] maybe the long commitmsg line length (sorry, that was from gerrit's revert editor), or the TN#X reference in it? 
[19:00:55] * hashar !log tin: checking out mw 1.27.0-wmf.14 [19:01:07] !log tin: checking out mw 1.27.0-wmf.14 [19:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:01:19] (03CR) 10BBlack: "I mostly did this for consistency (tired of having greps work differently per-dc in DNS zonefile, etc). If someone had a plan long ago to" [dns] - 10https://gerrit.wikimedia.org/r/270285 (owner: 10BBlack) [19:01:44] 6Operations, 10ops-codfw, 10Traffic: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2032508 (10emailbot) **`UnitedLayer Support Ticket System`** replied via email on `Tue, 16 Feb 2016 10:59:45 -0800` `Re: [UnitedLayer #118704] SF8 - Wikimedia: PDU nic failur... [19:03:20] (03PS1) 10Volans: icinga: Matches contact name with LDAP sn [puppet] - 10https://gerrit.wikimedia.org/r/271026 (https://phabricator.wikimedia.org/T126431) [19:04:24] (03CR) 10Dzahn: [C: 032] icinga: Matches contact name with LDAP sn [puppet] - 10https://gerrit.wikimedia.org/r/271026 (https://phabricator.wikimedia.org/T126431) (owner: 10Volans) [19:07:38] (03PS1) 10CSteipp: Add authmanager events to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271028 [19:08:57] (03PS1) 10Ori.livneh: Revert "Test HTML stripping in production mobile beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271029 [19:09:49] (03PS2) 10CSteipp: Add authmanager events to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271028 [19:12:29] (03PS2) 10Ori.livneh: Revert "Test HTML stripping in production mobile beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271029 [19:12:57] (03CR) 10Ori.livneh: [C: 032] Revert "Test HTML stripping in production mobile beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271029 (owner: 10Ori.livneh) [19:13:13] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 53.85% of data above the critical threshold [5000000.0] 
[19:13:32] (03Merged) 10jenkins-bot: Revert "Test HTML stripping in production mobile beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271029 (owner: 10Ori.livneh) [19:16:41] !log ori@mira Synchronized wmf-config/InitialiseSettings.php: I2926f73b78fa0: Revert "Test HTML stripping in production mobile beta" (duration: 01m 29s) [19:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:17:43] 6Operations, 10Education-Program-Dashboard: Spike: What do we have to package to run the Programs and Events dashboard on production? - https://phabricator.wikimedia.org/T126295#2032593 (10Dzahn) >>! In T126295#2010421, @awight wrote: > On WMF Labs, https://wikitech.wikimedia.org/wiki/Nova_Resource:Globaleduca... [19:18:32] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0] [19:20:20] 6Operations, 10Education-Program-Dashboard: Spike: What do we have to package to run the Programs and Events dashboard on production? - https://phabricator.wikimedia.org/T126295#2032599 (10Dzahn) >>! In T126295#2010434, @dduvall wrote: > They're looking to migrate their codebase to Rails 5, which requires Ruby... [19:21:52] (03CR) 10Dzahn: [C: 031] mediawiki: Remove dead-end redirect at /stats for chapter wikis [puppet] - 10https://gerrit.wikimedia.org/r/270788 (owner: 10Krinkle) [19:24:04] 6Operations, 10Education-Program-Dashboard: Spike: What do we have to package to run the Programs and Events dashboard on production? - https://phabricator.wikimedia.org/T126295#2032601 (10dduvall) >>! In T126295#2032593, @Dzahn wrote: >>>! In T126295#2010421, @awight wrote: >> On WMF Labs, https://wikitech.wi... [19:24:35] (03CR) 10Dzahn: "I would appreciate comments from other ops members on this one." 
[puppet] - 10https://gerrit.wikimedia.org/r/270026 (owner: 10Subramanya Sastry) [19:25:32] 6Operations, 10Education-Program-Dashboard: Spike: What do we have to package to run the Programs and Events dashboard on production? - https://phabricator.wikimedia.org/T126295#2032603 (10dduvall) >>! In T126295#2032599, @Dzahn wrote: >>>! In T126295#2010434, @dduvall wrote: >> They're looking to migrate thei... [19:27:57] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2032613 (10hashar) Per discussion with Ori I have added to the Grafana board at https://grafana.wikimedia.org/dashboard/db/t126700 the 75 percentile of metric `hhvm.... [19:29:09] (03CR) 10Alex Monk: [C: 031] mediawiki: Remove dead-end redirect at /stats for chapter wikis [puppet] - 10https://gerrit.wikimedia.org/r/270788 (owner: 10Krinkle) [19:30:29] (03CR) 1020after4: [C: 031] phabricator: allow port 25 for ipv6 too [puppet] - 10https://gerrit.wikimedia.org/r/270960 (https://phabricator.wikimedia.org/T127053) (owner: 10ArielGlenn) [19:30:37] (03PS4) 10Dzahn: Puppetise yhsm-daemon [puppet] - 10https://gerrit.wikimedia.org/r/270728 (owner: 10Muehlenhoff) [19:30:51] 6Operations, 10Education-Program-Dashboard: Spike: What do we have to package to run the Programs and Events dashboard on production? - https://phabricator.wikimedia.org/T126295#2032637 (10dduvall) [19:30:56] 6Operations, 10Education-Program-Dashboard: Spike: What do we have to package to run the Programs and Events dashboard on production? - https://phabricator.wikimedia.org/T126295#2032639 (10Ragesoss) >>! In T126295#2032603, @dduvall wrote: >>>! In T126295#2032599, @Dzahn wrote: >>>>! In T126295#2010434, @dduval... 
[19:31:14] (03PS1) 10Hashar: multiversion: updateBranchPointers miss a require [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271035 [19:32:27] (03CR) 10ArielGlenn: "no sense in actually merging this til puppet is enabled again on the box btw" [puppet] - 10https://gerrit.wikimedia.org/r/270960 (https://phabricator.wikimedia.org/T127053) (owner: 10ArielGlenn) [19:32:42] 6Operations, 6Phabricator, 5Patch-For-Review: iridium / phabricator not accepting email via smtp - https://phabricator.wikimedia.org/T127053#2032646 (10Dzahn) >>! In T127053#2031318, @ArielGlenn wrote: > https://gerrit.wikimedia.org/r/#/c/270960/ why did this not show up on the ticket when the task number i... [19:33:11] (03CR) 10Hashar: "Apparently causes checkoutMediaWiki to die with:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268813 (owner: 10Krinkle) [19:33:41] (03PS3) 10Dzahn: phabricator: allow port 25 for ipv6 too [puppet] - 10https://gerrit.wikimedia.org/r/270960 (https://phabricator.wikimedia.org/T127053) (owner: 10ArielGlenn) [19:34:03] (03CR) 10Krinkle: [C: 032] multiversion: updateBranchPointers miss a require [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271035 (owner: 10Hashar) [19:34:22] (03CR) 10Dzahn: "eh, so why is puppet disabled?" [puppet] - 10https://gerrit.wikimedia.org/r/270960 (https://phabricator.wikimedia.org/T127053) (owner: 10ArielGlenn) [19:34:44] (03Merged) 10jenkins-bot: multiversion: updateBranchPointers miss a require [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271035 (owner: 10Hashar) [19:35:03] 6Operations, 6Phabricator, 5Patch-For-Review: iridium / phabricator not accepting email via smtp - https://phabricator.wikimedia.org/T127053#2032649 (10mmodell) From what I can remember, iridium hasn't had full ipv6 support until recently. I guess it's taken more work due to it being a somewhat special case... 
[19:36:01] (03PS1) 10Ottomata: Fix unless command on oozie_mysql_create_schema exec [puppet/cdh] - 10https://gerrit.wikimedia.org/r/271038 [19:36:22] PROBLEM - Disk space on labvirt1008 is CRITICAL: DISK CRITICAL - free space: /var/lib/nova/instances 90495 MB (3% inode=99%) [19:36:25] (03CR) 10Dzahn: "was about to merge, but not doing it because puppet is disabled with "(Reason: 'more testing needed before we can enable puppet for phab " [puppet] - 10https://gerrit.wikimedia.org/r/270960 (https://phabricator.wikimedia.org/T127053) (owner: 10ArielGlenn) [19:36:34] (03CR) 10Ottomata: [C: 032] Fix unless command on oozie_mysql_create_schema exec [puppet/cdh] - 10https://gerrit.wikimedia.org/r/271038 (owner: 10Ottomata) [19:37:34] (03PS1) 10Ottomata: Update cdh module with oozie db create exec fix [puppet] - 10https://gerrit.wikimedia.org/r/271039 [19:37:43] (03CR) 10Dzahn: [C: 031] Puppetise yhsm-daemon [puppet] - 10https://gerrit.wikimedia.org/r/270728 (owner: 10Muehlenhoff) [19:37:47] (03PS2) 10Ottomata: Update cdh module with oozie db create exec fix [puppet] - 10https://gerrit.wikimedia.org/r/271039 [19:37:54] (03CR) 10Ottomata: [C: 032 V: 032] Update cdh module with oozie db create exec fix [puppet] - 10https://gerrit.wikimedia.org/r/271039 (owner: 10Ottomata) [19:38:50] !log hashar@tin Synchronized multiversion/updateBranchPointers: Missing require_once MWWikiversions (duration: 00m 57s) [19:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:39:22] (03PS5) 10Dzahn: Puppetise yhsm-daemon [puppet] - 10https://gerrit.wikimedia.org/r/270728 (owner: 10Muehlenhoff) [19:41:03] (03CR) 10ArielGlenn: "Yes, having just the one command in the script run with root privileges is better. 
And it can use the user privileges that already exist i" [puppet] - 10https://gerrit.wikimedia.org/r/270026 (owner: 10Subramanya Sastry) [19:41:19] andrewbogott: hmm, space alert for labvirt1008 [19:41:39] (03CR) 10Dzahn: [C: 032] "new setup on auth1001/2001" [puppet] - 10https://gerrit.wikimedia.org/r/270728 (owner: 10Muehlenhoff) [19:42:38] yuvipanda: I will look... [19:43:24] (03PS2) 10Dzahn: Remove Wikimedia Foundation English blog from cs.planet [puppet] - 10https://gerrit.wikimedia.org/r/270483 (owner: 10Dereckson) [19:43:53] (03CR) 10Dzahn: [C: 032] Remove Wikimedia Foundation English blog from cs.planet [puppet] - 10https://gerrit.wikimedia.org/r/270483 (owner: 10Dereckson) [19:44:28] andrewbogott: thanks [19:45:06] is Chris really still on duty or should the topic be updated? [19:45:10] (03CR) 10BryanDavis: "The error reported to Logstash was:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270985 (https://phabricator.wikimedia.org/T125946) (owner: 10Bmansurov) [19:45:12] who is on duty this week? [19:46:00] 6Operations, 7Icinga: improve icinga performance / solve general load issues on neon - https://phabricator.wikimedia.org/T85222#2032682 (10Southparkfan) I am for sure not the most experienced person here, but I'll see where I can help. So, I took a look at http://docs.icinga.org/latest/en/tuning.html. It has... [19:46:43] PROBLEM - puppet last run on auth1001 is CRITICAL: CRITICAL: puppet fail [19:48:11] ^ yea, i'll take care of that one [19:48:20] it's a new setup and what i just merged earlier [19:50:22] (03PS4) 10Alex Monk: New group/right/protection level for the English Wikipedia: establishededitor (?) 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/270660 (https://phabricator.wikimedia.org/T126607) [19:50:56] (03PS2) 10Dzahn: Remove 404 face from wikipediste cs.planet entry [puppet] - 10https://gerrit.wikimedia.org/r/270487 (owner: 10Dereckson) [19:51:13] (03CR) 10Dzahn: [C: 032] "it's not 404, actually it's 303 See Other, but anyways.." [puppet] - 10https://gerrit.wikimedia.org/r/270487 (owner: 10Dereckson) [19:53:44] (03PS1) 10Ottomata: Configure Analytics Cluster in beta deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/271044 (https://phabricator.wikimedia.org/T109859) [19:54:59] (03PS2) 10Ottomata: Configure Analytics Cluster in beta deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/271044 (https://phabricator.wikimedia.org/T109859) [19:57:13] (03CR) 10Ottomata: [C: 032] Configure Analytics Cluster in beta deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/271044 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata) [19:57:37] (03PS4) 10Krinkle: phabricator: allow port 25 for ipv6 too [puppet] - 10https://gerrit.wikimedia.org/r/270960 (https://phabricator.wikimedia.org/T127053) (owner: 10ArielGlenn) [19:57:55] 6Operations, 10ops-codfw, 10Traffic: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2032729 (10emailbot) **`Rob Halsell`** replied via email on `Tue, 16 Feb 2016 11:57:31 -0800` `Re: [UnitedLayer #118704] SF8 - Wikimedia: PDU nic failure` > Tomorrow, 2016-0... [19:58:43] apergos: meta data should be in the footer (similar to email/git/http heades but the other way around). 
So previously "Bug:" was in the body, it's now in teh footer with my amend [19:58:49] which makes it recognised by various systems [19:59:05] (03PS1) 10Subramanya Sastry: ruthenium: Updated update_parsoid.sh to run it as regular user [puppet] - 10https://gerrit.wikimedia.org/r/271047 [19:59:15] (03CR) 10Cenarium: [C: 031] New group/right/protection level for the English Wikipedia: establishededitor (?) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270660 (https://phabricator.wikimedia.org/T126607) (owner: 10Alex Monk) [19:59:48] Krinkle: ah ha [19:59:53] that must be it, thanks [20:00:05] hashar: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160216T2000). [20:00:18] apergos: maybe you or your editor are in the habit of appending an extra new line, which then interacts badly with the autocommit hook that adds the change id [20:00:30] I do in fact add that line for readability [20:00:36] so I can not do that [20:00:41] :) [20:00:45] (03PS3) 10Dzahn: Remove 404 face from wikipediste cs.planet entry [puppet] - 10https://gerrit.wikimedia.org/r/270487 (owner: 10Dereckson) [20:00:59] thanks for that tip, very helpful [20:01:49] 6Operations, 10ops-codfw, 10Traffic: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2032740 (10RobH) [20:03:28] (03Abandoned) 10Subramanya Sastry: sudo: Run update_parsoid.sh as root, not parsoid [puppet] - 10https://gerrit.wikimedia.org/r/270026 (owner: 10Subramanya Sastry) [20:06:40] (03Abandoned) 10Dzahn: purge webrequest logs after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/197081 (https://phabricator.wikimedia.org/T83531) (owner: 10ArielGlenn) [20:07:24] (03CR) 10Dzahn: [C: 031] "should be done with another restbase deploy / people from services around" [puppet] - 10https://gerrit.wikimedia.org/r/270481 (https://phabricator.wikimedia.org/T126832) (owner: 10Dereckson) [20:09:04] (03PS4) 
10Hashar: Enable Kartographer ext in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270441 (https://phabricator.wikimedia.org/T114820) (owner: 10Yurik) [20:09:46] (03CR) 10Hashar: [C: 032] Enable Kartographer ext in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270441 (https://phabricator.wikimedia.org/T114820) (owner: 10Yurik) [20:09:49] (03PS2) 10Dzahn: mediawiki: add texlive-generic-extra [puppet] - 10https://gerrit.wikimedia.org/r/270322 (https://phabricator.wikimedia.org/T126422) (owner: 10Hashar) [20:10:27] (03Merged) 10jenkins-bot: Enable Kartographer ext in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270441 (https://phabricator.wikimedia.org/T114820) (owner: 10Yurik) [20:11:21] AaronSchulz, sorry, I am doing a crappy job, but how with the unit tests you get the idea. 10000 apologies in advance [20:11:33] s/how/hope/ [20:12:22] just saw the new ps :) [20:12:28] (03CR) 10GWicke: [C: 031] RESTBase configuration for pt.wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/270481 (https://phabricator.wikimedia.org/T126832) (owner: 10Dereckson) [20:13:14] 6Operations, 6Phabricator, 5Patch-For-Review: iridium / phabricator not accepting email via smtp - https://phabricator.wikimedia.org/T127053#2032824 (10Dzahn) The change above looks good and i would have merged it but i did not because puppet is disabled in iridium, with: (Reason: 'more testing needed before... 
[20:13:44] (03CR) 10Dzahn: [C: 032] mediawiki: add texlive-generic-extra [puppet] - 10https://gerrit.wikimedia.org/r/270322 (https://phabricator.wikimedia.org/T126422) (owner: 10Hashar) [20:14:39] !log restarted phd on iridium to kick-start the outbound email queue [20:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:15:27] (03CR) 10Dzahn: "that sounds like a good plan, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/170130 (owner: 10Cscott) [20:16:03] 6Operations, 10ops-codfw, 10Traffic: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2032842 (10RobH) a:5RobH>3BBlack I'm reassigning this from myself to @bblack (and adding the #netops project tag for the traffic move.) [20:19:15] (03PS3) 10Bmansurov: Enable survey at reduced sample rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270985 (https://phabricator.wikimedia.org/T125946) [20:19:32] (03CR) 10Bmansurov: "Now it does" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270985 (https://phabricator.wikimedia.org/T125946) (owner: 10Bmansurov) [20:20:22] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:22] (03PS1) 10Chad: Remove old expired branches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271052 [20:22:13] (03CR) 10Chad: [C: 032] Remove old expired branches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271052 (owner: 10Chad) [20:22:31] !log demon@tin Started scap: removing expired wmf.9 branch [20:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:22:41] (03Merged) 10jenkins-bot: Remove old expired branches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271052 (owner: 10Chad) [20:22:59] jynus: the strcmp looks wrong though. Can you add some tests comparing a coordinate like 9999 to 10000? Also, it seems like the code already just looks at the integers and not the hostname. 
The PS12 failure is interesting. If the file number becomes lower after switchover, that will be problematic for that method. Maybe it could return null if it can't tell [20:22:59] (e.g. host change) and the masterPosWait() caller can special case that by returning or something. [20:23:41] no, it seializes it [20:23:57] then converts it int a new arry [20:24:26] ah, the hostname, true [20:24:45] AaronSchulz, that is exactly what is happening to us [20:25:06] race condition, maybe [20:25:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:26:16] but I have 10 job runners still waiting for db1024 log positions [20:26:31] when the master is db1018 for a week already [20:27:24] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [20:27:42] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [20:29:28] (03PS1) 10Dzahn: yubiauth: fix include of yhsm daemon class [puppet] - 10https://gerrit.wikimedia.org/r/271055 [20:29:42] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [20:30:20] (03CR) 10Dzahn: [C: 032] yubiauth: fix include of yhsm daemon class [puppet] - 10https://gerrit.wikimedia.org/r/271055 (owner: 10Dzahn) [20:30:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:34:38] (03PS1) 10Dzahn: yubiauth: move role to modules/role/ [puppet] - 10https://gerrit.wikimedia.org/r/271057 [20:35:03] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
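[editor's note] The point above about strcmp and the request for a test comparing 9999 to 10000 comes down to comparing coordinates as integers rather than as strings. The sketch below is an illustration only, not the MediaWiki code under review. It assumes a position string of the form "host-bin.FILE/OFFSET", modelled on the "dbOLD-bin.X1/Y1" notation used in this discussion. The names `parse_pos` and `has_reached` are hypothetical. Returning None on a host change mirrors the "return null if it can't tell" suggestion.

```python
import re
from typing import Optional, Tuple

# Assumed format: "<host>-bin.<file number>/<offset>", e.g. "db1018-bin.000002/9999".
_POS_RE = re.compile(r"^(?P<host>.+)-bin\.(?P<file>\d+)/(?P<offset>\d+)$")


def parse_pos(pos: str) -> Tuple[str, int, int]:
    """Split a position string into (host, file number, offset)."""
    match = _POS_RE.match(pos)
    if match is None:
        raise ValueError(f"unrecognised position: {pos!r}")
    return match["host"], int(match["file"]), int(match["offset"])


def has_reached(current: str, target: str) -> Optional[bool]:
    """True/False if the positions are comparable, None if the host differs."""
    cur_host, cur_file, cur_offset = parse_pos(current)
    tgt_host, tgt_file, tgt_offset = parse_pos(target)
    if cur_host != tgt_host:
        # Different master: the coordinates cannot be compared safely.
        return None
    # Integer tuple comparison: file number first, then offset.
    return (cur_file, cur_offset) >= (tgt_file, tgt_offset)
```

With string comparison, "10000" sorts before "9999" because "1" is less than "9"; the integer comparison avoids that.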
[20:35:05] 6Operations, 6Phabricator, 5Patch-For-Review: iridium / phabricator not accepting email via smtp - https://phabricator.wikimedia.org/T127053#2032943 (10ArielGlenn) That's why I commented on the changeset to not merge it. Puppet will remain disabled over there til this https://gerrit.wikimedia.org/r/#/c/2695... [20:35:13] jynus: so the first question I had was: if the pos is dbOLD-bin.X1/Y1 and the new master has pos dbNEW-bin.X2/Y2 after switch over, is (X2,Y2) *always* >= (X1,Y1)? I assume so but need to be 100% sure. If so, we can change masterPosWait() to fudge the binlog name in the MASTER_POS_WAIT query to the new host. If not, I guess it can just bail but return true. [20:35:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:35:53] jynus: if there are a few days of jobs in queue the serialized master positions enqueued as part of the job's definition can linger around for days. [20:36:03] (03PS1) 10BBlack: Revert "Revert "ttl_fixed: limit to tier-1 backends"" [puppet] - 10https://gerrit.wikimedia.org/r/271058 [20:36:10] (03PS2) 10BBlack: Revert "Revert "ttl_fixed: limit to tier-1 backends"" [puppet] - 10https://gerrit.wikimedia.org/r/271058 [20:36:15] that's probably why old master names keep showing up in MASTER_POS_WAIT [20:36:17] (03CR) 10BBlack: [C: 032 V: 032] Revert "Revert "ttl_fixed: limit to tier-1 backends"" [puppet] - 10https://gerrit.wikimedia.org/r/271058 (owner: 10BBlack) [20:36:47] (as opposed to delay in LoadBalancer noticing config changes) [20:37:49] 6Operations, 10Traffic, 5Patch-For-Review: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2032965 (10BBlack) (note I've edited some of my cache_upload commentary above to remove questions/mysteries that turned out to mostly be my own braindeadness) [20:39:08] (03PS1) 10Dzahn: yubiauth: fix class name to use underscore [puppet] - 10https://gerrit.wikimedia.org/r/271059 [20:40:14] (03PS2) 
10Dzahn: yubiauth: move role to modules/role/ [puppet] - 10https://gerrit.wikimedia.org/r/271057 [20:40:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:40:22] (03CR) 10Dzahn: [C: 032] yubiauth: move role to modules/role/ [puppet] - 10https://gerrit.wikimedia.org/r/271057 (owner: 10Dzahn) [20:41:09] (03PS2) 10Dzahn: yubiauth: fix class name to use underscore [puppet] - 10https://gerrit.wikimedia.org/r/271059 [20:43:28] _joe_: I'm wondering if (going forward) we should name our deploy masters instead of using the local dc's misc naming scheme. So we'd have something like deploy1001 & deploy2001, etc. We did it with bastions last go round... [20:44:02] It'd allow us to spin up new ones a tad easier since the naming scheme would be predictable. [20:44:25] (03PS1) 10Ottomata: Use project-$labsproject in labs for /var/log/refinery [puppet] - 10https://gerrit.wikimedia.org/r/271062 (https://phabricator.wikimedia.org/T109859) [20:45:07] (03CR) 10Dzahn: [V: 032] "it hurts but i have to V+2 as well then" [puppet] - 10https://gerrit.wikimedia.org/r/271057 (owner: 10Dzahn) [20:45:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:45:39] (03PS2) 10Ottomata: Use project-$labsproject in labs for /var/log/refinery [puppet] - 10https://gerrit.wikimedia.org/r/271062 (https://phabricator.wikimedia.org/T109859) [20:45:47] (03CR) 10Dzahn: [C: 032] yubiauth: fix class name to use underscore [puppet] - 10https://gerrit.wikimedia.org/r/271059 (owner: 10Dzahn) [20:45:56] (03CR) 10Dzahn: [V: 032] yubiauth: fix class name to use underscore [puppet] - 10https://gerrit.wikimedia.org/r/271059 (owner: 10Dzahn) [20:46:29] 6Operations, 10Wikimedia-Etherpad: Specific etherpad link hangs then shows JavaScript error - https://phabricator.wikimedia.org/T126379#2032973 (10MBinder_WMF) Is this task in the appropriate place and condition to get attention? 
Happy to prioritize, tag, etc., to expedite resolution. Even if we just extracted... [20:48:11] (03PS3) 10Ottomata: Use project-$labsproject in labs for /var/log/refinery [puppet] - 10https://gerrit.wikimedia.org/r/271062 (https://phabricator.wikimedia.org/T109859) [20:48:20] (03CR) 10Ottomata: [C: 032 V: 032] Use project-$labsproject in labs for /var/log/refinery [puppet] - 10https://gerrit.wikimedia.org/r/271062 (https://phabricator.wikimedia.org/T109859) (owner: 10Ottomata) [20:49:01] 6Operations, 6Parsing-Team, 10hardware-requests: Dedicated server for running Parsoid's roundtrip tests to get reliable parse latencies and use as perf. benchmarking tests - https://phabricator.wikimedia.org/T116090#2032982 (10ssastry) We now have a bare metal labs hardware that we'll configure for use for v... [20:50:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:50:35] !log demon@tin Finished scap: removing expired wmf.9 branch (duration: 28m 03s) [20:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:50:40] 6Operations, 10Wikimedia-Etherpad: Specific etherpad link hangs then shows JavaScript error - https://phabricator.wikimedia.org/T126379#2032985 (10Dzahn) > We use the pad weekly and it has archives of information critical to our Tuesday meetings. A comment separate from this specific task, but: Please do not... [20:54:37] (03CR) 10Thcipriani: [C: 031] "One minor nitpick, fixes the problem of no output on failure, which is the main thing." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/270902 (https://phabricator.wikimedia.org/T110407) (owner: 10Hashar) [20:55:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:55:39] (03PS2) 10Muehlenhoff: Add ferm rule for eventlogging zmq forwarder service [puppet] - 10https://gerrit.wikimedia.org/r/270915 [20:56:03] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add ferm rule for eventlogging zmq forwarder service [puppet] - 10https://gerrit.wikimedia.org/r/270915 (owner: 10Muehlenhoff) [20:56:54] (03PS2) 10Muehlenhoff: Add ferm rule for eventlogging mediawiki exception/fatal relay [puppet] - 10https://gerrit.wikimedia.org/r/270920 (https://phabricator.wikimedia.org/T113343) [20:57:12] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add ferm rule for eventlogging mediawiki exception/fatal relay [puppet] - 10https://gerrit.wikimedia.org/r/270920 (https://phabricator.wikimedia.org/T113343) (owner: 10Muehlenhoff) [21:00:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:00:51] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. 
[21:01:17] 6Operations: Some labvirt systems use qemu from "cloud archive" (which doesn't get security support) - https://phabricator.wikimedia.org/T127113#2033017 (10MoritzMuehlenhoff) 3NEW [21:05:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:10:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:14:22] RECOVERY - Disk space on labvirt1008 is OK: DISK OK [21:15:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:19:35] (03PS1) 10Chad: Remove live-1.5 symlink to w/ directory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271072 [21:20:14] Pretty long commit summary to remove a symlink :p [21:20:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:20:37] heh, last time I touched this, I undeployed Wikipedia ;) [21:21:46] How long ago was that? [21:21:49] (03CR) 1020after4: "I think this one is ready to merge. :)" [puppet] - 10https://gerrit.wikimedia.org/r/269560 (https://phabricator.wikimedia.org/T114363) (owner: 1020after4) [21:21:54] long enough [21:23:39] those Icinga alerts about fundraising [21:23:43] those are real issues [21:23:47] and the fr wiki is gone [21:24:46] what? 
[21:24:59] (03PS11) 1020after4: make scap::target use the scap3 package provider [puppet] - 10https://gerrit.wikimedia.org/r/269560 (https://phabricator.wikimedia.org/T114363) [21:25:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:27:18] * twentyafterfour can't even [21:27:39] OuKB: The p/ could probably also go too [21:27:42] it's being discussed over in -fundraising [21:27:54] ok good [21:28:29] OuKB: But probably impossible to grep for :p [21:30:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:32:21] wmf-deployment seems unused too [21:32:56] eh, there's not much "discussion" heh [21:33:11] i called Jeff [21:33:16] he is at an airport [21:33:28] we need one of the Chris' [21:33:33] mutante: s/discussion/pinging the right people/ ;) [21:33:34] mutante: yeah sorry, i can't even get to that box [21:35:22] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:37:23] <_joe_> ostriches: I was thinking the same this morning [21:40:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:40:46] (03PS1) 10Jdlrobson: Strip references for experimentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271074 (https://phabricator.wikimedia.org/T126390) [21:45:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:46:35] (03CR) 10Jforrester: Strip references for experimentation (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271074 (https://phabricator.wikimedia.org/T126390) (owner: 10Jdlrobson) [21:47:33] 6Operations, 10ops-codfw: es2011-es2019 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2033171 (10Papaul) @jcrespo. The onsite setup for es20[1-9][0-9] is complete. As you requested for Volans to do a full install, he can work on es2019. 
@Volans let me know if you have any questio... [21:47:59] (03CR) 10Jdlrobson: Strip references for experimentation (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271074 (https://phabricator.wikimedia.org/T126390) (owner: 10Jdlrobson) [21:48:02] (03CR) 10Jforrester: Strip references for experimentation (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271074 (https://phabricator.wikimedia.org/T126390) (owner: 10Jdlrobson) [21:48:22] jdlrobson: :-) [21:48:57] (03PS2) 10Jdlrobson: Strip references for experimentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271074 (https://phabricator.wikimedia.org/T126390) [21:50:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:42] Can I get a root to `chgrp -R mwdeploy /srv/mediawiki-staging/.git/objects && chown -R g+w /srv/mediawiki-staging/.git/objects` [21:52:03] ostriches: where tin? [21:52:13] Ah yes, sorry [21:52:16] or 'where, tin?' [21:52:18] sure [21:53:23] is chown supposed to be chmod maybe? [21:53:26] before I run this [21:54:38] Yes. [21:54:43] I gotcha :) [21:55:01] !log "chgrp -R mwdeploy /srv/mediawiki-staging/.git/objects && chmod -R g+w /srv/mediawiki-staging/.git/objects" on tin [21:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:55:13] Thanks chasemp :) [21:55:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:55:33] is that still an issue? [21:55:50] people were reporting problems there early but after a second run the problem would clear up [21:55:51] earlier [21:56:32] idk mutante, do you need help getting ahold of FR? 
I am under the impression they are aware of this but maybe not [21:56:57] ostriches: [21:57:02] 6Operations: Some labvirt systems use qemu from "cloud archive" (which doesn't get security support) - https://phabricator.wikimedia.org/T127113#2033227 (10Andrew) Is there an upstream bug with the cloud archive? [21:57:14] chasemp: nobody who has access to the server is online right now [21:58:11] apergos: you meant the chgrp on tin there? [21:58:13] that's pretty troubling, is katie in the office? maybe someone can track her down [21:58:19] mutante: yeah [21:58:27] apergos: It happens when people accidentally do stuff with git in mw-config or MW directories as root. [21:58:32] apergos: yea, we keep repeating that, i'm wondering too [21:58:41] Honest mistake, but it ends up with git owning a bunch of dirs [21:58:43] :) [21:58:47] chasemp: it has been tried but unsuccesfully [21:58:50] root owning a bunch of dirs [21:59:09] yeah maybe they jut got their particular dirs sorted out after a retry or something [21:59:21] all right well hopefully that's the end of it for awhile [21:59:31] chasemp: Jeff will be back on it in a couple hours, that's the best we got [21:59:47] mutante: is there a sense of user impact? I don't know what the deal is [22:00:00] has anyone tried to actually call katie's cell? [22:00:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:01:11] cwd: ^ [22:01:46] git -C /srv/patches/ config core.sharedRepository true [22:01:54] I don't see that check defined in puppet. fr puppet I guess? [22:02:24] Ugh. 
[22:02:42] I still can't write on g+w when it's root:mwdeploy [22:02:57] !log tin/mira : git -C /srv/patches/ config core.sharedRepository true [22:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:03:13] chasemp: i'm not sure if "fr puppet" is something that exists [22:03:31] well they have their own puppet repo yeah [22:03:40] but not one that adds icinga checks to neon [22:03:49] unless via NSCA [22:03:58] I don't know where their icinga checks live [22:04:02] hmm.. it must be somewhere in our repo [22:04:28] I just want to know what is failing to make a better call on it, it sure seems like it's not mission critical [22:04:31] hashar: You sure we want true? [22:04:53] it is shared between folks under /srv/patches isn't it? [22:05:00] chasemp: it's the wiki on payments.wikimedia.org, i dont know who uses it [22:05:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:05:36] (03PS1) 10Subramanya Sastry: ruthenium: Clone the parsoid repo with 0775 mode [puppet] - 10https://gerrit.wikimedia.org/r/271082 [22:05:38] ostriches: ideally we would also want an internal / private git servers to hold the various private stuff we have. And make sure it refuses non fast forward pushes :D [22:05:56] hashar: We should probably use "all" instead. [22:06:09] That'll make sure they're all world-readable but group-writable. [22:06:13] mutante: ok I see you in -fundraising thanks [22:06:27] chasemp: the potential for user facing impact would be payments wiki going down [22:06:41] but i have not been able to observe that [22:06:57] i have no idea what the load balancer setup is but maybe it depooled? [22:09:01] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: puppet fail [22:10:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:13:33] cwd: have you tried katie's phone # in the contact list? 
(I"m hoping that's a cell?) [22:14:02] i haven't but i can [22:14:13] please [22:15:19] thanks cwd [22:15:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:17:01] wow her message is incredible [22:17:16] anyway i had to leave a message but i asked her to hop on here when she gets it [22:17:30] (03PS1) 10Dzahn: yubiauth: fix inclusion from modules/role/ [puppet] - 10https://gerrit.wikimedia.org/r/271130 [22:17:31] ok well that is the best that can be done as far as getting her on [22:17:32] thank you [22:18:02] np, i'm sorry we have no useful protocol here [22:18:15] I got no access to fr stuff, mostly ops does not [22:18:26] which is why we're in this crap situation [22:18:29] yeah afaik jeff is the only one [22:18:31] the worst case, Jeff will be back on it in a couple hours he said [22:18:33] once he gets there [22:18:37] i'm sure it has to do with insane PCI rules [22:18:38] I guess that's how it is [22:18:47] it seems so far we can survive waiting a bit [22:18:50] you'll want to figure that out in the wake of this I guess [22:19:22] yep, we at least need some big red buttons [22:19:35] (03PS1) 10Papaul: ADD es201[1-9] to Dhcpd Bug:T126006 [puppet] - 10https://gerrit.wikimedia.org/r/271131 (https://phabricator.wikimedia.org/T126006) [22:19:58] make sure someone else on the fr team has the keys and the know how [22:20:02] so there's always someone accessible [22:20:13] I guess this stuff is easy in hindsight [22:20:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:23:06] Are there changes to OAuth that are on test.wikipedia.org but that have not hit production yet? 
[22:24:32] (03PS2) 10Papaul: ADD es201[1-8] to Dhcpd Bug:T126006 [puppet] - 10https://gerrit.wikimedia.org/r/271131 (https://phabricator.wikimedia.org/T126006) [22:25:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:25:31] (03PS2) 10Dzahn: yubiauth: fix inclusion from modules/role/ [puppet] - 10https://gerrit.wikimedia.org/r/271130 [22:25:37] (03PS1) 10Hoo man: Add me (hoo) to the wdqs icinga contact group [puppet] - 10https://gerrit.wikimedia.org/r/271132 [22:25:41] (03CR) 10Dzahn: [C: 032] yubiauth: fix inclusion from modules/role/ [puppet] - 10https://gerrit.wikimedia.org/r/271130 (owner: 10Dzahn) [22:25:48] reached awight, he confirmed only jeff has access to the servers [22:25:57] ^ easy one ;) [22:26:41] I wonder what pci has to say about bus factor [22:29:46] Hi! [22:30:00] apergos: yeah, I think cmjohnson is his second [22:30:01] hey [22:30:13] cmjohnson is unavailable too [22:30:17] k [22:30:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:30:25] next time: two planes :) [22:30:31] so cwd tried to reach katie [22:30:40] by phone. no joy. she's not on irc right now [22:30:51] at this point we are "jeff will be on in 2 hours" [22:30:53] She has no special access there [22:30:55] either [22:30:56] any better ideas? [22:31:08] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/271130 (owner: 10Dzahn) [22:31:16] (03PS1) 10Yuvipanda: dynamicproxy: Fix pointless typo [puppet] - 10https://gerrit.wikimedia.org/r/271133 [22:31:18] (03PS1) 10Yuvipanda: labstore: Do not try to create accounts on labsdb1002 [puppet] - 10https://gerrit.wikimedia.org/r/271134 [22:31:24] apergos: Do you know anything about the load balancer? Has it depooled 1001? 
[22:31:35] I know nothing about fr whatsoever [22:31:44] afaik I have no access to any of it [22:31:49] dang [22:31:58] that's pci rules [22:32:00] apergos: I thought all the opsen had access... [22:32:07] i've been combing logs on indium and not seeing anything worrying [22:32:09] wouldn't one of the DC ops be able to get in? [22:32:43] cmjohnson [22:32:43] cwd: If there's any chance that donors are hitting the dead machine, u should take campaigns down... [22:32:48] also not available, Krenair [22:33:05] ah, yes, sorry, didn't read up far enough clearly :) [22:33:11] 's fine [22:33:38] awight: are there even any campaigns right now? [22:33:55] good to hear it :) [22:34:15] i will ask around [22:34:23] cool [22:34:30] basically it's all you guys til jeff shows :-D [22:34:32] cwd: I suppose it wouldn't help to put the other payments boxes into maintenance mode. [22:34:32] sorry... [22:34:35] (03CR) 10Yuvipanda: [C: 032 V: 032] dynamicproxy: Fix pointless typo [puppet] - 10https://gerrit.wikimedia.org/r/271133 (owner: 10Yuvipanda) [22:34:49] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Do not try to create accounts on labsdb1002 [puppet] - 10https://gerrit.wikimedia.org/r/271134 (owner: 10Yuvipanda) [22:34:56] cwd: but, someone should remove payments1001 from the redis pool, in LocalSettings [22:35:18] kk [22:35:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:35:33] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [22:36:25] (03PS3) 10Dzahn: yubiauth: fix inclusion from modules/role/ [puppet] - 10https://gerrit.wikimedia.org/r/271130 [22:37:00] (03CR) 10Dzahn: [V: 032] "unfortunately i have to V+2 as well because otherwise i cant get anything merged" [puppet] - 10https://gerrit.wikimedia.org/r/271130 (owner: 10Dzahn) [22:37:12] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - 
create-dbusers is active [22:38:22] would recommend: a) document who has server access b) make a) match the users in the fundraising-tech-ops group on phab c) add icinga-wm notifications to fundraising channel (only for the FR services, we do the same for others,like wikidata,analytics) d) add icinga contacts for people in a), so they get mail and/or SMS [22:38:55] ragesoss: no, testwiki is part of production (or at least getting deployed the same way) [22:40:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:40:44] (03CR) 10Chad: [C: 031] "Can go whenever" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/263000 (owner: 10Anomie) [22:40:48] tgr: thanks, and nevermind. (I meant, not yet deployed to non-test wikis.) I was having an issue with OAuth on test wiki where I'd click 'Allow' but the page would just refresh and throw up the Allow dialog again. But it seems to have been something transient, possibly related to browser state. Not happening any longer. [22:41:48] anomie: Which branch did ApiSandbox go into core in? 
[22:44:19] ostriches: last weeks [22:44:21] ostriches: wmf.13 [22:44:41] Ah ok [22:44:55] also man is it nice :) [22:44:59] I thought it was earlier :) [22:45:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:46:56] (03CR) 10Chad: [C: 032] Ensure /proc/cpuinfo exists before read it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265619 (owner: 10Dereckson) [22:47:40] (03Merged) 10jenkins-bot: Ensure /proc/cpuinfo exists before read it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265619 (owner: 10Dereckson) [22:49:10] mutante, I don't know what's wrong with jenkins, the usual suspects (build queue length, disk space) aren't at issue here [22:49:13] need an expert [22:49:48] Everything still not group owned by mwdeploy :( [22:50:09] eg: -r--r--r-- 1 root root 1051 Feb 16 22:48 eb05ce51e4616a463b1f909c21054db2d8d5f6 in .git/objects/c2 [22:50:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:50:35] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2033451 (10ori) I tried to identify shifts in memcached usage trends by examining memcached-keys.log. To control for the re-imaging process, I only considered log li... [22:51:06] 6Operations, 10ops-codfw, 10Traffic: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2033453 (10emailbot) **`UnitedLayer Support Ticket System`** replied via email on `Tue, 16 Feb 2016 14:51:01 -0800` `Re: [UnitedLayer #118704] SF8 - Wikimedia: PDU nic failur... [22:51:19] ostriches: which host is it [22:51:24] tin? [22:51:32] Yep [22:51:33] https://phabricator.wikimedia.org/P2628 :( [22:52:02] all right lemme go do the thing [22:52:04] again [22:52:17] It seems there is a big queue at https://integration.wikimedia.org/zuul/ because rake-jessie is not working. hashar. 
[22:52:56] https://integration.wikimedia.org/ci/?auto_refresh=false [22:53:06] here it seems like there's not really [22:53:13] got a text from Jeff, he is trying to get on [22:53:18] are we out of nodepool slaves? [22:53:24] * legoktm looks [22:53:52] yay mutante [22:54:25] Feb 16 22:52:09 labnodepool1001 nodepoold[1596]: JenkinsException: Error in request.Possibly authentication failed [500] [22:54:33] ostriches: what group should own those things? [22:54:47] legoktm: ugh [22:54:48] mwdeploy for all in .git? [22:55:00] or all in .git/objects? [22:55:12] it seems jessie is down legoktm. It looks like all tests that are for jessie are not working. Such as integration/config, mediawiki/core and other tests [22:55:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:56:29] paladox: it's building more slaves as we speak, just have to wait a bit [22:56:30] !log contint: Nodepool instances pool exhausted [22:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:56:42] legoktm: Oh ok. Thanks. [22:56:48] thanks paladox, hashar, legoktm [22:57:09] ostriches: I chgrp -R mwdeploy .git/objects [22:57:20] the stuff in .git othe rthan that has group wikidev [22:57:21] (03PS1) 10Dzahn: yubiauth: adjust class name after move to role module [puppet] - 10https://gerrit.wikimedia.org/r/271139 [22:57:23] is that ok? [22:58:19] !log restbase deploy start of 6f69a30 [22:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:58:47] apergos: They also need g+w [22:58:53] eg: [22:58:56] demon@tin /srv/mediawiki-staging (master)$ ls -la .git/objects/c2/eb05ce51e4616a463b1f909c21054db2d8d5f6 [22:58:56] -r--r--r-- 1 root mwdeploy 1051 Feb 16 22:48 .git/objects/c2/eb05ce51e4616a463b1f909c21054db2d8d5f6 [22:58:57] objects or everything? 
[22:59:15] Technically everything, but objects is what's messed up [22:59:39] did objects [23:00:04] Eh, modules too because....I hate submodules [23:00:04] .git/rest is wikidev but all group writable [23:00:05] so [23:00:08] * ostriches stabs git a few times [23:00:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:00:38] did +w for modules [23:00:51] Ok wtf is going on with you git. [23:00:52] error: insufficient permission for adding an object to repository database .git/objects [23:01:21] I left them as wikidev group though [23:01:24] Errr, or should it all be wikidev and not mwdeploy. [23:01:29] hah [23:01:30] * ostriches cries a little [23:01:32] ok changing em back [23:02:02] you're in wikidev [23:02:07] so I guess that's the right group [23:02:08] done [23:02:23] ostriches: try again [23:02:48] !log Nodepool can not authenticate with Jenkins anymore. Thus it can not add slaves it spawned. [23:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:02:58] There we go [23:03:04] fixed? [23:03:27] 6Operations, 10ops-codfw, 10Traffic: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2033469 (10BBlack) Will depool at ~9AM Pacific, 17:00 UTC. [23:04:09] apparently since you now own ORIG_HEAD [23:04:34] !log demon@tin Synchronized w/health-check.php: Ensure procinfo exists (duration: 00m 58s) [23:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:04:50] legoktm: Phpunit is not working in https://integration.wikimedia.org/ci/job/mwext-testextension-php55-non-voting/4/console [23:04:52] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. 
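[editor's note] The "insufficient permission for adding an object" error above is the kind of failure that appears when directories under .git/objects are not writable by the deploying group. A minimal sketch for spotting such directories before retrying is shown below. It only checks the group-write bit and does not check which group owns the directory. The function name `dirs_missing_group_write` is hypothetical, and this is not a tool that exists on tin.

```python
import os
import stat
from typing import List


def dirs_missing_group_write(root: str) -> List[str]:
    """Return every directory at or below root that lacks the g+w bit."""
    missing = []
    for dirpath, _dirnames, _filenames in os.walk(root):
        mode = os.stat(dirpath).st_mode
        if not mode & stat.S_IWGRP:
            missing.append(dirpath)
    return sorted(missing)
```

Pointed at a path such as /srv/mediawiki-staging/.git/objects, an empty result would suggest the chmod -R g+w step took effect for directories; group ownership would still need a separate check.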
[23:04:57] (03PS1) 10Chad: Pre-deploy wmf.14 to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271143 [23:05:21] PROBLEM - check_apache2 on payments1001 is CRITICAL: PROCS CRITICAL: 256 processes with command name apache2 [23:05:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:06:53] (03CR) 10Chad: [C: 032] Pre-deploy wmf.14 to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271143 (owner: 10Chad) [23:06:54] paladox: sorry, not sure about that. I can look in a bit. But only the non-voting job is broken? Have you filed a bug for it? [23:07:18] (03Merged) 10jenkins-bot: Pre-deploy wmf.14 to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271143 (owner: 10Chad) [23:08:48] (03PS1) 10Jforrester: Speed trials: Add mobile and desktop versions with OOjs UI core loaded [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271144 (https://phabricator.wikimedia.org/T127125) [23:08:51] legoktm: Hi nope. But i will now. And yes ive only found the non voting jobs are broken. Only the hhvm one i found worked. [23:08:59] !log demon@tin Started scap: pre-deploying wmf.14 to testwiki to prime l10n cache [23:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:09:54] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2033486 (10hashar) [23:10:13] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2021279 (10hashar) I have created an experimental grafana board for Nutcracker. Might help find some issue with memcached https://grafana.wikimedia.org/dashboard/db/... 
[23:10:21] PROBLEM - check_apache2 on payments1001 is CRITICAL: PROCS CRITICAL: 257 processes with command name apache2
[23:10:21] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:11:33] PROBLEM - Disk space on labservices1001 is CRITICAL: DISK CRITICAL - free space: / 340 MB (3% inode=81%)
[23:11:41] RECOVERY - check_apache2 on payments1001 is OK: PROCS OK: 6 processes with command name apache2
[23:11:41] RECOVERY - check_payments_wiki on payments1001 is OK: HTTP OK: HTTP/1.1 200 OK - 249 bytes in 0.049 second response time
[23:11:56] legoktm: It works now after a recheck.
[23:12:12] !log restbase deploy end of 6f69a30
[23:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:14:03] apergos: cwd: see the RECOVERYs above.. that was Jeff restarting the webserver
[23:14:08] ahhh
[23:14:13] ori: ^^
[23:14:23] I was just typing the "where is he" message :-)
[23:14:27] mutante: whew, thanks!
[23:15:15] mobrovac: restbase train ? https://gerrit.wikimedia.org/r/#/c/270481/ :)
[23:15:54] (PS1) BBlack: Revert "Revert "Revert "ttl_fixed: limit to tier-1 backends""" [puppet] - https://gerrit.wikimedia.org/r/271146
[23:16:01] (PS2) BBlack: Revert "Revert "Revert "ttl_fixed: limit to tier-1 backends""" [puppet] - https://gerrit.wikimedia.org/r/271146
[23:16:09] (CR) BBlack: [C: 2 V: 2] Revert "Revert "Revert "ttl_fixed: limit to tier-1 backends""" [puppet] - https://gerrit.wikimedia.org/r/271146 (owner: BBlack)
[23:16:13] mutante: lgtm want to do it now?
[23:16:37] mobrovac: i'm not personally invested, i just said it should probably wait for the next restbase deploy
[23:16:52] mobrovac: because the restarts
[23:17:19] mutante: yeah, let's do it now
[23:17:19] !log Jenkins accepting slave creations again. Root cause is /var/lib/jenkins/config-history/nodes/ has reached the 32k inode limit.
[23:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:17:25] mobrovac: cool
[23:17:36] (PS2) Dzahn: RESTBase configuration for pt.wikimedia [puppet] - https://gerrit.wikimedia.org/r/270481 (https://phabricator.wikimedia.org/T126832) (owner: Dereckson)
[23:17:42] (CR) Dzahn: [C: 2] RESTBase configuration for pt.wikimedia [puppet] - https://gerrit.wikimedia.org/r/270481 (https://phabricator.wikimedia.org/T126832) (owner: Dereckson)
[23:18:28] (CR) Dzahn: [V: 2] RESTBase configuration for pt.wikimedia [puppet] - https://gerrit.wikimedia.org/r/270481 (https://phabricator.wikimedia.org/T126832) (owner: Dereckson)
[23:19:07] mutante: let me know once it's landed on the puppet master
[23:19:33] mobrovac: it just did
[23:19:39] kk
[23:19:48] * mobrovac forcing puppet in prod
[23:22:46] Operations, Editing-Department, Performance-Team, Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2033550 (ori)
[23:24:41] (PS2) Dzahn: ADD mgmt DNS entries for es201[1-9] Bug:T126006 [dns] - https://gerrit.wikimedia.org/r/270325 (https://phabricator.wikimedia.org/T126006) (owner: Papaul)
[23:24:47] (CR) Dzahn: [C: 2] ADD mgmt DNS entries for es201[1-9] Bug:T126006 [dns] - https://gerrit.wikimedia.org/r/270325 (https://phabricator.wikimedia.org/T126006) (owner: Papaul)
[23:24:57] (CR) Dzahn: "recheck" [dns] - https://gerrit.wikimedia.org/r/270325 (https://phabricator.wikimedia.org/T126006) (owner: Papaul)
[23:26:26] inodes! ouch
[23:26:29] Operations, ops-codfw: es2011-es2019 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2033577 (Dzahn) mgmt entries have been added to DNS https://gerrit.wikimedia.org/r/#/c/270325/
[23:28:11] (PS3) Dzahn: keyholder: fix lint, indentation [puppet] - https://gerrit.wikimedia.org/r/269611
[23:30:02] (CR) Dzahn: [C: 2] keyholder: fix lint, indentation [puppet] - https://gerrit.wikimedia.org/r/269611 (owner: Dzahn)
[23:30:18] (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/269611 (owner: Dzahn)
[23:30:20] * apergos signs off. tah
[23:31:39] (PS1) Dduvall: Program Dashboard configuration for initial labs rollout [puppet] - https://gerrit.wikimedia.org/r/271149 (https://phabricator.wikimedia.org/T105967)
[23:33:09] mutante: {{done}}
[23:33:29] mobrovac_: great, thx
[23:33:45] np
[23:35:17] (PS2) Dzahn: haproxy: fix 12 lint warnings [puppet] - https://gerrit.wikimedia.org/r/269905
[23:35:43] (CR) Dzahn: [C: 2] "alignment only, but fixes compiler warnings" [puppet] - https://gerrit.wikimedia.org/r/269905 (owner: Dzahn)
[23:36:02] (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/269905 (owner: Dzahn)
[23:36:06] (PS2) Dduvall: Program Dashboard configuration for initial labs rollout [puppet] - https://gerrit.wikimedia.org/r/271149 (https://phabricator.wikimedia.org/T105967)
[23:39:03] (CR) jenkins-bot: [V: -1] Program Dashboard configuration for initial labs rollout [puppet] - https://gerrit.wikimedia.org/r/271149 (https://phabricator.wikimedia.org/T105967) (owner: Dduvall)
[23:39:36] !log demon@tin Finished scap: pre-deploying wmf.14 to testwiki to prime l10n cache (duration: 30m 36s)
[23:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:40:23] (PS2) Dzahn: scap: fix lint warnings, alignment [puppet] - https://gerrit.wikimedia.org/r/269906
[23:41:05] (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/269906 (owner: Dzahn)
[23:48:01] (CR) Dzahn: [C: 2] scap: fix lint warnings, alignment [puppet] - https://gerrit.wikimedia.org/r/269906 (owner: Dzahn)