[00:40:57] (CR) Legoktm: [C: 1] wiki replicas: Add spamblacklist to allowed log types [puppet] - https://gerrit.wikimedia.org/r/418710 (https://phabricator.wikimedia.org/T184483) (owner: BryanDavis) [01:32:58] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/references/{title} (retrieve structured reference data for the Cat article on English Wikipedia) is WARNING: Test retrieve structured reference data for the Cat article on English Wikipedia responds with unexpected value at path /reference_lists[1]/id = [01:43:38] PROBLEM - cassandra-a SSL 10.64.16.188:7001 on praseodymium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [01:43:39] PROBLEM - cassandra-a CQL 10.64.16.188:9042 on praseodymium is CRITICAL: connect to address 10.64.16.188 and port 9042: Connection refused [01:43:49] PROBLEM - Restbase root url on praseodymium is CRITICAL: connect to address 10.64.16.149 and port 7231: Connection refused [01:50:48] PROBLEM - HHVM jobrunner on mw1302 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [01:51:28] PROBLEM - Nginx local proxy to apache on mw1302 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.005 second response time [01:51:48] RECOVERY - HHVM jobrunner on mw1302 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.056 second response time [01:52:28] RECOVERY - Nginx local proxy to apache on mw1302 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.007 second response time [02:52:14] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.24) (duration: 11m 56s) [02:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:26:18] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 767.18 seconds [04:05:39] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 290.42 seconds [04:12:33]
Operations, Analytics, netops: Replace eventlog1001's IP with eventlog1002's in analytics-in4 - https://phabricator.wikimedia.org/T189408#4041596 (ayounsi) a:ayounsi 1st change applied. Waiting for confirmation for the 2nd. [04:14:55] (CR) Liuxinyu970226: [C: 1] Switch public wikis to explicit Flow usage definition [mediawiki-config] - https://gerrit.wikimedia.org/r/416217 (https://phabricator.wikimedia.org/T188812) (owner: Nemo bis) [04:22:45] (PS1) Gergő Tisza: Enable Wikidata description override on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/418843 (https://phabricator.wikimedia.org/T184000) [04:36:47] (CR) Liuxinyu970226: [C: 1] Disable Flow extension on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/408073 (https://phabricator.wikimedia.org/T186463) (owner: Zoranzoki21) [05:37:17] Operations, Analytics, Traffic, HTTPS: Update documentation for "https" field in X-Analytics - https://phabricator.wikimedia.org/T188807#4041656 (Tbayer) [05:40:35] Operations, Analytics, Traffic, HTTPS: Update documentation for "https" field in X-Analytics - https://phabricator.wikimedia.org/T188807#4041659 (Tbayer) @BBlack Thanks again! Back to the task at hand: I have tentatively updated the documentation based on my understanding of your remarks: https:/... [05:45:31] (CR) Brian Wolff: "Please be advised, gerrit is neither a democracy nor a place to make political decisions, and your +1's here are meaningless."
[mediawiki-config] - https://gerrit.wikimedia.org/r/408073 (https://phabricator.wikimedia.org/T186463) (owner: Zoranzoki21) [05:52:16] (CR) MZMcBride: [C: 1] Disable Flow extension on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/408073 (https://phabricator.wikimedia.org/T186463) (owner: Zoranzoki21) [06:12:09] Operations, Ops-Access-Requests: Requesting access to terbium.eqiad.wmnet for bmansurov - https://phabricator.wikimedia.org/T189285#4041667 (bmansurov) [06:19:31] (PS1) Marostegui: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/418846 (https://phabricator.wikimedia.org/T187089) [06:21:07] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1103:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/418846 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui) [06:22:22] (Merged) jenkins-bot: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/418846 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui) [06:22:37] (CR) jenkins-bot: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/418846 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui) [06:25:00] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1103:3314 for alter table (duration: 01m 06s) [06:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:28] !log Deploy schema change on db1103:3314 - T187089 T185128 T153182 [06:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:35] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [06:27:35] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [06:27:36] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags -
https://phabricator.wikimedia.org/T185128 [06:37:03] (CR) Marostegui: [C: 1] Add support to global query execution limit [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/418593 (owner: Jcrespo) [06:47:28] (PS1) Marostegui: db-eqiad.php: Pool db1113:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/418847 (https://phabricator.wikimedia.org/T184161) [06:48:33] ACKNOWLEDGEMENT - Restbase root url on cerium is CRITICAL: connect to address 10.64.16.147 and port 7231: Connection refused Giuseppe Lavagetto being decommissioned. [06:48:33] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.16.153:9042 on cerium is CRITICAL: connect to address 10.64.16.153 and port 9042: Connection refused Giuseppe Lavagetto being decommissioned. [06:48:33] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.16.153:7001 on cerium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Giuseppe Lavagetto being decommissioned. [06:48:33] ACKNOWLEDGEMENT - Restbase root url on praseodymium is CRITICAL: connect to address 10.64.16.149 and port 7231: Connection refused Giuseppe Lavagetto being decommissioned. [06:48:33] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.16.188:9042 on praseodymium is CRITICAL: connect to address 10.64.16.188 and port 9042: Connection refused Giuseppe Lavagetto being decommissioned. [06:48:33] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.16.188:7001 on praseodymium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Giuseppe Lavagetto being decommissioned. [06:48:33] ACKNOWLEDGEMENT - Restbase root url on xenon is CRITICAL: connect to address 10.64.0.200 and port 7231: Connection refused Giuseppe Lavagetto being decommissioned. [06:48:34] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.0.202:9042 on xenon is CRITICAL: connect to address 10.64.0.202 and port 9042: Connection refused Giuseppe Lavagetto being decommissioned.
[06:48:34] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.0.202:7001 on xenon is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Giuseppe Lavagetto being decommissioned. [06:49:03] (CR) Marostegui: [C: 2] db-eqiad.php: Pool db1113:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/418847 (https://phabricator.wikimedia.org/T184161) (owner: Marostegui) [06:50:13] (Merged) jenkins-bot: db-eqiad.php: Pool db1113:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/418847 (https://phabricator.wikimedia.org/T184161) (owner: Marostegui) [06:50:27] (CR) jenkins-bot: db-eqiad.php: Pool db1113:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/418847 (https://phabricator.wikimedia.org/T184161) (owner: Marostegui) [06:54:37] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Pool db1113:3315 as vslow,dump in s5 - T184161 (duration: 00m 58s) [06:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:43] T184161: Productionize 2 new eqiad database servers - https://phabricator.wikimedia.org/T184161 [07:04:05] (PS1) Marostegui: db-eqiad.php: Pool db1113:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/418848 (https://phabricator.wikimedia.org/T184161) [07:06:44] (PS2) Marostegui: db-eqiad.php: Pool db1113:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/418848 (https://phabricator.wikimedia.org/T184161) [07:08:39] (CR) Marostegui: [C: 2] db-eqiad.php: Pool db1113:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/418848 (https://phabricator.wikimedia.org/T184161) (owner: Marostegui) [07:08:49] (CR) Chad: [V: 2 C: 2] Update bazlets to upstream (includes fix for python 3) [software/gerrit/plugins/wikimedia] - https://gerrit.wikimedia.org/r/417713 (owner: Paladox) [07:09:28] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
.*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site [07:09:37] (Merged) jenkins-bot: db-eqiad.php: Pool db1113:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/418848 (https://phabricator.wikimedia.org/T184161) (owner: Marostegui) [07:09:38] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:09:51] (CR) jenkins-bot: db-eqiad.php: Pool db1113:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/418848 (https://phabricator.wikimedia.org/T184161) (owner: Marostegui) [07:10:39] <_joe_> uhm what the heck is happening on gerrit2001? [07:10:55] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Pool db1113:3316 as vslow,dump in s6 - T184161 (duration: 00m 58s) [07:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:59] T184161: Productionize 2 new eqiad database servers - https://phabricator.wikimedia.org/T184161 [07:11:53] _joe_ I think that the downtime expired, iirc there is a task about gerrit2001 [07:12:12] yeah, that is expired I think too [07:12:30] https://phabricator.wikimedia.org/T176532 [07:12:34] https://phabricator.wikimedia.org/T176532 [07:12:34] :) [07:12:48] * marostegui 1 - elukey 0 (cumin) [07:12:55] <_joe_> Active: failed (Result: exit-code) since Mon 2018-03-12 07:07:07 UTC; 4min 9s ago [07:13:25] <_joe_> so this system is broken since october? [07:13:27] <_joe_> wow [07:14:40] Operations, ops-eqiad, DBA, Patch-For-Review: Rack and setup db1113 and db1114 - https://phabricator.wikimedia.org/T182896#4041731 (Marostegui) [07:15:15] * elukey celebrates marostegui's victory (cumin cumin) [07:16:16] <_joe_> I gather you're adding "cumin" to your sentences just to annoy volans since he has a highlight on "cumin"?
[07:18:32] Operations, Availability (Multiple-active-datacenters), DC-Switchover-Prep-Q3-2016-17, Epic: Prepare and improve the datacenter switchover procedure - https://phabricator.wikimedia.org/T154658#4041733 (Joe) [07:18:42] Operations, Availability (Multiple-active-datacenters), Patch-For-Review, Performance-Team (Radar), and 4 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#4041732 (Joe) Open>Resolved [07:20:37] (PS1) Marostegui: db-codfw.php: Depool es2014 [mediawiki-config] - https://gerrit.wikimedia.org/r/418849 [07:22:18] (CR) Marostegui: [C: 2] db-codfw.php: Depool es2014 [mediawiki-config] - https://gerrit.wikimedia.org/r/418849 (owner: Marostegui) [07:23:29] (Merged) jenkins-bot: db-codfw.php: Depool es2014 [mediawiki-config] - https://gerrit.wikimedia.org/r/418849 (owner: Marostegui) [07:23:44] (CR) jenkins-bot: db-codfw.php: Depool es2014 [mediawiki-config] - https://gerrit.wikimedia.org/r/418849 (owner: Marostegui) [07:24:52] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool es2014 for kernel upgrade (duration: 00m 58s) [07:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:06] !log Stop MySQL on es2014 for kernel upgrade [07:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:50] _joe_ we wouldn't dare to do such a bad thing, you have a very low consideration of me and marostegui :D [07:27:21] <_joe_> elukey: you just happen to love cumin so much, you put it in most of your sentences [07:27:30] <_joe_> not just in your food [07:27:39] exactly [07:27:52] * elukey sends wikilove to volans [07:29:15] (PS1) Marostegui: Revert "db-codfw.php: Depool es2014" [mediawiki-config] - https://gerrit.wikimedia.org/r/418850 [07:30:48] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args
^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site [07:30:51] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational [07:32:06] Operations: Integrate stretch 9.4 point update - https://phabricator.wikimedia.org/T189435#4041738 (MoritzMuehlenhoff) [07:32:09] downtiming again gerrit2001 otherwise it will spam a lot [07:32:13] Operations: Integrate stretch 9.4 point update - https://phabricator.wikimedia.org/T189435#4041749 (MoritzMuehlenhoff) p:Triage>Normal [07:32:27] (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool es2014" [mediawiki-config] - https://gerrit.wikimedia.org/r/418850 (owner: Marostegui) [07:32:34] <_joe_> elukey: wait [07:33:02] are you working on it? [07:33:05] Operations, DBA, Gerrit, Patch-For-Review, Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532#3628916 (Joe) Is anyone working on this issue? @Dzahn @jcrespo if neither of you is wo... [07:33:07] <_joe_> elukey: did you add a comment referencing the ticket? [07:33:38] _joe_ still haven't done it, I was planning to [07:34:01] (Merged) jenkins-bot: Revert "db-codfw.php: Depool es2014" [mediawiki-config] - https://gerrit.wikimedia.org/r/418850 (owner: Marostegui) [07:34:46] Operations, DBA, Gerrit, Patch-For-Review, Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532#4041753 (Marostegui) The proxies for codfw have been budget but not yet ordered, so rig...
[07:35:41] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool es2014 for kernel upgrade (duration: 00m 59s) [07:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:03] (CR) jenkins-bot: Revert "db-codfw.php: Depool es2014" [mediawiki-config] - https://gerrit.wikimedia.org/r/418850 (owner: Marostegui) [07:38:21] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool es2014 after kernel upgrade (duration: 01m 01s) [07:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:49] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site [07:39:58] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:41:22] downtimed --^ [07:41:23] (PS1) Marostegui: db-codfw.php: Depool es2015 [mediawiki-config] - https://gerrit.wikimedia.org/r/418851 [07:43:19] (CR) Marostegui: [C: 2] db-codfw.php: Depool es2015 [mediawiki-config] - https://gerrit.wikimedia.org/r/418851 (owner: Marostegui) [07:44:31] (Merged) jenkins-bot: db-codfw.php: Depool es2015 [mediawiki-config] - https://gerrit.wikimedia.org/r/418851 (owner: Marostegui) [07:45:47] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool es2015 for kernel upgrade (duration: 00m 58s) [07:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:10] (CR) jenkins-bot: db-codfw.php: Depool es2015 [mediawiki-config] - https://gerrit.wikimedia.org/r/418851 (owner: Marostegui) [07:52:23] hi. does anybody know if there are caching problems in Europe (aka esams)?
[07:52:51] I hear multiple people complain about random slowdowns or running into errors [07:53:11] 503's, that is [07:55:37] Hi Wiki13, yep we might have the same issue that happened during the weekend, I am seeing 503s now registered as well in logstash metrics [07:55:56] okay good to know [07:57:31] thanks for the report! The traffic team is going to check very soon, will report news in here as soon as we have them [08:01:08] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site [08:01:18] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational [08:03:23] !log cp3042: set transaction_timeout to 30s [08:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:07] !log cp3042: restart varnish-be [08:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:21] !log Stop MySQL on es2015 for kernel upgrade [08:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:54] (PS1) Marostegui: Revert "db-codfw.php: Depool es2015" [mediawiki-config] - https://gerrit.wikimedia.org/r/418863 [08:19:09] (PS1) Marostegui: db1113.yaml: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/418865 [08:19:56] (CR) Marostegui: [C: 2] db1113.yaml: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/418865 (owner: Marostegui) [08:20:53] !log cp3033/cp3031: set transaction_timeout to 60s [08:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:48] (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool es2015" [mediawiki-config] - https://gerrit.wikimedia.org/r/418863 (owner: Marostegui) [08:23:03] (Merged) jenkins-bot: Revert "db-codfw.php: Depool es2015" [mediawiki-config] -
https://gerrit.wikimedia.org/r/418863 (owner: Marostegui) [08:23:14] (CR) jenkins-bot: Revert "db-codfw.php: Depool es2015" [mediawiki-config] - https://gerrit.wikimedia.org/r/418863 (owner: Marostegui) [08:24:47] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool es2015 after kernel upgrade (duration: 00m 58s) [08:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:57] !log cp3033/cp3031: restart varnish-be [08:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:19] (CR) Ayounsi: [C: 1] Fix eventlog1002's ipv6 address [dns] - https://gerrit.wikimedia.org/r/418714 (https://phabricator.wikimedia.org/T185667) (owner: Elukey) [08:39:22] (CR) WMDE-leszek: [C: 1] Enable Wikidata description override on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/418843 (https://phabricator.wikimedia.org/T184000) (owner: Gergő Tisza) [08:40:38] !log rebooting iron for kernel security update [08:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:59] (PS3) Giuseppe Lavagetto: scap_source: also execute scap deploy --init [puppet] - https://gerrit.wikimedia.org/r/389473 [08:42:58] (CR) Giuseppe Lavagetto: [C: 2] scap_source: also execute scap deploy --init (1 comment) [puppet] - https://gerrit.wikimedia.org/r/389473 (owner: Giuseppe Lavagetto) [08:44:36] Operations, ops-eqiad: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121#4041841 (akosiaris) `cache=none` tests during the weekend showed no problems. I 'll find a quiet point in time during the day and restart all VMs in cluster with that setting set. Th...
[08:46:10] <_joe_> tin's puppet failure is my fault [08:46:49] (PS1) Vgutierrez: Disable journald messages' rate limiting [debs/pybal] - https://gerrit.wikimedia.org/r/418866 (https://phabricator.wikimedia.org/T189290) [08:47:18] <_joe_> and ofc, how asinine [08:47:55] (CR) Elukey: [C: 2] Fix eventlog1002's ipv6 address [dns] - https://gerrit.wikimedia.org/r/418714 (https://phabricator.wikimedia.org/T185667) (owner: Elukey) [08:48:39] Wiki13: the situation should have improved now, thanks for your patience! [08:49:07] nice. hopefully the problem stays away now :) [08:49:25] (CR) Muehlenhoff: [C: 1] Disable journald messages' rate limiting [debs/pybal] - https://gerrit.wikimedia.org/r/418866 (https://phabricator.wikimedia.org/T189290) (owner: Vgutierrez) [08:49:47] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 43 failures. Last run 3 minutes ago with 43 failures. Failed resources (up to 3 shown): Scap_source[3d2png/deploy],Scap_source[analytics/refinery],Scap_source[changeprop/deploy],Scap_source[citoid/deploy] [08:57:42] (PS1) Marostegui: db1051,db1063: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/418867 (https://phabricator.wikimedia.org/T183469) [08:58:15] (PS2) Marostegui: db1051,db1063: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/418867 (https://phabricator.wikimedia.org/T183469) [08:58:43] (CR) Marostegui: [C: 2] db1051,db1063: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/418867 (https://phabricator.wikimedia.org/T183469) (owner: Marostegui) [08:59:00] <_joe_> vgutierrez, moritzm are we sure disabling the rate-limiting completely is a good idea? [08:59:09] <_joe_> journald resides in memory IIRC [08:59:16] PROBLEM - puppet last run on naos is CRITICAL: CRITICAL: Puppet has 43 failures. Last run 2 minutes ago with 43 failures.
Failed resources (up to 3 shown): Scap_source[3d2png/deploy],Scap_source[analytics/refinery],Scap_source[changeprop/deploy],Scap_source[citoid/deploy] [08:59:16] <_joe_> I'd rather raise the limit by 10x [08:59:53] (PS3) Giuseppe Lavagetto: systemd: add define specific to timers [puppet] - https://gerrit.wikimedia.org/r/417948 [08:59:55] (PS1) Giuseppe Lavagetto: scap_source: fix two errors [puppet] - https://gerrit.wikimedia.org/r/418868 [09:00:34] (PS2) Giuseppe Lavagetto: scap_source: fix two errors [puppet] - https://gerrit.wikimedia.org/r/418868 [09:01:22] (CR) Giuseppe Lavagetto: [C: 2] scap_source: fix two errors [puppet] - https://gerrit.wikimedia.org/r/418868 (owner: Giuseppe Lavagetto) [09:04:01] (PS1) Muehlenhoff: Depool poolcounter1002 [mediawiki-config] - https://gerrit.wikimedia.org/r/418870 [09:04:47] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [09:05:52] Operations, Analytics, netops: Replace eventlog1001's IP with eventlog1002's in analytics-in4 - https://phabricator.wikimedia.org/T189408#4041924 (elukey) [09:12:39] Operations, Analytics, netops: Replace eventlog1001's IP with eventlog1002's in analytics-in4 - https://phabricator.wikimedia.org/T189408#4041928 (elukey) Since we are doing some cleanups, I'd also like to review the following: ``` term mysql { from { destination-address { 10... [09:13:10] Operations, Analytics, netops: Review some IPs in the analytics-in4 filter - https://phabricator.wikimedia.org/T189408#4041932 (elukey) [09:17:22] Operations, ops-codfw, hardware-requests, User-Elukey: decommission mw2097-mw2134 - https://phabricator.wikimedia.org/T189111#4041939 (Joe) @Papaul thanks, doing it now!
[09:17:27] Operations, ops-codfw, hardware-requests, User-Elukey, User-Joe: decommission mw2097-mw2134 - https://phabricator.wikimedia.org/T189111#4041940 (Joe) [09:21:18] (PS1) Gehel: wdqs: reactivate kafka poller [puppet] - https://gerrit.wikimedia.org/r/418872 [09:22:01] (CR) Elukey: [C: 1] wdqs: reactivate kafka poller [puppet] - https://gerrit.wikimedia.org/r/418872 (owner: Gehel) [09:24:07] Operations, Pybal, Traffic, Patch-For-Review: Tune systemd journal rate limiting for PyBal - https://phabricator.wikimedia.org/T189290#4041969 (Vgutierrez) It looks like journald messages' rate limiting is not configurable per unit. So it needs to be done system-wide. Even worse, in Debian jessi... [09:24:18] (PS1) Gehel: Add consumer ID to Updater launch string [puppet] - https://gerrit.wikimedia.org/r/418873 (https://phabricator.wikimedia.org/T188716) [09:29:07] RECOVERY - puppet last run on naos is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:29:20] (PS1) Giuseppe Lavagetto: codfw: decommission mw2097-mw2134 [puppet] - https://gerrit.wikimedia.org/r/418874 (https://phabricator.wikimedia.org/T189111) [09:31:35] <_joe_> !log decommission mw2097-mw2134 from conftool T189111 [09:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:44] T189111: decommission mw2097-mw2134 - https://phabricator.wikimedia.org/T189111 [09:32:58] (PS1) Urbanecm: New throttle rule, clean expired rules [mediawiki-config] - https://gerrit.wikimedia.org/r/418875 (https://phabricator.wikimedia.org/T189442) [09:36:32] (CR) Gehel: [C: 2] wdqs: reactivate kafka poller [puppet] - https://gerrit.wikimedia.org/r/418872 (owner: Gehel) [09:47:08] (PS1) Gehel: wdqs: deactivate kafka poller [puppet] - https://gerrit.wikimedia.org/r/418876 [09:47:44] (CR) Gehel: [C: 2] wdqs: deactivate kafka poller [puppet] - https://gerrit.wikimedia.org/r/418876 (owner: Gehel)
[09:53:52] !log installing util-linux security updates [09:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:24] !log restart kafka mirror maker (main eqiad -> jumbo) on kafka1020 (all consumers not assigned to any partition on kafka102*) [09:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:31] gehel: --^ [09:56:32] super wierd [09:56:35] *weird [09:56:57] now metrics are recovering [09:57:07] I'll have a chat with Andrew once he'll be online [09:57:13] (CR) Giuseppe Lavagetto: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/10393/mw2134.codfw.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/418874 (https://phabricator.wikimedia.org/T189111) (owner: Giuseppe Lavagetto) [09:57:23] (PS2) Giuseppe Lavagetto: codfw: decommission mw2097-mw2134 [puppet] - https://gerrit.wikimedia.org/r/418874 (https://phabricator.wikimedia.org/T189111) [10:00:15] elukey: I'll wait for your chat before re-enabling wdqs kafka poller ...
[10:03:31] yes definitely [10:08:30] (PS6) ArielGlenn: cheap image dump script that might be ok for wikitech [dumps] - https://gerrit.wikimedia.org/r/417009 (https://phabricator.wikimedia.org/T188915) [10:08:54] (CR) Liuxinyu970226: [C: 1] "> Please be advised, gerrit is neither a democracy nor a place to" [mediawiki-config] - https://gerrit.wikimedia.org/r/408073 (https://phabricator.wikimedia.org/T186463) (owner: Zoranzoki21) [10:09:10] Operations, ops-codfw, netops: Interface errors on cr2-codfw: xe-5/3/1 - https://phabricator.wikimedia.org/T189452#4042196 (ayounsi) [10:12:15] Operations, Goal, Patch-For-Review, User-fgiunchedi, cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#4042213 (Gehel) [10:12:18] Operations, Goal, Patch-For-Review, User-fgiunchedi, cloud-services-team (Kanban): Create Prometheus exporter for wdqs-updater - https://phabricator.wikimedia.org/T182773#4042211 (Gehel) Open>Resolved Migration to prometheus is completed, dashboards have been updated and diamond / gra... [10:23:50] !log labs->cloud vlan rename in eqiad - T187933 [10:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:55] T187933: Labs to Cloud renaming for networking equipment - https://phabricator.wikimedia.org/T187933 [10:30:12] (PS2) Alexandros Kosiaris: scap::target: Install git-lfs [puppet] - https://gerrit.wikimedia.org/r/417226 (https://phabricator.wikimedia.org/T180628) [10:31:21] (PS1) Giuseppe Lavagetto: codfw: remove stale references to mw2118-9 [puppet] - https://gerrit.wikimedia.org/r/418880 (https://phabricator.wikimedia.org/T189111) [10:31:28] (CR) Alexandros Kosiaris: [C: 2] scap::target: Install git-lfs [puppet] - https://gerrit.wikimedia.org/r/417226 (https://phabricator.wikimedia.org/T180628) (owner: Alexandros Kosiaris) [10:31:37] <_joe_> you merge-sniped me!
[10:32:00] (PS2) Giuseppe Lavagetto: codfw: remove stale references to mw2118-9 [puppet] - https://gerrit.wikimedia.org/r/418880 (https://phabricator.wikimedia.org/T189111) [10:32:28] Operations, Datasets-General-or-Unknown, Dumps-Generation, hardware-requests: Give misc dump crons their own host - https://phabricator.wikimedia.org/T181936#4042349 (ArielGlenn) @RobH ping? [10:36:02] (CR) Giuseppe Lavagetto: [C: 2] codfw: remove stale references to mw2118-9 [puppet] - https://gerrit.wikimedia.org/r/418880 (https://phabricator.wikimedia.org/T189111) (owner: Giuseppe Lavagetto) [10:39:12] <_joe_> !log running decommission_appserver on mw2097-2134 T189111 [10:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:18] T189111: decommission mw2097-mw2134 - https://phabricator.wikimedia.org/T189111 [10:40:29] (CR) Filippo Giunchedi: [C: 1] Depool poolcounter1002 [mediawiki-config] - https://gerrit.wikimedia.org/r/418870 (owner: Muehlenhoff) [10:40:46] Operations, ops-codfw, hardware-requests, Patch-For-Review, and 2 others: decommission mw2097-mw2134 - https://phabricator.wikimedia.org/T189111#4042372 (Joe) [10:43:15] (PS3) Jcrespo: Initial commit of existent python scripts [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/354206 [10:43:23] (CR) Jcrespo: [V: 2 C: 2] Initial commit of existent python scripts [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/354206 (owner: Jcrespo) [10:43:41] (CR) Jcrespo: [V: 2 C: 2] wmfmariadbpy: remove labsdb1001 & labsdb1003 special behavior [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/408507 (owner: Jcrespo) [10:43:44] Operations, Cloud-Services, netops: Labs to Cloud renaming for networking equipment - https://phabricator.wikimedia.org/T187933#4042381 (ayounsi) Open>Resolved [10:43:50] (PS2) Jcrespo: wmfmariadbpy: remove labsdb1001 & labsdb1003 special behavior [software/wmfmariadbpy]
- https://gerrit.wikimedia.org/r/408507 [10:43:51] (CR) Jcrespo: [V: 2 C: 2] wmfmariadbpy: remove labsdb1001 & labsdb1003 special behavior [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/408507 (owner: Jcrespo) [10:44:04] (PS2) Jcrespo: Add support to global query execution limit [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/418593 [10:44:12] (CR) Jcrespo: [V: 2 C: 2] Add support to global query execution limit [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/418593 (owner: Jcrespo) [10:44:34] (PS2) Jcrespo: Add script for dumping and recovering database sections [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/418595 [10:44:47] (CR) Jcrespo: [V: 2 C: 2] Add script for dumping and recovering database sections [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/418595 (owner: Jcrespo) [10:45:11] (Abandoned) Filippo Giunchedi: wmflib: support segmented keys in Hiera 3 [puppet] - https://gerrit.wikimedia.org/r/415896 (https://phabricator.wikimedia.org/T188623) (owner: Filippo Giunchedi) [10:45:40] (Abandoned) Filippo Giunchedi: hiera: port nuyaml to hiera 3 [puppet] - https://gerrit.wikimedia.org/r/402346 (owner: Giuseppe Lavagetto) [10:49:56] Operations, ops-codfw, hardware-requests, Patch-For-Review, and 2 others: decommission mw2097-mw2134 - https://phabricator.wikimedia.org/T189111#4042390 (Joe) @Papaul I did the part I can do myself, I was told in the past not to do things in step 2 without coordination with dc-ops, as that messes...
[10:49:59] (03CR) 10Jcrespo: [C: 032] mariadb: Allow recovering arbitrary backups by providing a path [puppet] - 10https://gerrit.wikimedia.org/r/417876 (owner: 10Jcrespo) [10:50:09] (03PS4) 10Jcrespo: mariadb: Allow recovering arbitrary backups by providing a path [puppet] - 10https://gerrit.wikimedia.org/r/417876 [10:50:11] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review, and 2 others: decommission mw2097-mw2134 - https://phabricator.wikimedia.org/T189111#4042392 (10Joe) a:05Joe>03RobH [10:54:48] (03PS2) 10Jcrespo: compare.py: Implement parallel queries between servers [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/404472 [10:54:59] (03CR) 10Jcrespo: [V: 032 C: 032] compare.py: Implement parallel queries between servers [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/404472 (owner: 10Jcrespo) [10:55:06] (03PS6) 10Jcrespo: compare.py: Implement progress reporting, more than 2 servers comp. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/404647 [10:55:14] (03CR) 10Jcrespo: [V: 032 C: 032] compare.py: Implement progress reporting, more than 2 servers comp. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/404647 (owner: 10Jcrespo) [11:00:04] jan_drewniak: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikimedia Portals Update . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180312T1100). [11:00:05] No GERRIT patches in the queue for this window AFAICS. [11:02:07] (03PS2) 10Filippo Giunchedi: Enable grafana alerts for jobqueue-eventbus dashboard. [puppet] - 10https://gerrit.wikimedia.org/r/416740 (https://phabricator.wikimedia.org/T189038) (owner: 10Ppchelko) [11:03:03] (03CR) 10Filippo Giunchedi: [C: 032] Enable grafana alerts for jobqueue-eventbus dashboard. 
[puppet] - 10https://gerrit.wikimedia.org/r/416740 (https://phabricator.wikimedia.org/T189038) (owner: 10Ppchelko) [11:06:03] (03CR) 10Muehlenhoff: [C: 032] Depool poolcounter1002 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418870 (owner: 10Muehlenhoff) [11:08:17] (03CR) 10jenkins-bot: Depool poolcounter1002 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418870 (owner: 10Muehlenhoff) [11:09:28] (03CR) 10Filippo Giunchedi: naggen2: add support for puppetdb v4 settings and api (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/413435 (https://phabricator.wikimedia.org/T188032) (owner: 10Herron) [11:10:12] !log jmm@tin Synchronized wmf-config/ProductionServices.php: depooling poolcounter1002 for kernel security update (duration: 03m 09s) [11:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:13] !log reboot poolcounter1002 for kernel security update [11:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:00] (03PS1) 10Muehlenhoff: Revert "Depool poolcounter1002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418886 [11:22:19] (03CR) 10Muehlenhoff: [C: 032] Revert "Depool poolcounter1002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418886 (owner: 10Muehlenhoff) [11:22:35] (03CR) 10jenkins-bot: Revert "Depool poolcounter1002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418886 (owner: 10Muehlenhoff) [11:26:26] !log jmm@tin Synchronized wmf-config/ProductionServices.php: Repooling poolcounter1002 after kernel security update (duration: 03m 09s) [11:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:17] PROBLEM - puppet last run on dbproxy1005 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Package[tzdata],Package[bsdutils] [11:46:00] 10Operations, 10Traffic: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#4042645 (10faidon) [11:56:03] (03CR) 10Deskana: "Liuxinyu970226 - Your last comment is unacceptable. Please don't do it again." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408073 (https://phabricator.wikimedia.org/T186463) (owner: 10Zoranzoki21) [11:57:01] ^dbproxy1005 is a side effect of util-linux update, should recover soonish [11:59:17] RECOVERY - puppet last run on dbproxy1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:05:16] (03PS1) 10Muehlenhoff: Depool poolcounter1001 for kernel update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418892 [12:09:45] (03PS1) 10Arturo Borrero Gonzalez: apt: apt-upgrade: capture exception when creating cache [puppet] - 10https://gerrit.wikimedia.org/r/418893 (https://phabricator.wikimedia.org/T181647) [12:10:02] (03CR) 10Nikerabbit: [C: 031] ContentTranslation: Set cookieDomain to null for Production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416973 (owner: 10KartikMistry) [12:12:11] (03CR) 10Arturo Borrero Gonzalez: [C: 032] apt: apt-upgrade: capture exception when creating cache [puppet] - 10https://gerrit.wikimedia.org/r/418893 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [12:13:02] (03PS1) 10Volans: Mariadb backups: include standard [puppet] - 10https://gerrit.wikimedia.org/r/418894 [12:13:23] (03PS1) 10Arturo Borrero Gonzalez: apt: apt-upgrade: fix typo in comment for documentation [puppet] - 10https://gerrit.wikimedia.org/r/418895 (https://phabricator.wikimedia.org/T181647) [12:14:12] (03CR) 10Arturo Borrero Gonzalez: [C: 032] apt: apt-upgrade: fix typo in comment for documentation [puppet] - 10https://gerrit.wikimedia.org/r/418895 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [12:15:23] (03CR) 10Volans: 
"Compiler diff: https://puppet-compiler.wmflabs.org/compiler02/10394/es2001.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/418894 (owner: 10Volans) [12:16:31] (03CR) 10Jcrespo: "I think this needs a role name (issue) and a firewall, but I have to check if firewall is enabled on the profiles." [puppet] - 10https://gerrit.wikimedia.org/r/418894 (owner: 10Volans) [12:18:03] 10Operations, 10Cloud-VPS, 10cloud-services-team, 10hardware-requests, 10procurement: eqiad: (4) systems for CirrusSearch Elasticssearch replica service - https://phabricator.wikimedia.org/T187627#4042739 (10faidon) p:05Triage>03High @RobH, ping? Let's move on this soon, it has been waiting for too l... [12:18:38] (03CR) 10Filippo Giunchedi: [C: 031] Depool poolcounter1001 for kernel update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418892 (owner: 10Muehlenhoff) [12:20:05] (03CR) 10Hashar: [C: 04-1] "The IP are already being logged and made public via logstash https://logstash-beta.wmflabs.org/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416346 (https://phabricator.wikimedia.org/T188862) (owner: 10MarcoAurelio) [12:20:42] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Proposal for moving hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418898 (https://phabricator.wikimedia.org/T183469) [12:21:03] jynus: ^ [12:21:39] (03CR) 10Marostegui: [C: 04-2] "Do not submit, it is an example" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418898 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [12:21:45] (03PS2) 10Volans: Mariadb backups: improve role puppettization [puppet] - 10https://gerrit.wikimedia.org/r/418894 [12:22:39] (03CR) 10Volans: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/418894 (owner: 10Volans) [12:24:30] (03CR) 10Volans: "New compiler results:" [puppet] - 10https://gerrit.wikimedia.org/r/418894 (owner: 10Volans) [12:24:32] jynus: updated, but feel free to amend it yourself if you want a different description of 
the role or anything else [12:28:00] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/418894 (owner: 10Volans) [12:44:20] !log start a catalog compilation on elnath to check for puppetdb4 diffs - T177253 [12:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:25] T177253: Upgrade PuppetDB to version 4.4 - https://phabricator.wikimedia.org/T177253 [12:46:27] 10Operations, 10hardware-requests: eqiad/codfw: (4)+(4) hardware access request for videoscalers - https://phabricator.wikimedia.org/T188075#4042863 (10Joe) Instead of buying more hardware (specifically: 1 server per dc) we should reshuffle things so that we have more videoscaling capacity (that is - reassign... [12:46:51] 10Operations, 10Ops-Access-Requests, 10Research, 10Research-collaborations, and 2 others: Request access to data for Wikimedia Donation Patterns research - https://phabricator.wikimedia.org/T188945#4042866 (10DYNKM) Heyo; anything new on this? [12:56:19] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Create an LVS endpoint for jobrunners on videoscalers - https://phabricator.wikimedia.org/T188947#4042934 (10Joe) I'd take the chance we have to do this to do as follows: # Add `mediawiki::multimedia` to the jobrunners # Add a second LVS I... [12:57:18] (03CR) 10Filippo Giunchedi: "jenkins fails with wmf-style violation, worth adding an explicit silence until the conversion to role/profile is completed:" [puppet] - 10https://gerrit.wikimedia.org/r/415328 (owner: 10Muehlenhoff) [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European Mid-day SWAT(Max 8 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180312T1300). 
[13:00:05] Lucas_WMDE, marlier, revi, and Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:08] \o/ [13:00:12] I’m here :) [13:00:15] rockin [13:00:23] I can SWAT today [13:00:54] Lucas_WMDE, marlier, revi, and Urbanecm: the usual question, do you want to deploy yourself? (if you can) [13:00:56] fyi, there’s some risk my change might need to be rolled back… I have no idea if that means it would be better to have it earlier or later in SWAT :) [13:01:01] I have no permission :P [13:01:05] I’ll keep an eye on Grafana in any case [13:01:08] and no permission, yeah :) [13:01:25] Lucas_WMDE, marlier, revi, and Urbanecm: also, is there anything I should know about your patch, needs a script to run, takes a long time to test, it's risky... [13:01:47] none I'm aware of, for today [13:02:13] zeljkof: mine is slightly risky, though I think the bugs that led to the incident last time are thoroughly fixed (it was an odd combination of edge cases) [13:02:17] zeljkof: my patch is very low risk -- making a config change that had been live only on testwiki into the default for all wikis. 
[13:03:02] Lucas_WMDE: you are first then; marlier, revi, and Urbanecm: please stand by, your patch is important to us ;) [13:03:14] :D [13:03:16] thanks [13:03:17] ;) [13:03:23] (03PS2) 10Zfilipin: Enable caching of constraint check results [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416748 (https://phabricator.wikimedia.org/T184812) (owner: 10Lucas Werkmeister (WMDE)) [13:04:27] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416748 (https://phabricator.wikimedia.org/T184812) (owner: 10Lucas Werkmeister (WMDE)) [13:05:17] Lucas_WMDE, marlier, revi, and Urbanecm: your patches will be available for testing at mwdebug1002 before deployment, please let me know if your patc(es) can not be tested there, or if you do not know how to test there [13:05:46] Sounds good, I'm already configured for testing against that host. [13:05:49] <_joe_> zeljkof: did SWAT moved 1 hour earlier? [13:05:51] I’m ready to test mine [13:06:05] (03Merged) 10jenkins-bot: Enable caching of constraint check results [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416748 (https://phabricator.wikimedia.org/T184812) (owner: 10Lucas Werkmeister (WMDE)) [13:06:16] <_joe_> *move, even [13:06:19] _joe_: US changed time zones, and deployment windows are locked to SF time... [13:06:24] <_joe_> ooo right [13:06:31] <_joe_> the ops meeting too, I have to remember [13:06:38] <_joe_> thanks jouncebot for reminding me [13:06:42] all the meetings :| [13:07:52] Lucas_WMDE: patch is merged, it's at deployment server, syncing it to mwdebug now, the first sync usually takes a few minutes... 
[13:08:02] ok [13:08:26] (03CR) 10jenkins-bot: Enable caching of constraint check results [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416748 (https://phabricator.wikimedia.org/T184812) (owner: 10Lucas Werkmeister (WMDE)) [13:11:16] Lucas_WMDE: the patch is at mwdebug, please test and let me know if I can deploy [13:12:10] (03PS2) 10Zfilipin: wmf-config: enable Singapore oversample as default on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417331 (https://phabricator.wikimedia.org/T188652) (owner: 10Imarlier) [13:13:10] zeljkof: do mwdebug requests use the same database servers as regular ones? [13:13:19] marlier: please stand by, you are next [13:13:24] Standing by [13:13:24] Lucas_WMDE: Yes, lol [13:13:28] okay, thanks :) [13:13:43] just wanted to know if I was looking at the right Grafana board ^^ [13:13:48] zeljkof: looks good [13:13:55] I think you can go ahead [13:13:59] (03CR) 10Filippo Giunchedi: Decom restbase-test cluster and role (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/415827 (https://phabricator.wikimedia.org/T186755) (owner: 10Filippo Giunchedi) [13:14:08] Lucas_WMDE: ok, deploying [13:15:34] _joe_: nobody is on ops clinic duty? [13:16:23] hm, just got this during scap :| `on mw2270.codfw.wmnet returned [255]: ssh: Could not resolve hostname mw2270.codfw.wmnet: Name or service not known` [13:16:27] (03PS2) 10Filippo Giunchedi: Decom restbase-test cluster and role [puppet] - 10https://gerrit.wikimedia.org/r/415827 (https://phabricator.wikimedia.org/T186755) [13:16:45] and sync-apaches is taking forever... [13:17:07] zeljkof: fixed [13:17:16] akosiaris: thanks! 
[13:17:22] !log zfilipin@tin Synchronized wmf-config/Wikibase-production.php: SWAT: [[gerrit:416748|Enable caching of constraint check results (T184812)]] (duration: 03m 09s) [13:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:30] T184812: Enable constraint result caching on Wikidata - https://phabricator.wikimedia.org/T184812 [13:17:48] vgutierrez: ^. Per the SRE meeting notes. I see ema and arzhel will be helping. I can help too [13:17:48] vgutierrez: scap trouble :( `13:17:22 36 apaches had sync errors` [13:18:13] um, just to make sure… is someone aware of the rising s4 lag plotted here https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&panelId=6&fullscreen&from=1520838988875&to=1520860588875&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All [13:18:18] first `on mw2270.codfw.wmnet returned [255]: ssh: Could not resolve hostname mw2270.codfw.wmnet: Name or service not known` [13:19:17] there is no mw2270 [13:19:23] Lucas_WMDE: that seems to have started earlier than swat, right? [13:19:33] yeah, I don’t think it’s related, I just happened to see it [13:19:47] akosiaris: scap somehow thinks there is, what to do? :( [13:20:00] zeljkof: I am looking into it [13:20:06] akosiaris: thanks! [13:20:19] 10Operations, 10ops-codfw: attach furud's new arrays (furud-array[3-7]) - https://phabricator.wikimedia.org/T185153#4042975 (10faidon) a:05faidon>03Papaul Thanks for taking care of this before your trip! I checked this out last week, and it seemed then (and now that I double-checked it) that only three she... [13:20:43] zeljkof: but per https://phabricator.wikimedia.org/T188301 it should not be there [13:20:52] I think I know what's going on [13:20:58] Lucas_WMDE, marlier, revi, and Urbanecm: problems with scap, please stand by [13:21:06] beep beep [13:21:17] you got it [13:21:59] <_joe_> zeljkof: mw2270 ? 
I'm not sure it's live [13:22:10] <_joe_> lemme check what's going on [13:22:20] _joe_: yeah fixing. it's aac4d5119f8 that's the culrprit [13:22:56] <_joe_> akosiaris: ouch, how asinine of me [13:22:57] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/415827 (https://phabricator.wikimedia.org/T186755) (owner: 10Filippo Giunchedi) [13:23:10] <_joe_> I checked racktables not puppet [13:23:20] <_joe_> it's racked but still not operating [13:23:30] yeah I am commenting it out [13:23:38] (03PS1) 10Alexandros Kosiaris: Comment out mw2270 for now from scap_proxies [puppet] - 10https://gerrit.wikimedia.org/r/418908 (https://phabricator.wikimedia.org/T188301) [13:23:50] zeljkof: ok problem found, we will need another 2 mins and you should be good to go [13:24:01] !log synchronised PHP 7.2.3 to thirdparty/php72 for stretch-wikimedia [13:24:03] (03CR) 10BBlack: "I think fe Resp is probably still a spam issue that's going to make it hard to tell the real problems." 
[puppet] - 10https://gerrit.wikimedia.org/r/418580 (owner: 10Ema) [13:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:12] * zeljkof thumbs up [13:24:15] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/10397/" [puppet] - 10https://gerrit.wikimedia.org/r/415827 (https://phabricator.wikimedia.org/T186755) (owner: 10Filippo Giunchedi) [13:24:21] <_joe_> akosiaris: I would rather use mw2254 instead [13:24:32] <_joe_> but I can re-add it later [13:24:40] yeah please do that [13:25:01] (03CR) 10Alexandros Kosiaris: [C: 032] Comment out mw2270 for now from scap_proxies [puppet] - 10https://gerrit.wikimedia.org/r/418908 (https://phabricator.wikimedia.org/T188301) (owner: 10Alexandros Kosiaris) [13:26:14] (03CR) 10Zfilipin: [C: 031] wmf-config: enable Singapore oversample as default on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417331 (https://phabricator.wikimedia.org/T188652) (owner: 10Imarlier) [13:26:24] (03PS3) 10Jcrespo: Mariadb backups: improve role puppettization [puppet] - 10https://gerrit.wikimedia.org/r/418894 (owner: 10Volans) [13:27:06] (03CR) 10Jcrespo: [C: 032] Mariadb backups: improve role puppettization [puppet] - 10https://gerrit.wikimedia.org/r/418894 (owner: 10Volans) [13:27:41] (03CR) 10Zfilipin: [C: 031] Disable upload for non-admins on kowikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417189 (https://phabricator.wikimedia.org/T189021) (owner: 10Revi) [13:27:50] zeljkof: and you are good to go [13:28:15] akosiaris: thanks! running the same scap command that failed... [13:29:35] (03PS4) 10Rush: openstack: nova-compute jessie mitaka setup [puppet] - 10https://gerrit.wikimedia.org/r/417455 (https://phabricator.wikimedia.org/T188266) [13:30:19] akosiaris: scap does not look good... [13:30:33] meaning ? 
[13:30:33] stuck at `sync-apaches: 86% (ok: 238; fail: 0; left: 36)` [13:31:00] I'll know more in a minute or two, it usually does not stay stuck for long [13:31:34] !log zfilipin@tin Synchronized wmf-config/Wikibase-production.php: SWAT: [[gerrit:416748|Enable caching of constraint check results (T184812)]] (duration: 03m 08s) [13:31:35] if it's waiting for a dns lookup timout [13:31:36] (03CR) 10Muehlenhoff: openstack: nova-compute jessie mitaka setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/417455 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [13:31:36] problems [13:31:36] lol [13:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:40] T184812: Enable constraint result caching on Wikidata - https://phabricator.wikimedia.org/T184812 [13:32:07] 10Operations, 10ops-codfw, 10netops: Interface errors on cr2-codfw: xe-5/3/1 - https://phabricator.wikimedia.org/T189452#4043057 (10Papaul) p:05Triage>03High [13:32:25] akosiaris: a lot of these... 
`13:31:30 ['/usr/bin/scap', 'pull', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/Wikibase-production.php', 'mw1284.eqiad.wmnet', 'mw1319.eqiad.wmnet', 'mw1280.eqiad.wmnet', 'mw2215.codfw.wmnet', 'mw2187.codfw.wmnet', 'mw1250.eqiad.wmnet', 'mw1313.eqiad.wmnet'] on mw2123.codfw.wmnet returned [255]: ssh: connect to host mw2123.codfw.wmnet port 22: Connection timed [13:32:25] out` [13:32:43] but not only for mw2123 [13:33:05] `13:31:34 36 apaches had sync errors` [13:33:59] akosiaris: thx [13:34:06] akosiaris: I can copy/paste output to phab if it would help [13:34:13] zeljkof: please do [13:34:19] vgutierrez: ^ [13:34:19] I am looking into it in the meantime [13:35:42] RECOVERY - Host ripe-atlas-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 239.36 ms [13:36:14] akosiaris, vgutierrez: https://phabricator.wikimedia.org/P6832 [13:36:24] if I don't respond to any ping within 1 or 2 minutes, consider me gone asleep [13:36:36] (and skip me today) [13:36:52] revi: ok, sorry for the delay, but scap trouble today [13:36:59] no worries, things happens [13:37:06] (03CR) 10BBlack: varnishslowlog: add fetch overhead introduced by varnish (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/418603 (owner: 10Ema) [13:37:17] I'm just feeling bit sleepy so you don't merge stuff when I'm in fact sleeping :P [13:37:38] revi: thanks for letting me know :) [13:38:59] (03CR) 10Zfilipin: [C: 031] Remove obsolete throttle rules, add one new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417687 (https://phabricator.wikimedia.org/T189241) (owner: 10Urbanecm) [13:40:02] those hosts should not be there [13:40:05] this is weird [13:42:06] (03PS5) 10Rush: openstack: nova-compute jessie mitaka setup [puppet] - 10https://gerrit.wikimedia.org/r/417455 (https://phabricator.wikimedia.org/T188266) [13:42:08] (03CR) 10Zfilipin: "This was scheduled for SWAT today, but we have problems with scap, this will probably not be deployed because we will run out of 
time. Nei" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/414758 (https://phabricator.wikimedia.org/T187894) (owner: 10Urbanecm) [13:43:06] zeljkof: +1 guarantees it will be deployed today? [13:43:22] PROBLEM - Confd template for /etc/dsh/group/mediawiki-installation on tin is CRITICAL: File not found: /etc/dsh/group/mediawiki-installation [13:43:26] nothing ever guarantees anything in this field [13:43:27] ;) [13:43:31] hm true [13:43:43] lol, tin is not happy [13:43:47] revi: no, just that I have reviewed it :) I will probably remove my votes at the end of the SWAT for commits that are not deployed [13:43:48] yeah that's me [13:43:53] kk! [13:44:01] (03CR) 10Zfilipin: [C: 031] Add ruwikimedia to wikidataclient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415580 (https://phabricator.wikimedia.org/T188456) (owner: 10Urbanecm) [13:45:14] Lucas_WMDE, marlier, revi, and Urbanecm: we are still having problems with deployment, there is a chance your patch(es) will not be deployed today, sorry about that [13:45:53] zeljkof: as far as I can tell my change is in effect right now, is that correct? [13:45:59] No worries, standing by [13:46:03] (which is fine by me, I just want to make sure it’s no surprise to you) [13:46:22] Lucas_WMDE: this is current status `sync-apaches: 100% (ok: 238; fail: 36; left: 0)` [13:47:02] then I guess some requests are hitting those 238 ok servers at leas [13:47:07] * Lucas_WMDE knows nothing about the apache setup [13:47:41] <_joe_> all requests go to the rigt serrvers [13:47:43] Lucas_WMDE: about 13% failure :( [13:47:50] <_joe_> the ones failing are in codfw and are decommissioned [13:48:16] ok, that is better news than I have expected :) [13:52:04] _joe_, vgutierrez, akosiaris: do you have an estimate if you will be able to fix the problem during SWAT window? [13:52:22] RECOVERY - Confd template for /etc/dsh/group/mediawiki-installation on tin is OK: No errors detected [13:52:24] zeljkof: about 5 mins I think. 
I think we nailed it [13:52:51] <_joe_> zeljkof: I hope so, but can you overrun by 10 minutes in case [13:52:55] <_joe_> ? [13:53:08] (03CR) 10Rush: [C: 032] openstack: nova-compute jessie mitaka setup [puppet] - 10https://gerrit.wikimedia.org/r/417455 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [13:53:46] _joe_: I'll ask if anything is urgent, or if it can be deployed later/tomorrow [13:53:49] 10Operations, 10ops-codfw: attach furud's new arrays (furud-array[3-7]) - https://phabricator.wikimedia.org/T185153#4043113 (10Papaul) Yes that is correct we have only 3 shelves 36 disks visible the reason being that we have 3 shelves (36 disks) connected to port 2 and 4 shelves (48 disks) connected to port 1... [13:54:31] (03PS1) 10Alexandros Kosiaris: Revert "mediawiki scap: add labweb1001 and 1002 targets" [puppet] - 10https://gerrit.wikimedia.org/r/418913 (https://phabricator.wikimedia.org/T168470) [13:54:44] Lucas_WMDE, marlier, revi, and Urbanecm: Sorry, looks like we will only have time for one patch from Lucas_WMDE. Is your patch urgent, or can it be deployed later today or tomorrow? [13:54:52] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Revert "mediawiki scap: add labweb1001 and 1002 targets" [puppet] - 10https://gerrit.wikimedia.org/r/418913 (https://phabricator.wikimedia.org/T168470) (owner: 10Alexandros Kosiaris) [13:54:52] rescheduling [13:54:53] <_joe_> akosiaris: -1 [13:54:55] <_joe_> argh [13:55:10] <_joe_> you just needed to remove the first thing (the cluster) stanza [13:55:54] Mine can go later, I'll reschedule on the deployments page. [13:56:01] rescheduled. [13:56:03] goodnight! 
[13:56:19] <_joe_> sorry :/ [13:56:29] mine isn’t urgent, but I don’t know if you want to roll it back again [13:56:34] <_joe_> our monitoring should've caught this [13:56:35] _joe_: yeah I just decided to revert it as a whole just to be sure [13:56:43] <_joe_> akosiaris: that's ok [13:56:55] Lucas_WMDE: if scap is functional again, I'll try deploying again [13:57:06] zeljkof: give me 30 secs [13:57:12] akosiaris: sure [13:57:14] making sure this actually worked [13:57:23] <_joe_> it should've [13:57:56] <_joe_> it did [13:58:18] yeah looks like it [13:58:19] (03PS1) 10Rush: openstack: labtestn comment out base::firewall for now [puppet] - 10https://gerrit.wikimedia.org/r/418914 (https://phabricator.wikimedia.org/T188266) [13:58:20] <_joe_> zeljkof: I think you're ok to carry on, we'll just need to run scap sync on labweb by hand [13:58:38] _joe_: thanks! I'll try deploying again [13:58:40] let's see what crops it's ugly head now [13:58:44] (03CR) 10Zfilipin: wmf-config: enable Singapore oversample as default on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417331 (https://phabricator.wikimedia.org/T188652) (owner: 10Imarlier) [13:58:44] its* [13:58:49] Lucas_WMDE: please stand by, deploying again :) [13:58:54] (03CR) 10Rush: [C: 032] openstack: labtestn comment out base::firewall for now [puppet] - 10https://gerrit.wikimedia.org/r/418914 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [13:59:12] (03CR) 10Zfilipin: Disable upload for non-admins on kowikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417189 (https://phabricator.wikimedia.org/T189021) (owner: 10Revi) [13:59:14] (03PS2) 10Rush: openstack: labtestn comment out base::firewall for now [puppet] - 10https://gerrit.wikimedia.org/r/418914 (https://phabricator.wikimedia.org/T188266) [13:59:24] (03CR) 10Zfilipin: Remove obsolete throttle rules, add one new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417687 (https://phabricator.wikimedia.org/T189241) 
(owner: 10Urbanecm) [13:59:42] (03CR) 10Zfilipin: Add ruwikimedia to wikidataclient [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415580 (https://phabricator.wikimedia.org/T188456) (owner: 10Urbanecm) [13:59:57] !log zfilipin@tin Synchronized wmf-config/Wikibase-production.php: SWAT: [[gerrit:416748|Enable caching of constraint check results (T184812)]] (duration: 00m 57s) [14:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:02] T184812: Enable constraint result caching on Wikidata - https://phabricator.wikimedia.org/T184812 [14:00:20] Lucas_WMDE: deployed! please monitor relevant logs and thanks for deploying with #releng! ;) [14:00:21] RECOVERY - puppet last run on labtestvirt2001 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [14:00:22] I am guessing this ^ means we are ok [14:00:23] nice [14:00:32] akosiaris: all good! thanks! [14:00:36] zeljkof: thank you very much, always a pleasure to work with #releng :) [14:00:36] yw [14:01:28] akosiaris, _joe_, vgutierrez: thanks for your help, scap is happy again :) [14:01:31] (03PS1) 10Giuseppe Lavagetto: scap::dsh: re-add labweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/418915 [14:01:47] <_joe_> akosiaris: ^^ [14:02:30] !log EU SWAT finished [14:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:46] (03CR) 10Brian Wolff: "> Then which is a meanful way? I must create a RFC to fire Mattflaschen?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408073 (https://phabricator.wikimedia.org/T186463) (owner: 10Zoranzoki21) [14:03:05] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] "So the first stanza in the first patch caused issues as it was not yet in conftool data and as a result confd would not update. 
I 've reve" [puppet] - 10https://gerrit.wikimedia.org/r/418913 (https://phabricator.wikimedia.org/T168470) (owner: 10Alexandros Kosiaris) [14:06:18] (03PS2) 10Giuseppe Lavagetto: scap::dsh: re-add labweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/418915 [14:07:14] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/415827 (https://phabricator.wikimedia.org/T186755) (owner: 10Filippo Giunchedi) [14:07:27] (03PS3) 10Giuseppe Lavagetto: scap::dsh: re-add labweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/418915 [14:07:41] RECOVERY - MD RAID on labtestvirt2003 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [14:08:01] RECOVERY - dhclient process on labtestvirt2003 is OK: PROCS OK: 0 processes with command name dhclient [14:08:10] RECOVERY - configured eth on labtestvirt2003 is OK: OK - interfaces up [14:08:21] RECOVERY - Disk space on labtestvirt2003 is OK: DISK OK [14:08:21] RECOVERY - DPKG on labtestvirt2003 is OK: All packages OK [14:08:37] (03CR) 10Giuseppe Lavagetto: [C: 032] scap::dsh: re-add labweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/418915 (owner: 10Giuseppe Lavagetto) [14:09:06] !log sbisson@tin Started deploy [tilerator/deploy@4bcae95]: Deploying tilerator#update-deps for testing on maps-test* [14:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:39] !log sbisson@tin Finished deploy [tilerator/deploy@4bcae95]: Deploying tilerator#update-deps for testing on maps-test* (duration: 00m 34s) [14:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:22] (03PS1) 10Subramanya Sastry: Enable RemexHTML on wikis with < 25 errors in high-priority categories [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418918 (https://phabricator.wikimedia.org/T188010) [14:16:30] PROBLEM - DPKG on labtestvirt2003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:18:30] PROBLEM - Host ripe-atlas-eqsin is DOWN: PING CRITICAL - Packet loss = 
100% [14:18:30] PROBLEM - Host ripe-atlas-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:19:30] RECOVERY - DPKG on labtestvirt2003 is OK: All packages OK [14:22:20] (03CR) 10MarcoAurelio: "Unnaceptable. T189489." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416346 (https://phabricator.wikimedia.org/T188862) (owner: 10MarcoAurelio) [14:23:40] RECOVERY - Host ripe-atlas-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 239.40 ms [14:23:40] RECOVERY - Host ripe-atlas-eqsin is UP: PING OK - Packet loss = 0%, RTA = 245.15 ms [14:26:26] (03PS1) 10BBlack: text-be: use short hfm time for cacheable+cookie case [puppet] - 10https://gerrit.wikimedia.org/r/418920 (https://phabricator.wikimedia.org/T181315) [14:27:10] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 21 probes of 297 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [14:28:41] (03CR) 10BBlack: [C: 032] text-be: use short hfm time for cacheable+cookie case [puppet] - 10https://gerrit.wikimedia.org/r/418920 (https://phabricator.wikimedia.org/T181315) (owner: 10BBlack) [14:32:10] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 9 probes of 297 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [14:33:49] PROBLEM - Check systemd state on labtestvirt2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:34:40] (03CR) 10Addshore: [C: 04-1] "Should this not first be enabled on beta somewhere?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/418843 (https://phabricator.wikimedia.org/T184000) (owner: 10Gergő Tisza) [14:35:14] (03PS1) 10Ottomata: Use versioned refinery-job .jar for json_refine job [puppet] - 10https://gerrit.wikimedia.org/r/418923 [14:35:53] 10Operations, 10netops, 10Patch-For-Review, 10Performance-Team (Radar), 10Performance-Team-notice: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840#4043288 (10ayounsi) 05Open>03Resolved This is done, all peers are up with proper new ASN. AS43821 is not in use anywhere in esams. [14:35:59] RECOVERY - IPMI Sensor Status on labtestvirt2003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [14:37:50] RECOVERY - Check systemd state on labtestvirt2003 is OK: OK - running: The system is fully operational [14:38:44] (03CR) 10Filippo Giunchedi: [C: 04-1] "Still unmergeable" [puppet] - 10https://gerrit.wikimedia.org/r/410136 (https://phabricator.wikimedia.org/T175243) (owner: 10Gehel) [14:39:38] (03PS3) 10Filippo Giunchedi: Decom restbase-test cluster and role [puppet] - 10https://gerrit.wikimedia.org/r/415827 (https://phabricator.wikimedia.org/T186755) [14:40:50] PROBLEM - Check systemd state on labtestvirt2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[14:41:27] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/10401/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/418923 (owner: 10Ottomata) [14:41:29] (03CR) 10Ottomata: [C: 032] Use versioned refinery-job .jar for json_refine job [puppet] - 10https://gerrit.wikimedia.org/r/418923 (owner: 10Ottomata) [14:41:33] (03PS2) 10Ottomata: Use versioned refinery-job .jar for json_refine job [puppet] - 10https://gerrit.wikimedia.org/r/418923 [14:41:37] (03CR) 10Ottomata: [V: 032 C: 032] Use versioned refinery-job .jar for json_refine job [puppet] - 10https://gerrit.wikimedia.org/r/418923 (owner: 10Ottomata) [14:42:11] (03PS4) 10Filippo Giunchedi: Decom restbase-test cluster and role [puppet] - 10https://gerrit.wikimedia.org/r/415827 (https://phabricator.wikimedia.org/T186755) [14:42:54] !log upgrade and restart es2001 [14:42:59] (03Abandoned) 10Ottomata: Parse raw user_agent out of raw eventlogging client side event [puppet] - 10https://gerrit.wikimedia.org/r/415691 (https://phabricator.wikimedia.org/T188673) (owner: 10Ottomata) [14:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:41] (03CR) 10Filippo Giunchedi: [C: 032] Decom restbase-test cluster and role [puppet] - 10https://gerrit.wikimedia.org/r/415827 (https://phabricator.wikimedia.org/T186755) (owner: 10Filippo Giunchedi) [14:43:54] 10Operations, 10ops-eqiad, 10Discovery, 10Wikidata, and 3 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#4043303 (10faidon) p:05High>03Unbreak! ``` faidon@re0.cr1-eqiad> show arp no-resolve | match 10.64.0.17 78:2b:cb:2d:fa:e6 10.64.0.17 ae1.1017 none faidon@... 
[14:49:09] RECOVERY - Check systemd state on labtestvirt2003 is OK: OK - running: The system is fully operational [14:50:55] (03PS4) 10Muehlenhoff: Allow to selectively run time servers on Chrony (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/393581 [14:51:23] 10Operations, 10ops-codfw: attach furud's new arrays (furud-array[3-7]) - https://phabricator.wikimedia.org/T185153#4043360 (10faidon) I'd like all the 5 shelves (array3-7) connected to furud, but //not// the 2 old ones (array1-2) until further notice. Can we just bypass array1-2 by disconnecting them entirely... [14:51:40] (03CR) 10jerkins-bot: [V: 04-1] Allow to selectively run time servers on Chrony (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/393581 (owner: 10Muehlenhoff) [14:53:59] 10Operations, 10ops-eqsin, 10Traffic, 10netops: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#4043362 (10faidon) a:05faidon>03ayounsi Just heard from RIPE: ``` I just finished the provisioning of sg-sin-as14907.anchors.atlas.ripe.net and noticed that port 5666 is filtered.... 
[14:54:45] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests, and 3 others: Decommission restbase-test environment - https://phabricator.wikimedia.org/T186755#4043364 (10fgiunchedi) [14:55:16] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests, 10Patch-For-Review: Decommission xenon, cerium, praseodymium - https://phabricator.wikimedia.org/T187446#4043366 (10fgiunchedi) [14:55:28] 10Operations, 10ops-codfw, 10DC-Ops, 10hardware-requests: Decommission restbase-test200[123] - https://phabricator.wikimedia.org/T187447#4043368 (10fgiunchedi) [14:55:32] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for diamond [puppet] - 10https://gerrit.wikimedia.org/r/418926 (https://phabricator.wikimedia.org/T135991) [14:57:02] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests, and 3 others: Decommission restbase-test environment - https://phabricator.wikimedia.org/T186755#3954137 (10fgiunchedi) All hosts in this task and its subtasks are ready for decom (running as spare systems now) [14:59:29] 10Operations, 10ops-eqsin, 10Traffic, 10netops: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#4043392 (10ayounsi) Should be good now for eqsin. [15:02:21] !log joal@tin Started deploy [analytics/refinery@fd0a90f]: Regular a [15:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:00] Hi team - Sorry for the bad-looking deployment message - keyboard slippy [15:03:42] no problem, maybe you can ! log manually the full message, if you want? 
[15:07:13] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.20 (duration: 03m 58s) [15:07:15] !log joal@tin Finished deploy [analytics/refinery@fd0a90f]: Regular a (duration: 04m 54s) [15:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:36] jy [15:07:48] PROBLEM - etcd request latencies on neon is CRITICAL: CRITICAL - etcd_request_latencies is 127085 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:07:53] jynus: I'll do that - it's a regular deploy so no big deal [15:08:09] PROBLEM - Request latencies on neon is CRITICAL: CRITICAL - apiserver_request_latencies is 178116 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:08:35] !log Provide correct log message for analytics/refinery scap deploy: Regular deploy of analytics-hadoop code [15:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:41] Thanks for the hint jynus [15:08:44] <_joe_> uhm akosiaris is that you on neon? [15:09:10] no [15:09:34] it's reporting a wrong dashboard btw [15:10:09] RECOVERY - Request latencies on neon is OK: OK - apiserver_request_latencies is 24296 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:10:12] _joe_: this one is real [15:10:17] https://grafana.wikimedia.org/dashboard/db/kubernetes-staging-api?orgId=1 [15:10:20] there was spike [15:10:39] and etcd latencies of 300ms [15:10:48] RECOVERY - etcd request latencies on neon is OK: OK - etcd_request_latencies is 4044 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:10:59] <_joe_> which is not horrible I guess [15:11:05] even more.. close to 500ms [15:11:10] anyway this alert worked! [15:11:15] and we haven't even done anything [15:11:16] <_joe_> it's CAS request [15:11:17] :) [15:11:26] <_joe_> indeed [15:11:38] I am guessing some CAS was too slow ? 
[15:11:38] !log ppchelko@tin Started deploy [cpjobqueue/deploy@5686f16]: Decrease refreshLinks concurrency to 120 [15:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:09] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@5686f16]: Decrease refreshLinks concurrency to 120 (duration: 00m 31s) [15:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:34] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.21 (duration: 02m 35s) [15:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:42] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.22 [keeping static files] (duration: 01m 22s) [15:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:26] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for exim4/sender config [puppet] - 10https://gerrit.wikimedia.org/r/418930 (https://phabricator.wikimedia.org/T135991) [15:22:57] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for exim4/sender config [puppet] - 10https://gerrit.wikimedia.org/r/418930 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:23:44] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for exim4/sender config [puppet] - 10https://gerrit.wikimedia.org/r/418930 (https://phabricator.wikimedia.org/T135991) [15:24:12] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for exim4/sender config [puppet] - 10https://gerrit.wikimedia.org/r/418930 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:24:14] !log lvs1007,lvs1010 upgraded pybal to 1.15.2 [15:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:34] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for exim4/sender config [puppet] - 10https://gerrit.wikimedia.org/r/418930 (https://phabricator.wikimedia.org/T135991) [15:26:08] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.23 [keeping static files] 
(duration: 01m 19s) [15:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:41] (03CR) 10Gilles: "I've prepared PrivateSettings.php, but we'll need to coordinate running setZoneAccess with merging and deploying this, because running it " [puppet] - 10https://gerrit.wikimedia.org/r/414631 (https://phabricator.wikimedia.org/T187822) (owner: 10Gilles) [15:28:22] !log gilles@tin Synchronized private/PrivateSettings.php.example: Thumbor private wiki support deployment: [[gerrit:414631| Set up separate Thumbor Swift user for private containers (T187822)]] (duration: 00m 54s) [15:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:27] T187822: Have Thumbor use a different Swift user when dealing with private containers - https://phabricator.wikimedia.org/T187822 [15:35:22] (03PS1) 10Andrew Bogott: labweb: add wikitech-static monitoring to labweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/418932 (https://phabricator.wikimedia.org/T168470) [15:36:41] (03PS2) 10Andrew Bogott: labweb: add wikitech-static monitoring to labweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/418932 (https://phabricator.wikimedia.org/T168470) [15:37:07] !log disabling puppet on kafka1020,1022,1023 to test partition.assigment.strategy change for mirror maker [15:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:10] !log bouncing kafka main-eqiad -> jumbo-eqiad mirror maker instances [15:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:47] jouncebot: next [15:41:47] In 1 hour(s) and 18 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180312T1700) [15:43:21] !log eqsin LVSs: upgrade pybal to 1.15.2 [15:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:29] (03PS6) 10Gehel: wdqs: cleanup JVM options for blazegraph [puppet] - 
10https://gerrit.wikimedia.org/r/388026 (https://phabricator.wikimedia.org/T175919) [15:46:45] PROBLEM - HHVM jobrunner on mw1334 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [15:47:05] RECOVERY - Long running screen/tmux on labtestvirt2003 is OK: OK: No SCREEN or tmux processes detected. [15:47:45] RECOVERY - HHVM jobrunner on mw1334 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [15:47:53] (03CR) 10Andrew Bogott: [C: 032] labweb: add wikitech-static monitoring to labweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/418932 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [15:48:02] (03CR) 10Gilles: varnishslowlog: filter on all timestamps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/418580 (owner: 10Ema) [15:48:25] (03PS7) 10Gehel: wdqs: cleanup JVM options for blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/388026 (https://phabricator.wikimedia.org/T175919) [15:49:13] (03PS9) 10Gehel: wdqs: cleanup JVM options for blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/388026 (https://phabricator.wikimedia.org/T175919) [15:49:50] (03CR) 10Gehel: [C: 032] wdqs: cleanup JVM options for blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/388026 (https://phabricator.wikimedia.org/T175919) (owner: 10Gehel) [15:51:30] !log restart blazegraph on wdqs2001 to validate new config - T175919 [15:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:35] T175919: investigate GC times on wikidata query service - https://phabricator.wikimedia.org/T175919 [15:51:41] (03PS1) 10Ottomata: Use roundrobin partition.assignment.strategy for Kafka MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/418934 (https://phabricator.wikimedia.org/T189464) [15:53:30] (03CR) 10Elukey: [C: 031] Use roundrobin partition.assignment.strategy for Kafka MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/418934 
(https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [15:55:40] 10Operations, 10UniversalLanguageSelector, 10I18n: MB Lateefi Fonts for Sindhi Wikipedia. - https://phabricator.wikimedia.org/T138136#4043602 (10mehtab.ahmed) @Aklapper this is good, and this task should be removed now. [15:59:46] 10Operations, 10ops-eqsin, 10Traffic, 10netops: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#4043614 (10ayounsi) a:05ayounsi>03faidon [16:00:10] (03PS1) 10Andrew Bogott: silver: remove wikitech, mark as spare [puppet] - 10https://gerrit.wikimedia.org/r/418941 (https://phabricator.wikimedia.org/T168470) [16:10:31] 10Operations, 10Ops-Access-Requests: Requesting deployment access for samwilson - https://phabricator.wikimedia.org/T189414#4043642 (10MoritzMuehlenhoff) p:05Triage>03Normal [16:19:48] (03PS1) 10Rush: openstack: nova-compute on mitaka and jessie changes [puppet] - 10https://gerrit.wikimedia.org/r/418945 (https://phabricator.wikimedia.org/T188266) [16:21:27] 10Operations, 10OTRS, 10Stewards-and-global-tools: https://meta.wikimedia.org/wiki/Special:Contact/Stewards is being abused by spammers - https://phabricator.wikimedia.org/T188985#4043673 (10akosiaris) No complaints in 6 days, I consider the problem resolved. I 'll keep this open for a few more days so that... 
[16:25:37] !log joal@tin Started deploy [analytics/refinery@1ef2e27]: Deploy patch over regula rdeploy bug [16:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:15] (03PS2) 10Rush: openstack: nova-compute on mitaka and jessie changes [puppet] - 10https://gerrit.wikimedia.org/r/418945 (https://phabricator.wikimedia.org/T188266) [16:34:27] !log joal@tin Finished deploy [analytics/refinery@1ef2e27]: Deploy patch over regula rdeploy bug (duration: 08m 50s) [16:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:15] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): Rebuild raids on labvirt1019 and 1020 - https://phabricator.wikimedia.org/T187373#4043737 (10RobH) Update: My understanding is this is now awating Chris to open a support case with HP about this. Once we have that, if they don't provide a solut... [16:36:57] PROBLEM - Request latencies on neon is CRITICAL: CRITICAL - apiserver_request_latencies is 121098 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:37:57] RECOVERY - Request latencies on neon is OK: OK - apiserver_request_latencies is 5017 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:42:19] 10Operations, 10Packaging, 10Scap, 10Patch-For-Review: Install git-lfs client (at least on scap targets & masters) - https://phabricator.wikimedia.org/T180628#4043780 (10akosiaris) The scap targets that would benefit from this (namely `ores*` boxes) now have git-lfs installed. @mmodell do we also need this... [17:00:04] gehel: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180312T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. 
[17:04:58] (03Abandoned) 10Elukey: eventlogging: remove zmq-forwarder [puppet] - 10https://gerrit.wikimedia.org/r/416471 (https://phabricator.wikimedia.org/T114199) (owner: 10Elukey) [17:05:28] Krinkle: o/ - any plans for https://gerrit.wikimedia.org/r/#/c/415218/ ? We are scheduling the migration of EL data from Analytics to Jumbo [17:06:23] 10Operations, 10DBA, 10Gerrit, 10Patch-For-Review, 10Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532#4043863 (10Dzahn) Let's not set it to role::spare::system please. That would mean activel... [17:06:29] (03CR) 10Smalyshev: [C: 031] Add consumer ID to Updater launch string [puppet] - 10https://gerrit.wikimedia.org/r/418873 (https://phabricator.wikimedia.org/T188716) (owner: 10Gehel) [17:09:07] 10Operations, 10Ops-Access-Requests, 10Maps-Sprint, 10Patch-For-Review: Give Roan Kattouw the rights to deploy maps and restart maps-related services - https://phabricator.wikimedia.org/T189153#4033165 (10Vgutierrez) This has been approved during today's operations meeting. 
[17:10:24] (03CR) 10Dzahn: [C: 031] Give Roan the privileges to restart maps-related services [puppet] - 10https://gerrit.wikimedia.org/r/417230 (https://phabricator.wikimedia.org/T189153) (owner: 10Muehlenhoff) [17:13:11] (03CR) 10Filippo Giunchedi: [C: 031] Enable base::service_auto_restart for diamond [puppet] - 10https://gerrit.wikimedia.org/r/418926 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [17:14:24] (03PS3) 10Dzahn: Add romd.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/412896 (https://phabricator.wikimedia.org/T187184) (owner: 10Urbanecm) [17:15:15] (03CR) 10Rush: [C: 032] openstack: nova-compute on mitaka and jessie changes [puppet] - 10https://gerrit.wikimedia.org/r/418945 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [17:15:23] (03CR) 10Filippo Giunchedi: [C: 031] Enable base::service_auto_restart for exim4/sender config [puppet] - 10https://gerrit.wikimedia.org/r/418930 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [17:15:25] (03CR) 10Dzahn: [C: 032] "approval on meeting etherpad" [puppet] - 10https://gerrit.wikimedia.org/r/417230 (https://phabricator.wikimedia.org/T189153) (owner: 10Muehlenhoff) [17:15:33] (03PS2) 10Dzahn: Give Roan the privileges to restart maps-related services [puppet] - 10https://gerrit.wikimedia.org/r/417230 (https://phabricator.wikimedia.org/T189153) (owner: 10Muehlenhoff) [17:15:43] (03PS1) 10Elukey: eventlogging: reduce eventlog1001's scope [puppet] - 10https://gerrit.wikimedia.org/r/418953 (https://phabricator.wikimedia.org/T114199) [17:16:35] (03CR) 10jerkins-bot: [V: 04-1] eventlogging: reduce eventlog1001's scope [puppet] - 10https://gerrit.wikimedia.org/r/418953 (https://phabricator.wikimedia.org/T114199) (owner: 10Elukey) [17:21:39] 10Operations, 10Ops-Access-Requests, 10Maps-Sprint, 10Patch-For-Review: Give Roan Kattouw the rights to deploy maps and restart maps-related services - https://phabricator.wikimedia.org/T189153#4033165 (10Dzahn) ``` 
[maps1001:~] $ id catrope uid=546(catrope) gid=500(wikidev) groups=500(wikidev),758(tilerat... [17:21:49] (03PS2) 10Gehel: Add consumer ID to Updater launch string [puppet] - 10https://gerrit.wikimedia.org/r/418873 (https://phabricator.wikimedia.org/T188716) [17:22:03] (03CR) 10Elukey: [V: 032 C: 032] "Pcc: https://puppet-compiler.wmflabs.org/compiler02/10406/" [puppet] - 10https://gerrit.wikimedia.org/r/418953 (https://phabricator.wikimedia.org/T114199) (owner: 10Elukey) [17:22:17] (03PS2) 10Elukey: eventlogging: reduce eventlog1001's scope [puppet] - 10https://gerrit.wikimedia.org/r/418953 (https://phabricator.wikimedia.org/T114199) [17:22:19] (03CR) 10Elukey: [V: 032 C: 032] eventlogging: reduce eventlog1001's scope [puppet] - 10https://gerrit.wikimedia.org/r/418953 (https://phabricator.wikimedia.org/T114199) (owner: 10Elukey) [17:22:33] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Create an LVS endpoint for jobrunners on videoscalers - https://phabricator.wikimedia.org/T188947#4043978 (10mobrovac) These jobs are not high-traffic, so consolidating the job runners and spreading the load all over them sounds like a good... 
[17:22:42] (03CR) 10Dzahn: [C: 032] Add romd.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/412896 (https://phabricator.wikimedia.org/T187184) (owner: 10Urbanecm) [17:22:46] (03CR) 10Gehel: [C: 032] Add consumer ID to Updater launch string [puppet] - 10https://gerrit.wikimedia.org/r/418873 (https://phabricator.wikimedia.org/T188716) (owner: 10Gehel) [17:23:16] (03PS3) 10Gehel: Add consumer ID to Updater launch string [puppet] - 10https://gerrit.wikimedia.org/r/418873 (https://phabricator.wikimedia.org/T188716) [17:24:29] !log gehel@tin Started deploy [wdqs/wdqs@ce72538]: new wdqs updater [17:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:26] 10Operations, 10Ops-Access-Requests, 10Maps-Sprint, 10Patch-For-Review: Give Roan Kattouw the rights to deploy maps and restart maps-related services - https://phabricator.wikimedia.org/T189153#4044011 (10Dzahn) 05Open>03Resolved a:03Catrope [17:27:11] <_joe_> !log poweroff mw2097-2134, T189111 [17:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:17] T189111: decommission mw2097-mw2134 - https://phabricator.wikimedia.org/T189111 [17:27:37] 10Operations, 10Ops-Access-Requests, 10Research, 10Research-collaborations, and 2 others: Request access to data for Wikimedia Donation Patterns research - https://phabricator.wikimedia.org/T188945#4044017 (10Dzahn) a:03Vgutierrez [17:29:16] !log gehel@tin Finished deploy [wdqs/wdqs@ce72538]: new wdqs updater (duration: 04m 47s) [17:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:30] 10Operations, 10UniversalLanguageSelector, 10I18n: MB Lateefi Fonts for Sindhi Wikipedia. - https://phabricator.wikimedia.org/T138136#4044022 (10Aklapper) 05stalled>03declined Thanks a lot for the clarification! 
[17:30:16] SMalyshev: deployment completed, tests are green, updater restarted (but still on RC Updates, no kafka poller yet) [17:30:18] (03PS1) 10Rush: openstack: nova-compute /var/lib/nova/${certname}.key [puppet] - 10https://gerrit.wikimedia.org/r/418957 (https://phabricator.wikimedia.org/T188266) [17:30:47] (03PS2) 10Rush: openstack: nova-compute /var/lib/nova/${certname}.key [puppet] - 10https://gerrit.wikimedia.org/r/418957 (https://phabricator.wikimedia.org/T188266) [17:30:49] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova-compute /var/lib/nova/${certname}.key [puppet] - 10https://gerrit.wikimedia.org/r/418957 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [17:31:10] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova-compute /var/lib/nova/${certname}.key [puppet] - 10https://gerrit.wikimedia.org/r/418957 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [17:31:27] (03PS3) 10Rush: openstack: nova-compute /var/lib/nova/certname.key [puppet] - 10https://gerrit.wikimedia.org/r/418957 (https://phabricator.wikimedia.org/T188266) [17:31:52] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova-compute /var/lib/nova/certname.key [puppet] - 10https://gerrit.wikimedia.org/r/418957 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [17:32:51] (03PS4) 10Rush: openstack: nova-compute certname libvirt group [puppet] - 10https://gerrit.wikimedia.org/r/418957 (https://phabricator.wikimedia.org/T188266) [17:32:54] (03PS5) 10Rush: openstack: nova-compute /var/lib/nova/certname.key [puppet] - 10https://gerrit.wikimedia.org/r/418957 (https://phabricator.wikimedia.org/T188266) [17:33:30] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova-compute /var/lib/nova/certname.key [puppet] - 10https://gerrit.wikimedia.org/r/418957 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [17:34:59] (03PS6) 10Rush: openstack: set certname libvirt group [puppet] - 10https://gerrit.wikimedia.org/r/418957 (https://phabricator.wikimedia.org/T188266) [17:38:09] (03PS1) 
10Giuseppe Lavagetto: puppet: remove all references to mw2097-2134 [puppet] - 10https://gerrit.wikimedia.org/r/418958 (https://phabricator.wikimedia.org/T189111) [17:38:27] (03PS1) 10Ottomata: Use proper kafka protocol version for varnishkafka webrequest [puppet] - 10https://gerrit.wikimedia.org/r/418959 [17:38:59] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet: remove all references to mw2097-2134 [puppet] - 10https://gerrit.wikimedia.org/r/418958 (https://phabricator.wikimedia.org/T189111) (owner: 10Giuseppe Lavagetto) [17:39:36] (03PS4) 10Ahmed123: Revert "Restrict FlaggedRevs to only operated on NS_MAIN on arwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418700 (https://phabricator.wikimedia.org/T148603) [17:39:48] (03CR) 10Rush: [C: 032] openstack: set certname libvirt group [puppet] - 10https://gerrit.wikimedia.org/r/418957 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [17:39:55] (03PS7) 10Rush: openstack: set certname libvirt group [puppet] - 10https://gerrit.wikimedia.org/r/418957 (https://phabricator.wikimedia.org/T188266) [17:40:28] (03CR) 10Zoranzoki21: [C: 031] Revert "Restrict FlaggedRevs to only operated on NS_MAIN on arwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/418700 (https://phabricator.wikimedia.org/T148603) (owner: 10Ahmed123) [17:40:54] (03PS2) 10Ottomata: Use proper kafka protocol version for varnishkafka webrequest [puppet] - 10https://gerrit.wikimedia.org/r/418959 [17:41:04] 10Operations, 10Ops-Access-Requests: Requesting deployment access for samwilson - https://phabricator.wikimedia.org/T189414#4044059 (10kaldari) Approved! 
[17:42:17] (03PS3) 10Ottomata: Use proper kafka protocol version for varnishkafka webrequest [puppet] - 10https://gerrit.wikimedia.org/r/418959 [17:42:56] (03PS4) 10Ottomata: Use proper kafka protocol version for varnishkafka webrequest [puppet] - 10https://gerrit.wikimedia.org/r/418959 [17:43:10] 10Operations, 10Ops-Access-Requests: Requesting access to terbium.eqiad.wmnet for bmansurov - https://phabricator.wikimedia.org/T189285#4037542 (10Dzahn) This could be solved by adding bmansurov to either one of these groups: ``` admin::groups: - restricted - deployment - ldap-admins - maintenance-lo... [17:43:51] RECOVERY - puppet last run on labtestvirt2003 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [17:45:13] (03PS5) 10Ottomata: Use proper kafka protocol version for varnishkafka webrequest [puppet] - 10https://gerrit.wikimedia.org/r/418959 [17:45:45] (03CR) 10jerkins-bot: [V: 04-1] Use proper kafka protocol version for varnishkafka webrequest [puppet] - 10https://gerrit.wikimedia.org/r/418959 (owner: 10Ottomata) [17:46:13] (03PS6) 10Ottomata: Use proper kafka protocol version for varnishkafka webrequest [puppet] - 10https://gerrit.wikimedia.org/r/418959 [17:46:52] (03CR) 10Elukey: [C: 031] Use proper kafka protocol version for varnishkafka webrequest [puppet] - 10https://gerrit.wikimedia.org/r/418959 (owner: 10Ottomata) [17:47:05] (03CR) 10Ottomata: [C: 032] Use proper kafka protocol version for varnishkafka webrequest [puppet] - 10https://gerrit.wikimedia.org/r/418959 (owner: 10Ottomata) [17:48:32] (03PS1) 10Giuseppe Lavagetto: Decommission mw2097-mw2134 [dns] - 10https://gerrit.wikimedia.org/r/418960 (https://phabricator.wikimedia.org/T189111) [17:48:55] !log removed kafka.protocol.version setting for varnishkafka webrequest instances; version should now be properly negotiated [17:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:47] (03CR) 10Dzahn: [C: 032] "what about the ".m." 
mobile name?" [dns] - 10https://gerrit.wikimedia.org/r/412896 (https://phabricator.wikimedia.org/T187184) (owner: 10Urbanecm) [17:50:58] (03CR) 10Dzahn: "also add the ".m." as Jayprakash already commented?" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/417199 (https://phabricator.wikimedia.org/T188366) (owner: 10Urbanecm) [17:54:42] (03CR) 10Giuseppe Lavagetto: [C: 032] Decommission mw2097-mw2134 [dns] - 10https://gerrit.wikimedia.org/r/418960 (https://phabricator.wikimedia.org/T189111) (owner: 10Giuseppe Lavagetto) [17:58:22] (03PS1) 10Ottomata: Proerply configury canary varnishkafka to send eventlogging to analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/418962 [17:58:58] (03CR) 10Ottomata: [C: 032] Proerply configury canary varnishkafka to send eventlogging to analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/418962 (owner: 10Ottomata) [18:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for Morning SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180312T1800). [18:00:04] marlier: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:01:09] here [18:02:00] (03CR) 10Mforns: "I added some comments just for reference." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/405727 (https://phabricator.wikimedia.org/T174386) (owner: 10Fdans) [18:02:51] 10Operations, 10Packaging, 10Scap, 10Patch-For-Review: Install git-lfs client (at least on scap targets & masters) - https://phabricator.wikimedia.org/T180628#4044173 (10mmodell) I think it's needed on masters at least for deployers to issue git-lfs commands if not for scap itself. 
[18:04:28] (03PS2) 10Andrew Bogott: silver: remove wikitech, mark as spare [puppet] - 10https://gerrit.wikimedia.org/r/418941 (https://phabricator.wikimedia.org/T168470) [18:04:50] 10Operations, 10Puppet: Update jmx_exporter mbeans whitelist for puppetdb 4 - https://phabricator.wikimedia.org/T189516#4044175 (10fgiunchedi) p:05Triage>03Normal [18:08:19] So, uh....is there not going to be a SWAT deploy? [18:09:28] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review, and 2 others: decommission mw2097-mw2134 - https://phabricator.wikimedia.org/T189111#4044214 (10Joe) [18:09:35] Urbanecm: what about mobile URLs for the new chapter wikis? [18:09:40] !log ppchelko@tin Started deploy [restbase/deploy@754aa8c]: Enable ensure_content_type filter for summaries [18:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:50] i guess we should add ".m." right away [18:10:16] mutante, langcode.m.wikimedia.org will do from my POV [18:10:43] Urbanecm: but it depends whether the wiki has MobileFrontend extension enabled ? [18:10:51] or not anymore [18:11:19] i forget, but i remember we tried to be more consistent about it and added missing .m. a lot [18:12:46] mutante, well, every wiki has MobileFrontend enabled, or am I wrong? [18:12:57] anyway, should I add it or are you going to add the .m.? [18:13:22] Urbanecm: add it please for .hi. and both if you like [18:13:41] what does "both" mean in your message? [18:13:47] hi and romd [18:13:49] ok [18:14:36] 'wmgMobileFrontend' => [ [18:14:37] 'default' => true, [18:14:39] ack [18:15:10] elukey: Hoping to roll out today or tomorrow (re: coal/kafka) [18:15:17] Just need a final check. 
[18:15:24] awesome, thanks :) [18:16:21] (03PS1) 10Urbanecm: Add mobile DNS entry for romdwikimedia [dns] - 10https://gerrit.wikimedia.org/r/418965 (https://phabricator.wikimedia.org/T187184) [18:16:33] (03CR) 10jerkins-bot: [V: 04-1] Add mobile DNS entry for romdwikimedia [dns] - 10https://gerrit.wikimedia.org/r/418965 (https://phabricator.wikimedia.org/T187184) (owner: 10Urbanecm) [18:17:21] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/references/{title} (retrieve structured reference data for the Cat article on English Wikipedia) is WARNING: Test retrieve structured reference data for the Cat article on English Wikipedia responds with unexpected value at path /reference_lists[1]/id = [18:17:37] (03PS2) 10Urbanecm: Add mobile DNS entry for romdwikimedia [dns] - 10https://gerrit.wikimedia.org/r/418965 (https://phabricator.wikimedia.org/T187184) [18:17:41] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/references/{title} (retrieve structured reference data for the Cat article on English Wikipedia) is WARNING: Test retrieve structured reference data for the Cat article on English Wikipedia responds with unexpected value at path /reference_lists[1]/id = [18:18:05] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review, and 2 others: decommission mw2097-mw2134 - https://phabricator.wikimedia.org/T189111#4044259 (10RobH) So I reviewed this with @joe, and we had to make the implicient decision to skip some of the decom steps. Specifically, these systems do... [18:18:29] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review, and 2 others: decommission mw2097-mw2134 - https://phabricator.wikimedia.org/T189111#4044261 (10RobH) a:05RobH>03Papaul @papaul: You can take this over for onsite disk wipes at this time. 
[18:18:30] (03PS2) 10Urbanecm: Add hi.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/417199 (https://phabricator.wikimedia.org/T188366) [18:18:30] ^ mobileapps issue is known, fix will be deployed during the upcoming service deploy window [18:18:31] mutante, done [18:18:40] we should be able to ack these mobileapps page/references errors until the next Services deploy window [18:18:57] I just don't have the permission to ack them myself [18:19:08] bearND: yeah.... i forgot how that's done :) [18:19:16] if you don't then i don't imagine i do either [18:19:22] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/references/{title} (retrieve structured reference data for the Cat article on English Wikipedia) is WARNING: Test retrieve structured reference data for the Cat article on English Wikipedia responds with unexpected value at path /reference_lists[1]/id = [18:19:29] (03CR) 10Dzahn: [C: 032] Add mobile DNS entry for romdwikimedia [dns] - 10https://gerrit.wikimedia.org/r/418965 (https://phabricator.wikimedia.org/T187184) (owner: 10Urbanecm) [18:20:01] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/references/{title} (retrieve structured reference data for the Cat article on English Wikipedia) is WARNING: Test retrieve structured reference data for the Cat article on English Wikipedia responds with unexpected value at path /reference_lists[1]/id = [18:20:05] mdholloway: maybe mutante knows if we should be able to ack these mobileapps warnings [18:20:10] (03PS3) 10Dzahn: Add hi.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/417199 (https://phabricator.wikimedia.org/T188366) (owner: 10Urbanecm) [18:20:22] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/references/{title} (retrieve structured reference data for the Cat article on English Wikipedia) is WARNING: Test retrieve structured reference data for the Cat article on English Wikipedia responds 
with unexpected value at path /reference_lists[1]/id = [18:20:27] (03CR) 10Dzahn: [C: 032] Add hi.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/417199 (https://phabricator.wikimedia.org/T188366) (owner: 10Urbanecm) [18:20:44] bearND: honestly we probably should have permission to do that, though. i wonder who to bother about that. [18:21:08] 10Operations, 10ops-codfw, 10netops: audit codfw switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519#4044272 (10RobH) p:05Triage>03Normal [18:21:20] do you get emails about the icinga alerts? [18:21:27] if you do then you should also have permissions to ack them [18:21:34] it's based on being a contact for them [18:21:36] mutante: no, i don't [18:21:51] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/references/{title} (retrieve structured reference data for the Cat article on English Wikipedia) is WARNING: Test retrieve structured reference data for the Cat article on English Wikipedia responds with unexpected value at path /reference_lists[1]/id = [18:22:01] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/references/{title} (retrieve structured reference data for the Cat article on English Wikipedia) is WARNING: Test retrieve structured reference data for the Cat article on English Wikipedia responds with unexpected value at path /reference_lists[1]/id = [18:22:28] mutante: I don't get any email notification for these either [18:22:31] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/references/{title} (retrieve structured reference data for the Cat article on English Wikipedia) is WARNING: Test retrieve structured reference data for the Cat article on English Wikipedia responds with unexpected value at path /reference_lists[1]/id = [18:22:34] bearND: how long should they be in maintenance? 
[18:23:07] does "until next state change" [18:23:21] ACKNOWLEDGEMENT - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/references/{title} (retrieve structured reference data for the Cat article on English Wikipedia) is WARNING: Test retrieve structured reference data for the Cat article on English Wikipedia responds with unexpected value at path /reference_lists[1]/id = daniel_zahn deployment [18:23:34] mutante: the next deploy should happen in 2 hours, maybe give it 3 hours tops [18:24:05] is this something that affects actual mobile users? [18:24:45] mutante: no, it's not user-facing. actually, this endpoint isn't even publicly exposed yet. [18:24:47] bearND: it's going to stop alerting until the state changes to OK [18:24:47] mutante: no, that's for a new endpoint that isn't even exposed through restbase yet. Nobody is using it [18:25:04] !log ppchelko@tin Finished deploy [restbase/deploy@754aa8c]: Enable ensure_content_type filter for summaries (duration: 15m 25s) [18:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:24] ok, cool. [18:25:34] so i can fix the Icinga thing if we can make a ticket for it please [18:25:53] like let me know which services you think you should have it for and i will add contacts for you [18:26:07] then you will get email and permissions to ack/schedule downtime [18:26:27] for just "your" services (which you are a contact for), not all [18:28:05] the ACK is better than "schedule downtime" as long as it's not flapping and just a clean "down" and then later "up" again. it will automatically reactivate once it changes back to OK [18:28:34] that way you don't have to remember to activate it again or know the exact duration [18:28:55] mutante: ok, will do! bearND, just you and me for mobileapps, then? [18:30:21] mutante: mdholloway: sounds good to me.
Maybe we should add the new readinglists services, too [18:30:35] bearND: ah, good point [18:31:36] mutante: for readinglists (not sure what the exact spelling is) you'd want to add tgr, too. [18:35:53] 10Operations, 10ops-codfw, 10ops-eqiad, 10netops: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519#4044380 (10faidon) [18:37:04] mutante: what project should i tag that ticket with, btw? [18:37:07] 10Operations, 10ops-codfw, 10ops-eqiad, 10netops: Audit switch ports/descriptions/enable - https://phabricator.wikimedia.org/T189519#4044272 (10faidon) I just ran into a similar thing today in eqiad with T188045, so I reworded the task to make it generic and for both data centers. I also added a sentence t... [18:37:31] mdholloway: 'operations' and 'monitoring' will do [18:37:41] mutante: ok, thanks! incoming shortly [18:38:01] Since the 2PM SWAT deploy didn't happen, presumably I should just reschedule my change to the next SWAT window? [18:40:13] bearND: ideally please add on the ticket list of people and list of services. 
i'll make "contactgroups" for it [18:40:20] mdholloway: cool [18:40:53] 10Operations, 10netops: Detect IP address collisions - https://phabricator.wikimedia.org/T189522#4044398 (10faidon) p:05Triage>03High [18:43:03] !log added to DNS: hi.wikimedia.org (and hi.m) for Hindi Wikimedian User Group [18:43:04] (03PS1) 10Chad: Gerrit 2.14.7 [software/gerrit] (stable-2.14) - 10https://gerrit.wikimedia.org/r/418966 [18:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:01] Urbanecm: thanks ..and done [18:44:06] yw [18:44:42] !log added to DNS: romd.wikimedia.org (and romd.m) for Wikimedians of Romania and Moldova User Group [18:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:14] bearND: i wonder if tgr|away should get mobileapps alerts too, just for completeness [18:46:17] mdholloway: that's up to him [18:47:26] marlier: regarding the swat questions (not puppet-swat) try asking directly in -releng or a releng team member if you are still waiting for a reply [18:47:39] 10Operations, 10Traffic: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#4044439 (10faidon) [18:47:44] okay, thanks [18:53:01] !log Clean up left-over .wsp.bak files under frontend.navtiming* on graphite1001 (following T179622) [18:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:07] T179622: Update our Graphite metrics for current retention rules - https://phabricator.wikimedia.org/T179622 [18:58:26] bearND: for now i'm going to file the task with just us as contacts for mobileapps. if tgr|away wants to be a mobileapps alert contact we can do that later. 
[18:59:07] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for katielin (katie) - https://phabricator.wikimedia.org/T187623#4044465 (10katielin) a:05MeganHernandez_WMF>03jrobell @jrobell - could you sponsor the req... [18:59:20] 10Operations, 10monitoring: Add Reading Infrastructure engineers to contacts for RI-maintained services - https://phabricator.wikimedia.org/T189524#4044468 (10Mholloway) [18:59:24] PROBLEM - Host mw2111 is DOWN: PING CRITICAL - Packet loss = 100% [18:59:24] PROBLEM - Host mw2112 is DOWN: PING CRITICAL - Packet loss = 100% [18:59:24] PROBLEM - Host mw2113 is DOWN: PING CRITICAL - Packet loss = 100% [18:59:24] PROBLEM - Host mw2114 is DOWN: PING CRITICAL - Packet loss = 100% [18:59:24] PROBLEM - Host mw2115 is DOWN: PING CRITICAL - Packet loss = 100% [18:59:24] PROBLEM - Host mw2116 is DOWN: PING CRITICAL - Packet loss = 100% [18:59:24] PROBLEM - Host mw2117 is DOWN: PING CRITICAL - Packet loss = 100% [18:59:25] PROBLEM - Host mw2118 is DOWN: PING CRITICAL - Packet loss = 100% [18:59:25] PROBLEM - Host mw2119 is DOWN: PING CRITICAL - Packet loss = 100% [18:59:26] PROBLEM - Host mw2120 is DOWN: PING CRITICAL - Packet loss = 100% [18:59:26] PROBLEM - Host mw2121 is DOWN: PING CRITICAL - Packet loss = 100% [18:59:27] PROBLEM - Host mw2122 is DOWN: PING CRITICAL - Packet loss = 100% [19:05:41] (03PS5) 10Ottomata: Apply geocode, deduplicate and monitoring for refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/417287 (https://phabricator.wikimedia.org/T186833) [19:09:31] (03PS6) 10Ottomata: Apply geocode, deduplicate and monitoring for refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/417287 (https://phabricator.wikimedia.org/T186833) [19:12:14] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review, and 2 others: decommission mw2097-mw2134 - https://phabricator.wikimedia.org/T189111#4044513 (10RobH) So a bunch of these just 
alerted in icinga: ``` 11:59 < icinga-wm> : PROBLEM - Host mw2111 is DOWN: PING CRITICAL - Packet loss = 100... [19:13:16] (03PS7) 10Ottomata: Apply geocode, deduplicate and monitoring for refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/417287 (https://phabricator.wikimedia.org/T186833) [19:19:07] (03PS8) 10Ottomata: Apply geocode, deduplicate and monitoring for refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/417287 (https://phabricator.wikimedia.org/T186833) [19:23:19] (03PS9) 10Ottomata: Apply geocode, deduplicate and monitoring for refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/417287 (https://phabricator.wikimedia.org/T186833) [19:27:50] 10Operations, 10monitoring, 10Services (watching): Add Reading Infrastructure engineers to contacts for RI-maintained services - https://phabricator.wikimedia.org/T189524#4044554 (10mobrovac) [19:30:14] ACKNOWLEDGEMENT - Host mw2111 is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black Decom noted in T189111 [19:30:14] ACKNOWLEDGEMENT - Host mw2112 is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black Decom noted in T189111 [19:30:14] ACKNOWLEDGEMENT - Host mw2113 is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black Decom noted in T189111 [19:30:14] ACKNOWLEDGEMENT - Host mw2114 is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black Decom noted in T189111 [19:30:14] ACKNOWLEDGEMENT - Host mw2115 is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black Decom noted in T189111 [19:30:14] ACKNOWLEDGEMENT - Host mw2116 is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black Decom noted in T189111 [19:30:14] ACKNOWLEDGEMENT - Host mw2117 is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black Decom noted in T189111 [19:30:15] ACKNOWLEDGEMENT - Host mw2118 is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black Decom noted in T189111 [19:30:15] ACKNOWLEDGEMENT - Host mw2119 is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black Decom noted in T189111 [19:30:16] ACKNOWLEDGEMENT - Host 
mw2120 is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black Decom noted in T189111 [19:30:16] ACKNOWLEDGEMENT - Host mw2121 is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black Decom noted in T189111 [19:30:17] ACKNOWLEDGEMENT - Host mw2122 is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black Decom noted in T189111 [19:32:08] 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review, and 2 others: decommission mw2097-mw2134 - https://phabricator.wikimedia.org/T189111#4031698 (10BBlack) (I acked those with a ref to this ticket for now, to reduce overall icinga redness) [19:33:47] 10Operations, 10Ops-Access-Requests: Requesting access to terbium.eqiad.wmnet for bmansurov - https://phabricator.wikimedia.org/T189285#4044564 (10DarTar) This is approved on my end, if manager approval is needed. Thanks for getting the ball rolling, @bmansurov. [19:35:35] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/10411/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/417287 (https://phabricator.wikimedia.org/T186833) (owner: 10Ottomata) [19:36:06] (03CR) 10Ottomata: [C: 032] Apply geocode, deduplicate and monitoring for refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/417287 (https://phabricator.wikimedia.org/T186833) (owner: 10Ottomata) [19:36:20] 10Operations, 10monitoring, 10Services (watching): Add Reading Infrastructure engineers to contacts for RI-maintained services - https://phabricator.wikimedia.org/T189524#4044572 (10Dzahn) a:03Dzahn [19:44:37] !log labstore1003:~# exportfs -ra [19:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:54] 10Operations, 10Performance-Team: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4044589 (10Krinkle) [19:48:43] !log labstore1003:~# service nfs-kernel-server restar [19:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:29] (03PS1) 
10Ottomata: Add --queue opt default to production for Refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/418985 (https://phabricator.wikimedia.org/T186833) [19:50:38] (03CR) 10Paladox: [C: 031] "tested on https://gerrit.git.wmflabs.org/r/#/q/status:open and works :)" [software/gerrit] (stable-2.14) - 10https://gerrit.wikimedia.org/r/418966 (owner: 10Chad) [19:51:34] (03PS1) 10BryanDavis: beta: Enable password authn for Beta Cluster logstash [puppet] - 10https://gerrit.wikimedia.org/r/418986 (https://phabricator.wikimedia.org/T161051) [19:51:46] (03CR) 10Ottomata: [C: 032] Add --queue opt default to production for Refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/418985 (https://phabricator.wikimedia.org/T186833) (owner: 10Ottomata) [19:53:48] !log disabled 2FA for User:Ctac (T189520) [19:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: Your horoscope predicts another unfortunate Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180312T2000). [20:00:05] No GERRIT patches in the queue for this window AFAICS. 
[20:11:24] !log mholloway-shell@tin Started deploy [mobileapps/deploy@c764714]: Update mobileapps to 5c90db7 [20:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:56] (03CR) 10BryanDavis: "Cherry-picked to deployment-puppetmaster02 and applied on deployment-logstash2" [puppet] - 10https://gerrit.wikimedia.org/r/418986 (https://phabricator.wikimedia.org/T161051) (owner: 10BryanDavis) [20:14:03] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/css/preview (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/css/pageview (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 404 (expecting: 200) [20:14:44] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [20:14:47] (03PS3) 10Andrew Bogott: silver: remove wikitech, mark as spare [puppet] - 10https://gerrit.wikimedia.org/r/418941 (https://phabricator.wikimedia.org/T168470) [20:15:03] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [20:15:53] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [20:15:59] (03CR) 10Andrew Bogott: [C: 032] silver: remove wikitech, mark as spare [puppet] - 10https://gerrit.wikimedia.org/r/418941 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [20:16:03] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [20:16:43] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [20:16:53] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@c764714]: Update mobileapps to 5c90db7 (duration: 05m 29s) [20:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:49] (03PS1) 10Rush: openstack: dealing with libvirt differences trusty vs jessie [puppet] - 10https://gerrit.wikimedia.org/r/418989 
(https://phabricator.wikimedia.org/T188266) [20:18:15] (03PS2) 10Rush: openstack: dealing with libvirt differences trusty vs jessie [puppet] - 10https://gerrit.wikimedia.org/r/418989 (https://phabricator.wikimedia.org/T188266) [20:18:49] (03CR) 10jerkins-bot: [V: 04-1] openstack: dealing with libvirt differences trusty vs jessie [puppet] - 10https://gerrit.wikimedia.org/r/418989 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [20:20:41] mdholloway: have you noticed the 'page/css/preview' spec error ^^^? [20:21:40] (03CR) 10BBlack: varnishslowlog: filter on all timestamps (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/418580 (owner: 10Ema) [20:22:32] bearND: i noticed that. but then there was a recovery message saying all endpoints were healthy for mobileapps LVS codfw [20:22:46] i'm looking at the icinga dashboard to try and double-check [20:22:55] mdholloway: ah, good. Yeah, I see no active alerts anymore. :) [20:23:46] bearND: ok, good, not sure why they would have 404'd to begin with, but i guess we're in the clear now [20:24:13] mdholloway: yeah, that was weird [20:24:34] (03PS3) 10Rush: openstack: dealing with libvirt differences trusty vs jessie [puppet] - 10https://gerrit.wikimedia.org/r/418989 (https://phabricator.wikimedia.org/T188266) [20:24:48] (03PS4) 10Rush: openstack: dealing with libvirt differences trusty vs jessie [puppet] - 10https://gerrit.wikimedia.org/r/418989 (https://phabricator.wikimedia.org/T188266) [20:25:23] !log stopping apache2 on Silver in anticipation of it being decommissioned [20:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:29] !log apt-get upgrade and reboot on wikitech-static [20:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:36] (03PS6) 10Herron: WIP: puppetdbquery: upgrade to 3.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/410050 (https://phabricator.wikimedia.org/T187259) [20:27:08] (03CR) 10jerkins-bot: [V: 04-1] 
WIP: puppetdbquery: upgrade to 3.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/410050 (https://phabricator.wikimedia.org/T187259) (owner: 10Herron) [20:27:24] (03Draft1) 10Paladox: Gerrit: Remove quotes around cookiePath [puppet] - 10https://gerrit.wikimedia.org/r/418990 [20:27:28] (03PS2) 10Paladox: Gerrit: Remove quotes around cookiePath [puppet] - 10https://gerrit.wikimedia.org/r/418990 [20:27:31] !log arlolra@tin Started deploy [parsoid/deploy@174c87d]: Updating Parsoid to 16ced34 [20:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:48] (03PS3) 10Paladox: Gerrit: Remove quotes around cookiePath [puppet] - 10https://gerrit.wikimedia.org/r/418990 [20:28:12] (03PS7) 10Herron: WIP: puppetdbquery: upgrade to 3.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/410050 (https://phabricator.wikimedia.org/T187259) [20:28:51] (03CR) 10jerkins-bot: [V: 04-1] WIP: puppetdbquery: upgrade to 3.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/410050 (https://phabricator.wikimedia.org/T187259) (owner: 10Herron) [20:28:53] (03CR) 10Chad: [C: 031] "Yeah, this isn't a must-be-quoted character. I was being overly cautious." 
[puppet] - 10https://gerrit.wikimedia.org/r/418990 (owner: 10Paladox) [20:31:37] (03CR) 10Rush: [C: 032] openstack: dealing with libvirt differences trusty vs jessie [puppet] - 10https://gerrit.wikimedia.org/r/418989 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [20:32:07] (03PS11) 10Paladox: Phabricator: Support php 7.2 under stretch [puppet] - 10https://gerrit.wikimedia.org/r/410245 (https://phabricator.wikimedia.org/T182832) [20:36:11] (03PS1) 10Rush: openstack: nova::compute::audit change [puppet] - 10https://gerrit.wikimedia.org/r/418991 (https://phabricator.wikimedia.org/T188266) [20:36:17] !log updated wikitech-static as detailed in https://wikitech.wikimedia.org/wiki/Wikitech-static#Manual_updates [20:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:48] (03PS3) 10BBlack: varnishslowlog: filter on all timestamps [puppet] - 10https://gerrit.wikimedia.org/r/418580 (owner: 10Ema) [20:37:47] !log arlolra@tin Finished deploy [parsoid/deploy@174c87d]: Updating Parsoid to 16ced34 (duration: 10m 16s) [20:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:54] andrewbogott: that alerted (not paged tho) fyi [20:39:57] chasemp: which? 
[20:40:16] andrewbogott: ** PROBLEM alert - Wikitech-static web interface/Wikitech-static main page is CRITICAL ** [20:40:40] huh, I marked it in downtime before the reboot [20:42:33] weird, maybe didn't quite catch it [20:44:51] !log Updated Parsoid to 16ced34 (T188670, T90902) [20:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:00] T90902: Non-breaking space in header ID breaks anchor - https://phabricator.wikimedia.org/T90902 [20:45:00] T188670: Expecting : in parser function definiton - https://phabricator.wikimedia.org/T188670 [20:45:10] andrewbogott: apparently that was shinken, so awesome (/s) [20:45:43] So it's checked in two places… should probably just kill that shinken check [20:46:11] (03CR) 10Krinkle: [C: 031] "LGTM. A few observations from testing in screen for a few minutes:" [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [20:46:43] (03PS12) 10Paladox: Phabricator: Support php 7.2 under stretch [puppet] - 10https://gerrit.wikimedia.org/r/410245 (https://phabricator.wikimedia.org/T182832) [20:49:47] (03Abandoned) 10Rush: openstack: nova::compute::audit change [puppet] - 10https://gerrit.wikimedia.org/r/418991 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [20:50:40] (03PS13) 10Paladox: Phabricator: Support php 7.2 under stretch [puppet] - 10https://gerrit.wikimedia.org/r/410245 (https://phabricator.wikimedia.org/T182832) [20:51:42] (03CR) 10Krinkle: [C: 031] "Lastly: Wasn't able to cleanly close the process with ctrl-C. See updated gist."
[puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [20:56:13] (03PS4) 10BBlack: varnishslowlog: filter on all timestamps [puppet] - 10https://gerrit.wikimedia.org/r/418580 (owner: 10Ema) [20:56:15] (03PS3) 10BBlack: varnishslowlog: add fetch overhead introduced by varnish [puppet] - 10https://gerrit.wikimedia.org/r/418603 (owner: 10Ema) [20:56:53] (03CR) 10jerkins-bot: [V: 04-1] varnishslowlog: add fetch overhead introduced by varnish [puppet] - 10https://gerrit.wikimedia.org/r/418603 (owner: 10Ema) [20:59:30] (03PS4) 10BBlack: varnishslowlog: add Backend-Timing D=, in seconds [puppet] - 10https://gerrit.wikimedia.org/r/418603 (owner: 10Ema) [21:00:04] bawolff and Reedy: It is that lovely time of the day again! You are hereby commanded to deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180312T2100). [21:00:04] No GERRIT patches in the queue for this window AFAICS. [21:17:11] (03PS1) 10Rush: openstack: labtestn neutron* host eth1.2120 ip [puppet] - 10https://gerrit.wikimedia.org/r/419016 (https://phabricator.wikimedia.org/T188266) [21:17:53] (03CR) 10Krinkle: varnishslowlog: filter on all timestamps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/418580 (owner: 10Ema) [21:19:15] (03CR) 10Rush: [C: 032] openstack: labtestn neutron* host eth1.2120 ip [puppet] - 10https://gerrit.wikimedia.org/r/419016 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [21:22:01] (03CR) 10Krinkle: varnishslowlog: add Backend-Timing D=, in seconds (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/418603 (owner: 10Ema) [21:22:33] (03CR) 10Krinkle: "Might want to mention T131894 in the commit message. 
Not exactly the same in terms of implementation, but would certainly serve the same p" [puppet] - 10https://gerrit.wikimedia.org/r/418603 (owner: 10Ema) [21:23:06] (03PS1) 10Rush: openstack: labtestn/net role uncomment config [puppet] - 10https://gerrit.wikimedia.org/r/419018 (https://phabricator.wikimedia.org/T188266) [21:23:18] mutante: vgutierrez: could use a merge on https://gerrit.wikimedia.org/r/#/c/417221/ [21:25:33] (03CR) 10Chad: [C: 031] "Can we please merge so we can avoid having this cherry-picked for awhile. It's beta only" [puppet] - 10https://gerrit.wikimedia.org/r/418986 (https://phabricator.wikimedia.org/T161051) (owner: 10BryanDavis) [21:26:04] (03CR) 10Rush: [C: 032] openstack: labtestn/net role uncomment config [puppet] - 10https://gerrit.wikimedia.org/r/419018 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [21:30:34] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Someday): Add support for stretch in the phabricator puppet class - https://phabricator.wikimedia.org/T187127#4044767 (10Paladox) Needs package libsodium23 to also be imported into php 7.2 in thirdparty/component php7.2-cli : De... 
[21:38:43] 10Operations, 10Cassandra, 10Services (doing), 10User-Eevans: Test/upload new cassandra 2.2.6 package (wmf3) - https://phabricator.wikimedia.org/T189529#4044780 (10Eevans) p:05Triage>03Normal [21:40:32] (03CR) 10Jalexander: [C: 031] "Yeah, if we can get this out soon it would be best" [puppet] - 10https://gerrit.wikimedia.org/r/418986 (https://phabricator.wikimedia.org/T161051) (owner: 10BryanDavis) [21:47:17] (03PS5) 10Dzahn: Icinga: Add WebPageReplay Grafana performance alerts [puppet] - 10https://gerrit.wikimedia.org/r/417221 (https://phabricator.wikimedia.org/T188988) (owner: 10Phedenskog) [21:48:21] (03CR) 10Dzahn: [C: 032] Icinga: Add WebPageReplay Grafana performance alerts [puppet] - 10https://gerrit.wikimedia.org/r/417221 (https://phabricator.wikimedia.org/T188988) (owner: 10Phedenskog) [21:55:27] (03PS2) 10Andrew Bogott: beta: Enable password authn for Beta Cluster logstash [puppet] - 10https://gerrit.wikimedia.org/r/418986 (https://phabricator.wikimedia.org/T161051) (owner: 10BryanDavis) [21:56:11] (03CR) 10Andrew Bogott: [C: 032] beta: Enable password authn for Beta Cluster logstash [puppet] - 10https://gerrit.wikimedia.org/r/418986 (https://phabricator.wikimedia.org/T161051) (owner: 10BryanDavis) [22:04:01] (03PS1) 10Ottomata: Blacklist InputDeviceDynamics schema from Hive refine [puppet] - 10https://gerrit.wikimedia.org/r/419076 [22:04:10] Hi ops! [22:04:27] Do you know who's responsible for the ExtensionDistributor on mw.org? [22:04:36] It's giving 404s for all the download links [22:04:38] e.g. [22:04:39] https://www.mediawiki.org/wiki/Special:ExtensionDistributor?extdistname=CentralNotice&extdistversion=master [22:04:55] (03CR) 10Ottomata: [C: 032] Blacklist InputDeviceDynamics schema from Hive refine [puppet] - 10https://gerrit.wikimedia.org/r/419076 (owner: 10Ottomata) [22:05:37] ejegg: can you try again? 
I just clicked and gives me a download link [22:07:01] works for me too [22:07:27] Platonides: fixed that JS thing already on Ext:WikiLovesMonuments? :P [22:07:52] I could try but I'm not sure if creating a JS file and moving that there would be enough [22:08:01] Hauskatze: ah, interesting, it's working for me now [22:08:03] like ext.wlm.js [22:08:18] ejegg: blame the [[:es:Wikipedia:Cojuelo]] [22:08:22] (03PS1) 10MaxSem: Disable ArticleCreationWorkflow, ACTRIAL ends on the 14th [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419077 (https://phabricator.wikimedia.org/T186570) [22:08:29] I guess there's a delay getting the snapshot made [22:13:45] (03PS14) 10Paladox: Phabricator: Support php 7.2 under stretch [puppet] - 10https://gerrit.wikimedia.org/r/410245 (https://phabricator.wikimedia.org/T182832) [22:17:21] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Someday): Add support for stretch in the phabricator puppet class - https://phabricator.wikimedia.org/T187127#4044877 (10Paladox) We will also need to import php-apcu and php-mailparse from there too. [22:31:48] 10Operations, 10Domains, 10Traffic, 10WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4044914 (10Dzahn) I heard that repos now exist. Could you update the ticket with the repo names please? [22:35:39] 10Operations, 10Domains, 10Traffic, 10WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4044925 (10Volker_E) @Dzahn The repos are: https://gerrit.wikimedia.org/r/#/projects/design/landing-page for the root https://desi... 
[22:43:12] (03PS15) 10Paladox: Phabricator: Support php 7.2 under stretch [puppet] - 10https://gerrit.wikimedia.org/r/410245 (https://phabricator.wikimedia.org/T182832) [22:48:32] (03PS4) 10Dzahn: microsites::design: enable cloning from 2 new repos [puppet] - 10https://gerrit.wikimedia.org/r/415748 (https://phabricator.wikimedia.org/T185282) [22:49:12] (03CR) 10Dzahn: "ready to go - except pending security review https://phabricator.wikimedia.org/T188698" [puppet] - 10https://gerrit.wikimedia.org/r/415748 (https://phabricator.wikimedia.org/T185282) (owner: 10Dzahn) [22:50:43] (03PS4) 10Dzahn: Gerrit: Remove quotes around cookiePath [puppet] - 10https://gerrit.wikimedia.org/r/418990 (owner: 10Paladox) [22:56:34] 10Operations, 10Domains, 10Traffic, 10WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4044983 (10Volker_E) [22:57:12] 10Operations, 10Domains, 10Traffic, 10WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3911827 (10Volker_E) [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for Evening SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180312T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:03:56] (03CR) 10Dzahn: [C: 032] Gerrit: Remove quotes around cookiePath [puppet] - 10https://gerrit.wikimedia.org/r/418990 (owner: 10Paladox) [23:04:00] thanks :) [23:08:59] 10Operations, 10ops-eqiad, 10Discovery, 10Wikidata, and 3 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#4045049 (10RobH) Ok, I misparsed all of that. So next steps: 1) Chris traces out and sees what is connected to ge-4/0/18. It somehow has the same IP address as wdqs1004 and is ju... 
[23:12:05] (PS1) Dzahn: gerrit: skip gerrit process monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/419080 (https://phabricator.wikimedia.org/T176532)
[23:12:35] (CR) jerkins-bot: [V: -1] gerrit: skip gerrit process monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/419080 (https://phabricator.wikimedia.org/T176532) (owner: Dzahn)
[23:13:17] (CR) Chad: "Long term I disagree with this, but I understand what's the motivation now." [puppet] - https://gerrit.wikimedia.org/r/419080 (https://phabricator.wikimedia.org/T176532) (owner: Dzahn)
[23:13:33] (PS2) Dzahn: gerrit: skip gerrit process monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/419080 (https://phabricator.wikimedia.org/T176532)
[23:18:07] (CR) Dzahn: "i should probably add comments that this is to be reverted once gerrit2001 actually has a DB" [puppet] - https://gerrit.wikimedia.org/r/419080 (https://phabricator.wikimedia.org/T176532) (owner: Dzahn)
[23:26:33] Operations, Goal, HHVM: Complete the use of HHVM over Zend PHP on the Wikimedia cluster - https://phabricator.wikimedia.org/T86081#4045106 (Andrew)
[23:27:25] Operations, Cloud-Services, Cloud-VPS: Silver anomalies - https://phabricator.wikimedia.org/T151486#4045111 (Andrew)
[23:27:28] Operations, cloud-services-team: silver: / partition low on space - https://phabricator.wikimedia.org/T151493#4045108 (Andrew) Open>Resolved a:Andrew This should be moot as Silver is no longer doing important things.
[23:28:09] Operations, Cloud-Services, Cloud-VPS: Silver anomalies - https://phabricator.wikimedia.org/T151486#2818993 (Andrew)
[23:28:12] Operations, Cloud-Services, Cloud-VPS: silver: /dev/md2 mounted twice - https://phabricator.wikimedia.org/T151489#4045118 (Andrew) Open>declined Soon silver will be decommissioned.
[23:28:26] Operations, Cloud-Services, Cloud-VPS: Silver anomalies - https://phabricator.wikimedia.org/T151486#2818993 (Andrew) Open>declined
[23:30:03] (PS1) Andrew Bogott: remove outdated references to wikitech on silver [puppet] - https://gerrit.wikimedia.org/r/419082 (https://phabricator.wikimedia.org/T168559)
[23:30:36] PROBLEM - puppet last run on db1094 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:30:36] PROBLEM - puppet last run on logstash1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:30:57] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:31:06] PROBLEM - puppet last run on analytics1065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:31:07] PROBLEM - puppet last run on db1072 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:31:17] PROBLEM - puppet last run on labvirt1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:31:17] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:31:36] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:31:46] PROBLEM - puppet last run on cp4025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:32:36] PROBLEM - puppet last run on mw1312 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:32:36] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:32:39] (PS1) Gergő Tisza: Enable Wikidata description override on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/419083 (https://phabricator.wikimedia.org/T184000)
[23:32:46] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:32:47] PROBLEM - puppet last run on rdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:33:16] PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:33:48] (CR) Gergő Tisza: "> Patch Set 1: Code-Review-1" [mediawiki-config] - https://gerrit.wikimedia.org/r/418843 (https://phabricator.wikimedia.org/T184000) (owner: Gergő Tisza)
[23:33:57] PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:33:57] PROBLEM - puppet last run on restbase1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:34:00] (PS2) Gergő Tisza: Enable Wikidata description override on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/418843 (https://phabricator.wikimedia.org/T184000)
[23:34:16] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:34:16] PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:34:36] PROBLEM - puppet last run on boron is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:35:02] Operations, Cloud-Services, Stashbot: Make morebots run on a production host - https://phabricator.wikimedia.org/T94638#4045136 (Andrew) Open>Invalid morebots doesn't run anymore, cleaning this up.
[23:36:20] (PS2) Andrew Bogott: remove outdated references to wikitech on silver [puppet] - https://gerrit.wikimedia.org/r/419082 (https://phabricator.wikimedia.org/T168559)
[23:36:49] (CR) Andrew Bogott: [C: 2] remove outdated references to wikitech on silver [puppet] - https://gerrit.wikimedia.org/r/419082 (https://phabricator.wikimedia.org/T168559) (owner: Andrew Bogott)
[23:38:36] (CR) BBlack: varnishslowlog: filter on all timestamps (1 comment) [puppet] - https://gerrit.wikimedia.org/r/418580 (owner: Ema)
[23:40:30] Operations, ops-eqiad, Discovery, Wikidata, and 3 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#3994382 (Platonides) Well, if the server itself is needed, it will be doing its work with a different IP address than the one of wdqs1004, since it would have been suffering the s...
[23:40:33] (PS3) Dzahn: gerrit: skip gerrit process monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/419080 (https://phabricator.wikimedia.org/T176532)
[23:41:16] RECOVERY - puppet last run on labvirt1012 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[23:41:36] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:42:56] (PS4) Dzahn: gerrit: skip gerrit process monitoring if on slave [puppet] - https://gerrit.wikimedia.org/r/419080 (https://phabricator.wikimedia.org/T176532)
[23:44:13] (PS3) Andrew Bogott: Stop forcing php5 in `mwscript` [puppet] - https://gerrit.wikimedia.org/r/358896 (https://phabricator.wikimedia.org/T146285) (owner: Chad)
[23:44:23] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/10414/" [puppet] - https://gerrit.wikimedia.org/r/419080 (https://phabricator.wikimedia.org/T176532) (owner: Dzahn)
[23:45:15] (CR) Andrew Bogott: [C: 2] Stop forcing php5 in `mwscript` [puppet] - https://gerrit.wikimedia.org/r/358896 (https://phabricator.wikimedia.org/T146285) (owner: Chad)
[23:45:23] (PS4) Andrew Bogott: Stop forcing php5 in `mwscript` [puppet] - https://gerrit.wikimedia.org/r/358896 (https://phabricator.wikimedia.org/T146285) (owner: Chad)
[23:45:53] (PS5) BBlack: varnishslowlog: filter on all timestamps [puppet] - https://gerrit.wikimedia.org/r/418580 (https://phabricator.wikimedia.org/T181315) (owner: Ema)
[23:45:55] (PS5) BBlack: varnishslowlog: add Backend-Timing D=, in seconds [puppet] - https://gerrit.wikimedia.org/r/418603 (https://phabricator.wikimedia.org/T131894) (owner: Ema)
[23:46:13] (CR) BBlack: varnishslowlog: add Backend-Timing D=, in seconds (1 comment) [puppet] - https://gerrit.wikimedia.org/r/418603 (https://phabricator.wikimedia.org/T131894) (owner: Ema)
[23:47:58] (PS6) BBlack: varnishslowlog: add Backend-Timing D=, in seconds [puppet] - https://gerrit.wikimedia.org/r/418603 (https://phabricator.wikimedia.org/T131894) (owner: Ema)
[23:51:03] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[23:52:28] Operations, DBA, Gerrit, Patch-For-Review, Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532#4045167 (Dzahn) p:Low>Normal We still want this just as before. We were just as...
[23:52:50] (PS1) Dzahn: base/icinga: add Hiera override to skip systemd monitoring [puppet] - https://gerrit.wikimedia.org/r/419084 (https://phabricator.wikimedia.org/T176532)
[23:53:49] (CR) jerkins-bot: [V: -1] base/icinga: add Hiera override to skip systemd monitoring [puppet] - https://gerrit.wikimedia.org/r/419084 (https://phabricator.wikimedia.org/T176532) (owner: Dzahn)
[23:55:10] (PS1) Dzahn: gerrit: skip systemd monitoring on gerrit2001 [puppet] - https://gerrit.wikimedia.org/r/419086 (https://phabricator.wikimedia.org/T176532)
[23:58:19] RECOVERY - puppet last run on conf1003 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[23:59:00] RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[23:59:00] RECOVERY - puppet last run on restbase1013 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[23:59:02] Operations, DNS, Mail, Traffic, Patch-For-Review: Outbound mail from Greenhouse is broken - https://phabricator.wikimedia.org/T189065#4045191 (tstarling) So we need a Greenhouse admin to go to Configure > Email Settings, then enter "careers.wikimedia.org" for the domain and click "Register"....
[23:59:19] RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[23:59:19] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[23:59:39] RECOVERY - puppet last run on boron is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures