[00:00:06] (PS1) Dereckson: maintenance: provision /etc/wgetrc [puppet] - https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661)
[00:01:22] (CR) jerkins-bot: [V: -1] maintenance: provision /etc/wgetrc [puppet] - https://gerrit.wikimedia.org/r/341264 (https://phabricator.wikimedia.org/T159661) (owner: Dereckson)
[00:02:20] Operations, Commons, Patch-For-Review: Improve Terbium userland to process server side uploads - https://phabricator.wikimedia.org/T159661#3074318 (Dereckson) p:Triage>Normal
[00:12:19] Dereckson: not all files were uploaded
[00:15:04] matanya: yes, there are some dupes: https://phabricator.wikimedia.org/T159650#3074332
[00:15:19] +VERED_SHEFER.jpg already exists as Vered_Shefer_(172144664).jpg, skipping
[00:15:28] ah, thanks. Dereckson i have another batch, can it be done as well ?
[00:15:37] (added to the list)
[00:15:37] sure
[00:15:48] adding the list to the ticket
[00:21:34] Dereckson: updated, txt files already ready there
[00:28:58] Dereckson: CHALED KABUB is missing from the first import for some reason
[00:29:51] CHALED_KABUB.jpg is the file name
[00:30:06] ok
[00:32:54] same for YIGAL_MERSEL.jpg
[00:37:19] PROBLEM - puppet last run on elastic1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:44:11] MOSHE_SOBEL.jpg is missing as well
[00:48:35] and ISSAIYHO_SCHNELLER.jpg
[01:05:19] RECOVERY - puppet last run on elastic1033 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[01:17:13] Dereckson: thanks CHALED KABUB.jpg is still missing
[01:17:20] should i do manually ?
[01:19:19] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational
[01:22:19] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:29:39] matanya: for one, yes, please
[01:35:17] (PS1) BearND: Limit FeaturedFeed on dewiki to last seven days [mediawiki-config] - https://gerrit.wikimedia.org/r/341267 (https://phabricator.wikimedia.org/T159664)
[01:38:47] (PS2) BearND: Limit FeaturedFeed on dewiki to last seven days [mediawiki-config] - https://gerrit.wikimedia.org/r/341267 (https://phabricator.wikimedia.org/T159664)
[02:18:39] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4200086 keys, up 125 days 17 hours - replication_delay is 49
[02:19:05] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.14) (duration: 07m 15s)
[02:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:24:24] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Mar 6 02:24:24 UTC 2017 (duration 5m 19s)
[02:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:28:39] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 649 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4200086 keys, up 125 days 17 hours - replication_delay is 649
[03:05:09] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 1807.545581 Seconds
[03:05:09] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 1807.582256 Seconds
[03:05:09] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 1812.289991 Seconds
[03:06:09] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds
[03:06:09] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[03:06:10] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 5.401882 Seconds
[03:16:39] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4198740 keys, up 125 days 18 hours - replication_delay is 0
[03:18:29] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4198276 keys, up 125 days 18 hours - replication_delay is 36
[03:32:09] PROBLEM - puppet last run on wtp1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:00:09] RECOVERY - puppet last run on wtp1023 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[04:06:09] PROBLEM - puppet last run on mw1164 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:18:19] PROBLEM - HP RAID on db2044 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:11 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:12, Controller, Battery/Capacitor
[04:18:21] ACKNOWLEDGEMENT - HP RAID on db2044 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:11 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:12, Controller, Battery/Capacitor nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T159665
[04:18:29] Operations, ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T159665#3074430 (ops-monitoring-bot)
[04:27:48] Operations, MediaWiki-Configuration, Performance-Team, Services (watching), and 5 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3074434 (tstarling) >>! In T156924#3072673, @Krinkle wrote: > Just so I understand, you're pro...
[04:33:45] Operations, MediaWiki-Configuration, Performance-Team, Services (watching), and 5 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3074436 (tstarling) However, I am not trying to block this either way. I think either approach...
[04:35:09] RECOVERY - puppet last run on mw1164 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:42:43] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2046 [mediawiki-config] - https://gerrit.wikimedia.org/r/341132 (https://phabricator.wikimedia.org/T159414) (owner: Marostegui)
[06:44:14] (Merged) jenkins-bot: db-codfw.php: Depool db2046 [mediawiki-config] - https://gerrit.wikimedia.org/r/341132 (https://phabricator.wikimedia.org/T159414) (owner: Marostegui)
[06:44:28] (CR) jenkins-bot: db-codfw.php: Depool db2046 [mediawiki-config] - https://gerrit.wikimedia.org/r/341132 (https://phabricator.wikimedia.org/T159414) (owner: Marostegui)
[06:44:30] Operations, ops-codfw, DBA: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T159665#3074464 (Marostegui)
[06:45:15] Operations, ops-codfw, DBA: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T159665#3074430 (Marostegui) a:Papaul Hi @Papaul - please change the disk once you have time for it! Thanks!
[06:46:32] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2046 - T159414 (duration: 00m 51s)
[06:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:38] T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414
[06:47:29] Operations, ops-codfw, DBA: Predictive disk failure on db2048 - https://phabricator.wikimedia.org/T159666#3074469 (Marostegui)
[06:55:41] Operations, MediaWiki-Configuration, Performance-Team, Services (watching), and 5 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3074484 (Joe) >>! In T156924#3072056, @tstarling wrote: >>>! In T156924#3070042, @Joe wrote: >...
[06:59:18] Operations, MediaWiki-Configuration, Performance-Team, Services (watching), and 5 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3074485 (Joe) As a general comment on the rest of the thread: we don't plan to store more th...
[06:59:56] !log Deploy ALTER table on db2046 (s6) for the revision table - T159414
[07:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:03] T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414
[07:05:16] Operations, MediaWiki-Configuration, Performance-Team, Services (watching), and 5 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3074488 (Joe) >>! In T156924#3072673, @Krinkle wrote: > and caching in APC sounds like it woul...
[07:11:43] Operations, MediaWiki-Configuration, Performance-Team, Services (watching), and 5 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3074491 (Joe) >>! In T156924#3074434, @tstarling wrote: > > I don't think it will be possible...
[07:22:04] !log Resume pt-table-checksum on plwiki (s2) - T154485
[07:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:09] T154485: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 - https://phabricator.wikimedia.org/T154485
[07:36:20] marostegui: o/ :D :D
[07:36:35] elukey: you missed the morning alters? :)
[07:53:24] :D
[08:33:01] Operations, HHVM: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3074551 (MoritzMuehlenhoff) The test failure is benign; it tests a new feature introduced into the simple JSON parser unconditionally of whether json-c is used or not. Since Debian uses json-c for license reasons,...
[08:48:45] Operations, ops-eqiad, DBA: db1047 BBU RAID issues (was: Investigate db1047 replication lag) - https://phabricator.wikimedia.org/T159266#3074566 (Marostegui) Update - @Joe is kindly helping and we have seen a few issues which suggests that the storage itself might be having issues: ``` [52078943.9540...
[08:48:57] gah
[08:48:58] wrong ticket
[08:49:33] Operations, DBA, Patch-For-Review: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768#3074568 (Marostegui) Update - @Joe is kindly helping and we have seen a few issues which suggests that the storage itself might be having issues: ``` [52078943.954044] ata1.00: BMD...
[08:54:39] PROBLEM - puppet last run on ms-fe3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:09:13] !log killing stuck tilerator notification on maps-test2001 - T145534
[09:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:19] T145534: maps - tilerator notification seems stuck on sorting files - https://phabricator.wikimedia.org/T145534
[09:11:37] (PS1) Muehlenhoff: Add two more accounts with LDAP NDA access [puppet] - https://gerrit.wikimedia.org/r/341280
[09:11:50] (CR) ArielGlenn: [C: 2] Move default config into a file [dumps] - https://gerrit.wikimedia.org/r/43156 (owner: Awight)
[09:12:06] (CR) ArielGlenn: [C: 2] use single config object for all conf setting lookups [dumps] - https://gerrit.wikimedia.org/r/340759 (owner: ArielGlenn)
[09:12:47] (PS1) Alexandros Kosiaris: Update akosiaris dot files [puppet] - https://gerrit.wikimedia.org/r/341281
[09:14:34] Operations, DBA, Patch-For-Review: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768#3074604 (Marostegui) For the record: The above disk broken looks sda, which is not part of the baculads volume but `/`. Although mdstat doesn't see it broken. ``` root@helium:/var/...
[09:14:53] !log ariel@tin Started deploy [dumps/dumps@04794df]: move default config into a file and clean up
[09:14:55] !log ariel@tin Finished deploy [dumps/dumps@04794df]: move default config into a file and clean up (duration: 00m 02s)
[09:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:39] PROBLEM - puppet last run on mw1285 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:19:07] (CR) Muehlenhoff: [C: 2] Add two more accounts with LDAP NDA access [puppet] - https://gerrit.wikimedia.org/r/341280 (owner: Muehlenhoff)
[09:19:37] (PS7) Filippo Giunchedi: [WIP] prometheus: add snmp_exporter module and role [puppet] - https://gerrit.wikimedia.org/r/341005 (https://phabricator.wikimedia.org/T148541)
[09:20:22] Operations, Discovery, Wikimedia-Portals, Discovery-Portal-Sprint, and 2 others: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3074635 (Jdrewniak) > before the patch the following answers are expected: > * https://www.wikipedia.org/ -> 200 OK...
[09:22:32] <_joe_> akosiaris: welcome back :)
[09:22:39] RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[09:22:48] hello
[09:25:18] (PS8) Filippo Giunchedi: [WIP] prometheus: add snmp_exporter module and role [puppet] - https://gerrit.wikimedia.org/r/341005 (https://phabricator.wikimedia.org/T148541)
[09:28:23] moritzm: FYI looks like there are unmerged changes in puppetmaster
[09:28:32] upps, merging
[09:29:15] (PS9) Filippo Giunchedi: [WIP] prometheus: add snmp_exporter module and role [puppet] - https://gerrit.wikimedia.org/r/341005 (https://phabricator.wikimedia.org/T148541)
[09:31:34] Operations, MediaWiki-JobQueue, Wikidata: Job queue rising to nearly 3 million jobs - https://phabricator.wikimedia.org/T159618#3073330 (Lydia_Pintscher) Anything else we need to do here?
[09:33:08] Operations, MediaWiki-JobQueue, Wikidata: Job queue rising to nearly 3 million jobs - https://phabricator.wikimedia.org/T159618#3074648 (Joe) @Lydia_Pintscher not really, I'm monitoring the jobqueue and it's constantly decreasing in size. We should be ok.
[09:33:27] (PS1) Muehlenhoff: Change email address for Wes Moran [puppet] - https://gerrit.wikimedia.org/r/341283
[09:33:52] Operations, MediaWiki-JobQueue, Wikidata: Job queue rising to nearly 3 million jobs - https://phabricator.wikimedia.org/T159618#3074652 (Lydia_Pintscher) Cool. Thanks!
[09:38:46] Operations, Discovery, Wikimedia-Portals, Discovery-Portal-Sprint, and 2 others: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3074691 (Jdrewniak) oh weird I guess https://www.wikipedia.org/portal/wikipedia.org/assets/js/index-non-existing.js...
[09:41:40] (PS10) Filippo Giunchedi: [WIP] prometheus: add snmp_exporter module and role [puppet] - https://gerrit.wikimedia.org/r/341005 (https://phabricator.wikimedia.org/T148541)
[09:43:09] (PS2) Alexandros Kosiaris: Update akosiaris dot files [puppet] - https://gerrit.wikimedia.org/r/341281
[09:43:15] (CR) Alexandros Kosiaris: [V: 2 C: 2] Update akosiaris dot files [puppet] - https://gerrit.wikimedia.org/r/341281 (owner: Alexandros Kosiaris)
[09:45:39] RECOVERY - puppet last run on mw1285 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[09:46:31] !log postgresql upgrade on maps-test* (postgresql-9.4 postgresql-9.4-postgis-2.3 postgresql-9.4-postgis-2.3-scripts postgresql-client-9.4 postgresql-client-common postgresql-common postgresql-contrib-9.4)
[09:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:09] PROBLEM - puppet last run on elastic1046 is CRITICAL: CRITICAL: Puppet has 23 failures. Last run 2 minutes ago with 23 failures. Failed resources (up to 3 shown): File[/home/akosiaris/.vim/indent/puppet.vim],File[/home/akosiaris/.vim/bundle/solarized/colors/solarized.vim],File[/home/akosiaris/.vim/ftplugin/javascript.vim],File[/home/akosiaris/.vim/ftplugin/puppet.vim]
[09:47:19] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 2 minutes ago with 4 failures. Failed resources (up to 3 shown): File[/home/akosiaris/.vim/ftplugin/ruby.vim],File[/home/akosiaris/.vim/ftplugin/puppet_tab.vim],File[/home/akosiaris/.dir_colors/dircolors],File[/home/akosiaris/.vim/autoload/pathogen.vim]
[09:47:19] PROBLEM - puppet last run on db1050 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/home/akosiaris/.vim/bundle/solarized/colors/solarized.vim]
[09:47:19] PROBLEM - puppet last run on thumbor1002 is CRITICAL: CRITICAL: Puppet has 15 failures. Last run 2 minutes ago with 15 failures. Failed resources (up to 3 shown): File[/home/akosiaris/.vim/indent/puppet.vim],File[/home/akosiaris/.vim/bundle/solarized/colors/solarized.vim],File[/home/akosiaris/.vim/ftplugin/puppet_tab.vim],File[/home/akosiaris/.vim/ftplugin/puppet.vim]
[09:47:30] akosiaris: welcome back!
[09:47:34] welcome back akosiaris XD
[09:47:39] PROBLEM - puppet last run on rdb2005 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 2 minutes ago with 4 failures. Failed resources (up to 3 shown): File[/home/akosiaris/.vim/ftplugin/ruby.vim],File[/home/akosiaris/.vim/ftplugin/puppet.vim],File[/home/akosiaris/.vim/ftplugin/python.vim],File[/home/akosiaris/.dir_colors/dircolors]
[09:47:44] gehel: heh indeed...
[09:48:06] :-)
[09:48:15] first day is always hard...
[09:48:47] yup
[09:49:06] although I must point out it's slightly easier than when getting back from vacation
[09:49:25] maybe cause I did not really get any rest and relaxation
[09:49:32] work can be so relaxing in some contexts...
[09:49:37] yes!
[09:49:49] Operations: Superfluous rsyncd on ruthenium - https://phabricator.wikimedia.org/T159676#3074737 (MoritzMuehlenhoff)
[09:51:48] (PS1) ArielGlenn: fix uninitialized var bug for retry of broken runs under rare conditions [dumps] - https://gerrit.wikimedia.org/r/341286
[09:52:07] so.. bast3001.. dead or not ?
[09:52:33] akosiaris: dead! 3002 is the new thing
[09:52:52] ok.. and why is icinga still happily reporting it as up and SSH working ?
[09:52:59] * akosiaris puzzled
[09:53:29] that I don't know, maybe not yet off
[09:53:32] akosiaris: almost dead, it will be decom by m*tante in the next days
[09:53:42] I have to go, bbl
[09:54:39] PROBLEM - puppet last run on cp3045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:56:36] I think it's in the decom queue for the dc folks, is my understanding https://phabricator.wikimedia.org/T159480
[09:58:19] they've been pretty clear about wanting to own that process once it gets to the "non-interruptable steps" as it says on the ticket (see server lifecycle page about which steps get done by who)
[10:04:34] (CR) ArielGlenn: [C: 2] fix uninitialized var bug for retry of broken runs under rare conditions [dumps] - https://gerrit.wikimedia.org/r/341286 (owner: ArielGlenn)
[10:06:19] !log ariel@tin Started deploy [dumps/dumps@8521be0]: fix: retries of broken runs could except on uninited var
[10:06:21] !log ariel@tin Finished deploy [dumps/dumps@8521be0]: fix: retries of broken runs could except on uninited var (duration: 00m 01s)
[10:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:20] !log postgresql upgrade on maps* (postgresql-9.4 postgresql-9.4-postgis-2.3 postgresql-9.4-postgis-2.3-scripts postgresql-client-9.4 postgresql-client-common postgresql-common postgresql-contrib-9.4)
[10:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:19] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[10:15:09] RECOVERY - puppet last run on elastic1046 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[10:15:19] RECOVERY - puppet last run on db1050 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[10:15:19] RECOVERY - puppet last run on thumbor1002 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[10:15:39] RECOVERY - puppet last run on rdb2005 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[10:22:39] RECOVERY - puppet last run on cp3045 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[10:23:09] RECOVERY - puppet last run on analytics1028 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[10:24:45] !log (shamefully) replaced /etc/init.d/hadoop-hdfs-datanode script with "exit 0" to prevent the HDFS datanode daemon to start on analytics1028 (broken disk) and leave the rest running (puppet included) - T159632
[10:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:51] T159632: Analytics1028 hdfs daemon died because of disk errors - https://phabricator.wikimedia.org/T159632
[10:29:46] (PS2) Marostegui: production.my.cnf: Enable gtid_domaid_id [puppet] - https://gerrit.wikimedia.org/r/340130 (https://phabricator.wikimedia.org/T149418)
[10:30:28] (PS4) Gehel: relforge: use experimental apt repo to have access to elasticsearch 5 [puppet] - https://gerrit.wikimedia.org/r/340500 (https://phabricator.wikimedia.org/T159168)
[10:31:50] (CR) Gehel: [C: 2] relforge: use experimental apt repo to have access to elasticsearch 5 [puppet] - https://gerrit.wikimedia.org/r/340500 (https://phabricator.wikimedia.org/T159168) (owner: Gehel)
[10:32:31] (PS2) Elukey: Allow analytics1040 to be reimaged with Debian Jessie [puppet] - https://gerrit.wikimedia.org/r/340980 (https://phabricator.wikimedia.org/T159530)
[10:36:02] !log upgrade to elasticsearch 5.2.2 on relforge cluster - T156150
[10:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:07] T156150: Install ES 5.x to relforge100[12] - https://phabricator.wikimedia.org/T156150
[10:36:17] (PS1) Muehlenhoff: Add ferm rules for ruthenium [puppet] - https://gerrit.wikimedia.org/r/341290
[10:38:34] Operations: etcd switchover/enhancements - https://phabricator.wikimedia.org/T159687#3074977 (Joe)
[10:41:30] (CR) Gehel: [C: -1] "the swift-repository-plugin should be removed for elasticsearch 5.x" [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/340977 (owner: DCausse)
[10:43:42] (PS2) DCausse: Upgrade to elastic 5.2.2 [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/340977
[10:46:07] !log upgrading apache on mediawiki servers in codfw
[10:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:34] (PS2) ArielGlenn: add api job handler, config file in yaml, siteinfo props jobs [dumps] - https://gerrit.wikimedia.org/r/338899 (https://phabricator.wikimedia.org/T38178)
[10:46:55] (CR) jerkins-bot: [V: -1] add api job handler, config file in yaml, siteinfo props jobs [dumps] - https://gerrit.wikimedia.org/r/338899 (https://phabricator.wikimedia.org/T38178) (owner: ArielGlenn)
[10:47:09] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:50:19] (PS3) ArielGlenn: add api job handler, config file in yaml, siteinfo props jobs [dumps] - https://gerrit.wikimedia.org/r/338899 (https://phabricator.wikimedia.org/T38178)
[10:50:59] moritzm: I haven't seen a regression of 400s/50xs on mw1261 for the moment
[10:51:06] I am checking access logs but it seems fine
[10:51:45] error logs are fine
[10:52:04] I am checking 1262 just for confirmation but the new code looks good
[10:53:03] Operations: etcd switchover/enhancements - https://phabricator.wikimedia.org/T159687#3075014 (Joe) Just to give some context: it might be possible to try to have a true multi-dc cluster for etcd, but that will need: - N machines in eqiad - N machines in codfw - 1 or 2 tiebreakers, probably in ULSFO, for acc...
[10:54:09] (CR) Elukey: [C: 2] Allow analytics1040 to be reimaged with Debian Jessie [puppet] - https://gerrit.wikimedia.org/r/340980 (https://phabricator.wikimedia.org/T159530) (owner: Elukey)
[10:56:09] elukey: yeah, agreed. going ahead with codfw right now and will move on to eqiad later in the day (at least partly)
[10:57:06] super
[10:59:06] (PS1) Muehlenhoff: role::analytics_cluster::hadoop::standby: Enable base::firewall in the role [puppet] - https://gerrit.wikimedia.org/r/341292
[11:00:34] (PS6) Giuseppe Lavagetto: mediawiki-cache-warmup: Remove unused var, reduce concurrency, log slowest-5 [puppet] - https://gerrit.wikimedia.org/r/340539 (https://phabricator.wikimedia.org/T156922) (owner: Krinkle)
[11:02:37] (CR) Giuseppe Lavagetto: [C: 2] mediawiki-cache-warmup: Remove unused var, reduce concurrency, log slowest-5 [puppet] - https://gerrit.wikimedia.org/r/340539 (https://phabricator.wikimedia.org/T156922) (owner: Krinkle)
[11:04:24] Operations, Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Reimage a Trusty Hadoop worker to Debian jessie - https://phabricator.wikimedia.org/T159530#3070663 (ops-monitoring-bot) Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['analytics1040.eqiad....
[11:05:07] !log reimage the first Hadoop worker node (an1040) to Debian Jessie
[11:05:12] \o/
[11:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:31] \o/
[11:14:29] * _joe_ goes to file the "upgrade hadoop to stretch" task and files it under technical debt
[11:15:07] Operations, netops: netops: switch all subnets to use install1002/2002 as DHCP - https://phabricator.wikimedia.org/T156109#3075079 (akosiaris) >>! In T156109#2995356, @Dzahn wrote: > I realize this might be on your last day before you are away for a while, please feel free to put up for grabs and i'll as...
[11:16:09] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [11:18:55] lol [11:22:53] (03PS1) 10Addshore: elasticsearch requires $plugins_dir to exist [puppet] - 10https://gerrit.wikimedia.org/r/341295 [11:24:45] (03PS2) 10Addshore: elasticsearch requires $plugins_dir to exist [puppet] - 10https://gerrit.wikimedia.org/r/341295 [11:25:53] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch requires $plugins_dir to exist [puppet] - 10https://gerrit.wikimedia.org/r/341295 (owner: 10Addshore) [11:26:09] PROBLEM - puppet last run on db1072 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:26:38] (03PS1) 10Giuseppe Lavagetto: Add --strip [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/341296 [11:30:13] !log upgrading apache on planet.wikimedia.org [11:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:20] !log upgrading apache on krypton [11:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:42] (03PS3) 10Gehel: Upgrade to elastic 5.2.2 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/340977 (owner: 10DCausse) [11:42:07] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review: Reimage a Trusty Hadoop worker to Debian jessie - https://phabricator.wikimedia.org/T159530#3075096 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1040.eqiad.wmnet'] ``` and were **ALL** successful. [11:42:24] (03PS1) 10Addshore: elasticsearch init $data_dir creation requires installed package [puppet] - 10https://gerrit.wikimedia.org/r/341297 [11:42:42] (03CR) 10Gehel: "I would have though that link target was autorequired. It seems I was wrong..." 
[puppet] - 10https://gerrit.wikimedia.org/r/341295 (owner: 10Addshore) [11:42:59] 06Operations, 10MediaWiki-extensions-InterwikiSorting, 10Wikidata, 10Wikimedia-Extension-setup, and 3 others: Deploy InterwikiSorting extension to production - https://phabricator.wikimedia.org/T150183#3075097 (10Lydia_Pintscher) >>! In T150183#3071633, @Addshore wrote: > So, as far as I can see this is re... [11:43:40] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch init $data_dir creation requires installed package [puppet] - 10https://gerrit.wikimedia.org/r/341297 (owner: 10Addshore) [11:45:17] (03PS3) 10Gehel: elasticsearch requires $plugins_dir to exist [puppet] - 10https://gerrit.wikimedia.org/r/341295 (owner: 10Addshore) [11:45:46] (03PS4) 10Gehel: elasticsearch requires $plugins_dir to exist [puppet] - 10https://gerrit.wikimedia.org/r/341295 (owner: 10Addshore) [11:47:11] (03CR) 10Gehel: [C: 032] elasticsearch requires $plugins_dir to exist [puppet] - 10https://gerrit.wikimedia.org/r/341295 (owner: 10Addshore) [11:49:13] (03PS2) 10Gehel: elasticsearch init $data_dir creation requires installed package [puppet] - 10https://gerrit.wikimedia.org/r/341297 (owner: 10Addshore) [11:49:19] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2046" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341299 [11:49:25] !log installing imagemagick security updates [11:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:32] (03PS3) 10Gehel: elasticsearch init $data_dir creation requires installed package [puppet] - 10https://gerrit.wikimedia.org/r/341297 (owner: 10Addshore) [11:50:20] 06Operations, 15User-Joe: etcd switchover/enhancements - https://phabricator.wikimedia.org/T159687#3075108 (10Joe) p:05Triage>03Normal [11:50:29] PROBLEM - puppet last run on elastic2032 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:50:57] (03CR) 10Gehel: [C: 032] elasticsearch init $data_dir creation requires installed package [puppet] - 10https://gerrit.wikimedia.org/r/341297 (owner: 10Addshore) [11:51:09] PROBLEM - puppet last run on elastic1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:53:09] PROBLEM - puppet last run on elastic1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:53:29] elasticsearch puppet error is me, fix coming up [11:53:29] PROBLEM - puppet last run on elastic2028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:54:09] PROBLEM - puppet last run on elastic1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:54:10] PROBLEM - puppet last run on relforge1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:54:10] PROBLEM - puppet last run on logstash1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:54:29] PROBLEM - puppet last run on relforge1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:54:29] PROBLEM - puppet last run on elastic2026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:54:29] PROBLEM - puppet last run on elastic2020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:54:39] PROBLEM - puppet last run on elastic2021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:54:40] PROBLEM - puppet last run on elastic2005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:55:08] (03PS1) 10Gehel: Revert "elasticsearch requires $plugins_dir to exist" [puppet] - 10https://gerrit.wikimedia.org/r/341301 [11:55:09] RECOVERY - puppet last run on db1072 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [11:55:20] (03CR) 10Gehel: [V: 032 C: 032] Revert "elasticsearch requires $plugins_dir to exist" [puppet] - 10https://gerrit.wikimedia.org/r/341301 (owner: 10Gehel) [11:55:29] PROBLEM - puppet last run on elastic2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:55:31] gehel: interesting.... [11:55:50] addshore: obvious in retrospec... [11:56:09] PROBLEM - puppet last run on elastic1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:56:09] PROBLEM - puppet last run on elastic1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:56:14] that plugin dir is created indirectly by scap [11:56:28] actually trbuchet [11:56:29] PROBLEM - puppet last run on elastic2014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:56:53] but should the require still not be there? [11:57:09] RECOVERY - puppet last run on relforge1002 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [11:57:24] testing on labs, using role::toollabs::elasticsearch that directory is never created [11:57:39] I guess that should ensure the directory is there then? [11:57:40] not a require on the file resource, but a require on package['elasticsearch/plugins'] [11:58:19] PROBLEM - puppet last run on elastic1042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:58:29] PROBLEM - puppet last run on elastic2007 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:58:57] addshore: which is declared in the role... some cleanup is needed to move the right resources to the right place... [11:59:18] ack [11:59:45] (03PS11) 10MarcoAurelio: Rename 'technician' to 'interface-editor' on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308281 (https://phabricator.wikimedia.org/T144638) [12:02:10] addshore: are you trying to deploy elasticsearch 5? [12:02:17] or 2.x ? [12:02:24] that was with 2 [12:03:03] essentially got it works after 3 puppet runs and creating the /srv/deployment/elasticsearch/plugins dir [12:03:07] *working [12:03:15] tarrow ^^ [12:03:49] we are in the process of upgrading to 5.x, there is some puppet code to be compatible with 2.x / 5.x that I'd like to remove as soon as we have upgraded everything to 5 [12:04:27] okay, sounds like the sort of place to not bother cleaning up too much until after ;) << tarrow [12:05:00] ah [12:05:11] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review: Reimage a Trusty Hadoop worker to Debian jessie - https://phabricator.wikimedia.org/T159530#3075124 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['analytics1040.eqiad.... [12:05:20] that one will need some cleanup anyway, your fix is in fact needed, but needs to be a little bit different...
[12:06:05] (03PS17) 10Fdans: Changes to perf consumer of event logging events [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: 10Nuria) [12:06:49] I'm trying to deploy a replica of toollabs:elasticsearch so I can poke it to fix a problem I'm having with too many concurrent connections to the live cluster but hit quite a few snags [12:07:18] live cluster = live toollabs cluster not production [12:09:03] tarrow, addshore: it looks like toollabs::elasticsearch does not use any plugins, so our code to manage plugin directories just gets in the way. I should move it to the profiles that actually use plugins [12:09:40] 06Operations, 10MediaWiki-JobQueue, 10Wikidata: Job queue rising to nearly 3 million jobs - https://phabricator.wikimedia.org/T159618#3075143 (10Emijrp) >>! In T159618#3074646, @Lydia_Pintscher wrote: > Anything else we need to do here? @Joe @Lydia_Pintscher Is the Betacommand suggestion feasible? [12:16:11] (03PS1) 10Gehel: elasticsearch - move management of plugin directory symlink to role classes [puppet] - 10https://gerrit.wikimedia.org/r/341303 [12:19:29] RECOVERY - puppet last run on elastic2032 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [12:20:09] RECOVERY - puppet last run on elastic1035 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [12:21:29] RECOVERY - puppet last run on elastic2028 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [12:21:29] RECOVERY - puppet last run on elastic2020 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [12:21:39] RECOVERY - puppet last run on elastic2005 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [12:22:09] RECOVERY - puppet last run on logstash1005 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [12:22:09] RECOVERY - puppet last run on elastic1051 is OK: OK: Puppet is 
currently enabled, last run 56 seconds ago with 0 failures [12:22:29] RECOVERY - puppet last run on relforge1001 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [12:22:30] RECOVERY - puppet last run on elastic2026 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [12:22:39] RECOVERY - puppet last run on elastic2021 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [12:23:09] RECOVERY - puppet last run on elastic1023 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [12:24:29] RECOVERY - puppet last run on elastic2014 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [12:24:29] RECOVERY - puppet last run on elastic2006 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [12:25:09] RECOVERY - puppet last run on elastic1021 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [12:25:09] RECOVERY - puppet last run on elastic1020 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [12:26:19] RECOVERY - puppet last run on elastic1042 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [12:26:29] RECOVERY - puppet last run on elastic2007 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [12:38:53] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review: Reimage a Trusty Hadoop worker to Debian jessie - https://phabricator.wikimedia.org/T159530#3075210 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1040.eqiad.wmnet'] ``` and were **ALL** successful. [12:40:53] (03CR) 10Addshore: [C: 031] "+1 assuming /usr/share/elasticsearch/plugins still exists with the base puppet role, otherwise elasticsearch will still fail to load." 
[puppet] - 10https://gerrit.wikimedia.org/r/341303 (owner: 10Gehel) [12:41:45] (03CR) 10Gehel: "yes, /usr/share/elasticsearch/plugins is created by the debian package so it will always be there..." [puppet] - 10https://gerrit.wikimedia.org/r/341303 (owner: 10Gehel) [12:44:04] (03CR) 10Gehel: "puppet compiler looks happy: https://puppet-compiler.wmflabs.org/5659/" [puppet] - 10https://gerrit.wikimedia.org/r/341303 (owner: 10Gehel) [12:44:09] !log upgrading apache on graphite* [12:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:11] (03CR) 10Gehel: [C: 032] elasticsearch - move management of plugin directory symlink to role classes [puppet] - 10https://gerrit.wikimedia.org/r/341303 (owner: 10Gehel) [12:45:34] !log upgrading apache on mw1209-mw1235 [12:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:59] (03PS1) 10Gehel: elasticsearch - cosmetic: remove final / from symlink target [puppet] - 10https://gerrit.wikimedia.org/r/341309 [12:50:15] (03CR) 10Gehel: [C: 032] elasticsearch - cosmetic: remove final / from symlink target [puppet] - 10https://gerrit.wikimedia.org/r/341309 (owner: 10Gehel) [12:53:52] (03Abandoned) 10Muehlenhoff: Allow LDAP access to corp mirrors from terbium [puppet] - 10https://gerrit.wikimedia.org/r/340119 (owner: 10Muehlenhoff) [12:55:29] (03PS1) 10Volans: Add support for batch processing [software/cumin] - 10https://gerrit.wikimedia.org/r/341310 [12:58:09] PROBLEM - puppet last run on db1095 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:58:13] (03CR) 10Volans: "I've also tested it with a draft of integration tests for the CLI that I have locally that uses Docker to spin up few instances and run Cu" [software/cumin] - 10https://gerrit.wikimedia.org/r/341310 (owner: 10Volans) [12:58:36] 06Operations, 06Labs, 07kubernetes: docker-engine pulled into our repositories only keeps the latest version - https://phabricator.wikimedia.org/T153416#3075281 (10akosiaris) Do we have a verdict on this one ? [12:59:24] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2046" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341299 (owner: 10Marostegui) [13:01:06] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2046" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341299 (owner: 10Marostegui) [13:01:15] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2046" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341299 (owner: 10Marostegui) [13:02:15] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2046 - T159414 (duration: 00m 50s) [13:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:22] T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414 [13:03:09] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=3068.00 Read Requests/Sec=394.40 Write Requests/Sec=317.70 KBytes Read/Sec=39116.80 KBytes_Written/Sec=2264.40 [13:03:25] (03PS1) 10Marostegui: db-codfw.php: Depool db2060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341312 (https://phabricator.wikimedia.org/T159414) [13:04:54] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341312 (https://phabricator.wikimedia.org/T159414) (owner: 10Marostegui) [13:06:07] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2060 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/341312 (https://phabricator.wikimedia.org/T159414) (owner: 10Marostegui) [13:07:15] (03CR) 10jenkins-bot: db-codfw.php: Depool db2060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341312 (https://phabricator.wikimedia.org/T159414) (owner: 10Marostegui) [13:07:22] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2060 - T159414 (duration: 00m 39s) [13:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:28] T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414 [13:07:30] !log Deploy ALTER table on db2060 (s6) for the revision table - T159414 [13:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:31] marostegui: I don't want to step on your toes, but I have a few things I would like to merge in mw-conf, give me a ping when your done? / if I can do them side by side you? If not I'll just put them in swat [13:12:29] RECOVERY - carbon-cache@d service on graphite2001 is OK: OK - carbon-cache@d is active [13:12:29] RECOVERY - carbon-cache@a service on graphite2001 is OK: OK - carbon-cache@a is active [13:12:29] RECOVERY - carbon-cache@b service on graphite2001 is OK: OK - carbon-cache@b is active [13:12:29] RECOVERY - carbon-local-relay service on graphite2001 is OK: OK - carbon-local-relay is active [13:12:29] RECOVERY - carbon-cache@g service on graphite2001 is OK: OK - carbon-cache@g is active [13:12:30] RECOVERY - carbon-frontend-relay service on graphite2001 is OK: OK - carbon-frontend-relay is active [13:12:30] RECOVERY - carbon-cache@c service on graphite2001 is OK: OK - carbon-cache@c is active [13:12:31] RECOVERY - carbon-cache@e service on graphite2001 is OK: OK - carbon-cache@e is active [13:12:31] RECOVERY - carbon-cache@f service on graphite2001 is OK: OK - carbon-cache@f is active [13:12:39] RECOVERY - carbon-cache@h service on graphite2001 is 
OK: OK - carbon-cache@h is active [13:16:37] godog: is it you? :) [13:17:14] or a downtime just expired? :D [13:17:19] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:17:29] RECOVERY - Check systemd state on graphite2001 is OK: OK - running: The system is fully operational [13:17:49] PROBLEM - puppet last run on graphite2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 seconds ago with 1 failures. Failed resources (up to 3 shown): File[/srv/org/wikimedia] [13:18:16] I guess the downtime [13:20:35] 06Operations, 06Labs, 07kubernetes: docker-engine pulled into our repositories only keeps the latest version - https://phabricator.wikimedia.org/T153416#3075343 (10MoritzMuehlenhoff) Some higher level plans for structuring the repos are now collected at https://phabricator.wikimedia.org/T158583, input welcome. [13:20:49] RECOVERY - puppet last run on graphite2001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [13:22:09] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=162.00 Read Requests/Sec=153.20 Write Requests/Sec=3.90 KBytes Read/Sec=3395.60 KBytes_Written/Sec=329.60 [13:23:21] volans: it was me yeah [13:23:36] !log reenable puppet on graphite2001 [13:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:49] ok, great [13:27:11] RECOVERY - puppet last run on db1095 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [13:30:11] PROBLEM - puppet last run on restbase-dev1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:32:54] jouncebot next [13:32:54] In 0 hour(s) and 27 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170306T1400) [13:33:11] zeljkof: I can do this EU swat if you would like! 
(I have 3 patches in it) [13:35:04] (03CR) 10Addshore: [C: 04-1] Update bs.wiktionary logo plus add HD version of it (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341031 (https://phabricator.wikimedia.org/T159542) (owner: 10Urbanecm) [13:35:12] Urbanecm: ^^ [13:35:51] (03CR) 10Addshore: [C: 031] Update sr.wikibooks logo plus add HD version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341027 (https://phabricator.wikimedia.org/T159534) (owner: 10Urbanecm) [13:35:58] (03CR) 10Addshore: [C: 031] Bs.wiktionary namespace changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341035 (https://phabricator.wikimedia.org/T159538) (owner: 10Urbanecm) [13:38:51] (03PS2) 10Brian Wolff: Add a CSP policy to foundationwiki to prevent privacy breach [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341259 (https://phabricator.wikimedia.org/T159386) [13:39:19] (03CR) 10Brian Wolff: "PS2: Fix source parameter name in report url" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341259 (https://phabricator.wikimedia.org/T159386) (owner: 10Brian Wolff) [13:39:38] (03PS1) 10Elukey: Fix partman recipe for Analytics Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/341318 (https://phabricator.wikimedia.org/T159530) [13:40:55] (03CR) 10Addshore: [C: 031] Change account creation throttle for idwiki to default (6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341263 (owner: 10Brian Wolff) [13:40:56] jouncebot: next [13:40:57] In 0 hour(s) and 19 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170306T1400) [13:42:08] hashar I can do swat (as 3 of them are mine) :) [13:42:41] sure :) [13:43:52] (03PS2) 10Addshore: Update bs.wiktionary logo plus add HD version of it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341031 (https://phabricator.wikimedia.org/T159542) (owner: 10Urbanecm) [13:44:58] (03PS3) 10Addshore: Update sr.wikibooks logo plus add HD version 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/341027 (https://phabricator.wikimedia.org/T159534) (owner: 10Urbanecm) [13:45:19] (03PS3) 10Addshore: Update bs.wiktionary logo plus add HD version of it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341031 (https://phabricator.wikimedia.org/T159542) (owner: 10Urbanecm) [13:45:21] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [13:45:33] (03PS4) 10Addshore: Bs.wiktionary namespace changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341035 (https://phabricator.wikimedia.org/T159538) (owner: 10Urbanecm) [13:45:58] (03PS9) 10Addshore: Enable Translation memories multi-DC support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335824 (https://phabricator.wikimedia.org/T132076) (owner: 10DCausse) [13:46:04] (03CR) 10Hashar: "Dont we want to set that on all rawHTML wikis? There are a few more:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341259 (https://phabricator.wikimedia.org/T159386) (owner: 10Brian Wolff) [13:46:13] (03PS3) 10Addshore: Add InterwikiSorting extension to prod extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341121 (https://phabricator.wikimedia.org/T150183) [13:46:13] addshore, if you are to able to start SWAT right now, it'll help me. 
[13:46:20] (03CR) 10Hashar: [C: 031] Change account creation throttle for idwiki to default (6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341263 (owner: 10Brian Wolff) [13:46:28] (03PS2) 10Addshore: Create extension1 db cluster for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341120 (https://phabricator.wikimedia.org/T156241) [13:46:39] (03PS2) 10Addshore: Add Cognate to labs extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341122 (https://phabricator.wikimedia.org/T156241) [13:46:56] (03PS2) 10Addshore: Change account creation throttle for idwiki to default (6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341263 (owner: 10Brian Wolff) [13:47:02] addshore: there is one of the logo change that is off [13:47:14] doesnt reference the proper file [13:47:15] (03PS3) 10Addshore: Add a CSP policy to foundationwiki to prevent privacy breach [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341259 (https://phabricator.wikimedia.org/T159386) (owner: 10Brian Wolff) [13:47:19] (03PS2) 10Elukey: Fix partman recipe for Analytics Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/341318 (https://phabricator.wikimedia.org/T159530) [13:47:21] hashar: I already fixed it?! [13:47:24] !!! [13:48:00] hashar: https://gerrit.wikimedia.org/r/#/c/341031/3/wmf-config/InitialiseSettings.php [13:48:13] ps1 was wrong, ps2 is fixed, ps3 is the rebase [13:48:31] thank you addshore for fixing. [13:48:32] (03CR) 10Hashar: [C: 031] Update bs.wiktionary logo plus add HD version of it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341031 (https://phabricator.wikimedia.org/T159542) (owner: 10Urbanecm) [13:48:40] Urbanecm: no problem! [13:48:46] just start :] [13:48:59] Okay, thank you. [13:49:13] RECOVERY - Etcd replication lag on conf2002 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.073 second response time [13:50:40] Urbanecm: hashar will do! 
[13:50:43] (03PS3) 10Elukey: Fix partman recipe for Analytics Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/341318 (https://phabricator.wikimedia.org/T159530) [13:50:49] Ok [13:51:25] (03CR) 10Addshore: [C: 032] Update sr.wikibooks logo plus add HD version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341027 (https://phabricator.wikimedia.org/T159534) (owner: 10Urbanecm) [13:51:29] (03CR) 10Addshore: [C: 032] Update bs.wiktionary logo plus add HD version of it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341031 (https://phabricator.wikimedia.org/T159542) (owner: 10Urbanecm) [13:51:41] Urbanecm: I'll do the 2 logos patches together [13:51:45] (03CR) 10Brian Wolff: "Re Hashar: I think in the super long term we would want something like this on all wikis (See the CSP rfc). However for this I'm specifica" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341259 (https://phabricator.wikimedia.org/T159386) (owner: 10Brian Wolff) [13:51:47] addshore, okay [13:52:11] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused [13:52:30] addshore, BTW if you can add third patch to the logos one, 339326. It isn't mine BTW but I guess it'll be easy to do for us [13:52:39] https://gerrit.wikimedia.org/r/#/c/339326/ [13:52:41] T158815 [13:52:42] T158815: Update logo for bs.wikipedia - https://phabricator.wikimedia.org/T158815 [13:53:04] Urbanecm: I'll come back to that one after everything else that is scheduled! 
[13:53:04] (03Merged) 10jenkins-bot: Update sr.wikibooks logo plus add HD version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341027 (https://phabricator.wikimedia.org/T159534) (owner: 10Urbanecm) [13:53:16] (03CR) 10jenkins-bot: Update sr.wikibooks logo plus add HD version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341027 (https://phabricator.wikimedia.org/T159534) (owner: 10Urbanecm) [13:53:17] addshore, okay [13:53:40] (03CR) 10Hashar: [C: 031] "All good to me so. Wasn't sure whether you might have missed the other use cases :]" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341259 (https://phabricator.wikimedia.org/T159386) (owner: 10Brian Wolff) [13:54:35] (03Merged) 10jenkins-bot: Update bs.wiktionary logo plus add HD version of it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341031 (https://phabricator.wikimedia.org/T159542) (owner: 10Urbanecm) [13:54:44] (03CR) 10jenkins-bot: Update bs.wiktionary logo plus add HD version of it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341031 (https://phabricator.wikimedia.org/T159542) (owner: 10Urbanecm) [13:55:06] Urbanecm: they are live on mwdebug1002 [13:56:20] addshore, seems it is ok [13:56:29] ack [13:58:45] !log addshore@tin Synchronized static/images/project-logos/: SWAT: srwikibooks & bswiktionary logos T159534 T159542 1/2 (duration: 00m 39s) [13:58:51] Urbanecm: ^^ [13:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:52] T159542: Update bs.wiktionary logo - https://phabricator.wikimedia.org/T159542 [13:58:52] T159534: Update sr.wikibooks logo - https://phabricator.wikimedia.org/T159534 [13:58:59] half way there, still have the settings to go! 
[13:59:11] RECOVERY - puppet last run on restbase-dev1003 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [13:59:21] addshore, ack [13:59:29] addshore: please do :) [13:59:47] zeljkof: {{doing}} :) [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170306T1400). Please do the needful. [14:00:04] Urbanecm, dcausse, addshore, and bawolff: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [14:00:16] woo! [14:00:17] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: srwikibooks & bswiktionary logos T159534 T159542 2/2 (duration: 00m 39s) [14:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:28] Urbanecm: logos all done! [14:00:36] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, 10Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2734568 (10Ladsgroup) Regarding GPU options. I just want to note that their drivers are propriety software and not open source (or partiall... 
[14:00:37] addshore, thank you [14:00:43] (03CR) 10Addshore: [C: 032] Bs.wiktionary namespace changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341035 (https://phabricator.wikimedia.org/T159538) (owner: 10Urbanecm) [14:02:36] (03Merged) 10jenkins-bot: Bs.wiktionary namespace changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341035 (https://phabricator.wikimedia.org/T159538) (owner: 10Urbanecm) [14:02:48] (03CR) 10jenkins-bot: Bs.wiktionary namespace changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341035 (https://phabricator.wikimedia.org/T159538) (owner: 10Urbanecm) [14:03:10] Urbanecm: the bswiktionary namespace change patch is on mwdebug1002, please check [14:03:19] Checking [14:04:26] addshore, working [14:04:33] Urbanecm: ack [14:05:13] o/ [14:05:15] syncing [14:05:24] dcausse: your up next :) [14:05:29] thanks :) [14:05:52] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:341035|Bs.wiktionary namespace changes]] T159538 (duration: 00m 40s) [14:05:58] Urbanecm: ^^ all done [14:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:59] T159538: Change bs.wiktionary sitename - https://phabricator.wikimedia.org/T159538 [14:06:03] addshore, thank you so much [14:06:12] (03CR) 10Addshore: [C: 032] Enable Translation memories multi-DC support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335824 (https://phabricator.wikimedia.org/T132076) (owner: 10DCausse) [14:06:42] that phab task could use a rename, fyi [14:07:10] Urbanecm: ^^ [14:07:18] done [14:07:21] [= [14:07:29] (03Merged) 10jenkins-bot: Enable Translation memories multi-DC support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335824 (https://phabricator.wikimedia.org/T132076) (owner: 10DCausse) [14:08:01] dcausse: It's live on mwdebug1002! 
[14:08:06] addshore: testing [14:08:50] (03CR) 10jenkins-bot: Enable Translation memories multi-DC support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335824 (https://phabricator.wikimedia.org/T132076) (owner: 10DCausse) [14:10:58] addshore: looks good so far [14:11:03] dcausse: so far ;) [14:11:27] yes.. this will affect jobrunners but hard to tell if everything will be good yet :) [14:12:13] syncing [14:12:16] ok [14:13:05] !log addshore@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:335824|Enable Translation memories multi-DC support]] T132076 1/2 (duration: 00m 50s) [14:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:11] T132076: TTMServer should support multi-dc configuration - https://phabricator.wikimedia.org/T132076 [14:13:19] dcausse: ^^ [14:13:31] logs looks good [14:14:08] addshore: thanks, will monitor and test some maint scripts [14:14:11] okay! [14:14:34] (03CR) 10Addshore: [C: 032] Add InterwikiSorting extension to prod extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341121 (https://phabricator.wikimedia.org/T150183) (owner: 10Addshore) [14:14:36] (03CR) 10Addshore: [C: 032] Create extension1 db cluster for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341120 (https://phabricator.wikimedia.org/T156241) (owner: 10Addshore) [14:14:38] (03CR) 10Addshore: [C: 032] Add Cognate to labs extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341122 (https://phabricator.wikimedia.org/T156241) (owner: 10Addshore) [14:14:56] you need some deployments to do there addshore :D [14:14:59] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:335824|Enable Translation memories multi-DC support]] T132076 2/2 (NOOP) (duration: 00m 42s) [14:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:56] (03Merged) 10jenkins-bot: Add InterwikiSorting extension to prod extension-list [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/341121 (https://phabricator.wikimedia.org/T150183) (owner: 10Addshore) [14:16:09] (03Merged) 10jenkins-bot: Create extension1 db cluster for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341120 (https://phabricator.wikimedia.org/T156241) (owner: 10Addshore) [14:16:28] (03Merged) 10jenkins-bot: Add Cognate to labs extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341122 (https://phabricator.wikimedia.org/T156241) (owner: 10Addshore) [14:16:59] (03CR) 10jenkins-bot: Add InterwikiSorting extension to prod extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341121 (https://phabricator.wikimedia.org/T150183) (owner: 10Addshore) [14:19:37] bawolff, as my 3 are for beta only, and yours touch different files I can go ahead with yours now if your realy? [14:19:39] *ready [14:19:49] yep, I'm readdy [14:19:51] *ready [14:20:01] (03CR) 10Addshore: [C: 032] Change account creation throttle for idwiki to default (6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341263 (owner: 10Brian Wolff) [14:22:20] (03Merged) 10jenkins-bot: Change account creation throttle for idwiki to default (6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341263 (owner: 10Brian Wolff) [14:22:48] bawolff: syncing [14:22:53] ok [14:23:23] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:341263|Change account creation throttle for idwiki to default (6)]] (duration: 00m 39s) [14:23:27] bawolff: ^^ [14:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:32] (03CR) 10Addshore: [C: 032] Add a CSP policy to foundationwiki to prevent privacy breach [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341259 (https://phabricator.wikimedia.org/T159386) (owner: 10Brian Wolff) [14:23:38] whee [14:25:15] *waits for jenkins* [14:25:38] addshore: zuul is busy it may take a bit [14:25:58] *waiting for jenkins intensifies* [14:26:36] (03Merged) 
10jenkins-bot: Add a CSP policy to foundationwiki to prevent privacy breach [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341259 (https://phabricator.wikimedia.org/T159386) (owner: 10Brian Wolff) [14:27:06] bawolff: it's on mwdebug1002 [14:29:58] hmm, this seems to not be working (on the bright side it also isn't breaking anything) [14:30:09] bawolff: revert? [14:30:16] volans, can you please take a look at https://gerrit.wikimedia.org/r/#/c/338950/ .. i left a comment there to your review. [14:30:36] subbu: sure [14:30:41] bah [14:30:45] ty [14:32:10] bawolff! :P [14:32:12] so revert? or? [14:32:20] (03PS1) 10Tarrow: include blank htpasswd needed for role::toollabs::elasticsearch [labs/private] - 10https://gerrit.wikimedia.org/r/341326 [14:32:43] That was an inconvenient time for network to futz [14:32:59] :D [14:33:29] I think there might be something wrong with my x-debug extension thingy in firefox [14:35:00] addshore: It works [14:35:39] (When i did it via the command line via wget. For some reason the x-wikimedia-debug extension doesn't seem to actually be sending the header :S) [14:35:56] ooooh [14:36:01] okay, I'll sync then? :) [14:36:20] yes please :) [14:36:48] syncing [14:37:16] !log addshore@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:341259|Add a CSP policy to foundationwiki to prevent privacy breach]] T159386 (duration: 00m 39s) [14:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:21] T159386: Make abusefilter on foundationwiki to prevent people accidentally violating our privacy policy - https://phabricator.wikimedia.org/T159386 [14:37:26] bawolff: ^^ [14:38:26] confirmed, works [14:38:43] bawolff: epic!
[14:39:15] !log addshore@tin Synchronized wmf-config/db-labs.php: SWAT: [[gerrit:341120|Create extension1 db cluster for beta]] T156241 BETA ONLY (duration: 00m 39s) [14:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:21] T156241: Deploy Cognate extension to beta - https://phabricator.wikimedia.org/T156241 [14:39:49] addshore: Oh i get it, the X-Wikimedia-debug extension for firefox doesn't think wikimediafoundation.org is an actual wmf project [14:39:56] oooooh [14:39:59] 06Operations, 10Dumps-Generation, 07HHVM, 13Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#3075690 (10MoritzMuehlenhoff) [14:40:01] 06Operations, 10Dumps-Generation, 07HHVM: Merge facebook/hhvm@9d2be6c30b into build of next hhvm release - https://phabricator.wikimedia.org/T143648#3075688 (10MoritzMuehlenhoff) 05Open>03Resolved That patch is part of HHVM 3.18.1, which is now available in the experimental section of apt.wikimedia.org [14:40:03] I see a PR being needed there ;) [14:41:46] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3075706 (10Marostegui) My idea would be as follows: - stop labsdb1006 -> copy `/srv/postgres` to dbstore1001 -> reimage -> copy the data back. If that works fine, repea... 
[14:42:40] !log addshore@tin Synchronized wmf-config/extension-list: [[gerrit:341121|Add InterwikiSorting extension to prod extension-list]] T150183 NOOP (duration: 00m 38s) [14:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:46] T150183: Deploy InterwikiSorting extension to production - https://phabricator.wikimedia.org/T150183 [14:44:43] !log addshore@tin Synchronized wmf-config/extension-list-labs: Remove [[gerrit:341121|InterwikiSorting]] and add [[gerrit:341122|Cognate]] to extension-list-labs T150183 T156241 BETA ONLY (duration: 00m 39s) [14:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:50] T156241: Deploy Cognate extension to beta - https://phabricator.wikimedia.org/T156241 [14:44:54] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3075715 (10jcrespo) The most important step, and why we need to copy that data away in case something goes wrong is the postgres upgrade from 9.1 (precise) to 9.4 (jessie... [14:45:07] (03CR) 10Elukey: [C: 032] Fix partman recipe for Analytics Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/341318 (https://phabricator.wikimedia.org/T159530) (owner: 10Elukey) [14:47:39] Hmm, looks like i missed thinking about third party cookies used for Special:HideBanners [14:49:19] Hmm, I'm seeing some fatals since 14:40 [14:49:23] very small number [14:49:41] https://usercontent.irccloud-cdn.com/file/IOJY1VlT/ [14:50:00] 06Operations, 06Labs: openstack instance creation sometimes takes >480s - https://phabricator.wikimedia.org/T159459#3075772 (10chasemp) cleaned up and restarted leaked instances from the small fiasco of rolling out https://gerrit.wikimedia.org/r/#/c/340986/ which due to a bug in nova requiring 20m to return ca... 
[14:50:26] addshore: it's probably nothing just a standard error that happens every now and then [14:50:27] !log labnet1001 'service nova-fullstack restart' [14:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:10] Fatal error: unknown exception [14:52:14] how helpful is that message ? :D [14:52:39] hashar: I know right.. [14:52:54] hashar: well you see it is very helpful, it tells you it is not smart enough to know why it's failing [14:53:10] addshore: is Kibana's dashboard public? [14:53:13] hashar: 14:40 [14:53:26] hashar: 14:40 would be https://gerrit.wikimedia.org/r/#/c/341120/ or https://gerrit.wikimedia.org/r/#/c/341121/ if it were me [14:55:14] hashar: https://phabricator.wikimedia.org/T112071 ? [14:55:50] digging in logs :/ [14:58:30] subbu: I'm sending a "proposal" CR so that I can run the puppet compiler against it [14:59:00] addshore: no leads. I would not worry too much [14:59:09] hmm, okay [14:59:12] addshore: looks like a large chunk is related to some db replication lag [14:59:15] and the usual log spam [14:59:50] hashar: yeh, looks like it has dropped off :) [14:59:54] !log EU SWAT done [14:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:04] !log restarting Jenkins [15:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:27] (03PS3) 10Volans: Ruthenium VisualDiff: Test w/ local Parsoid instead of prod Parsoid [puppet] - 10https://gerrit.wikimedia.org/r/338950 (owner: 10Subramanya Sastry) [15:06:37] 06Operations, 10DBA, 13Patch-For-Review: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768#3075860 (10Marostegui) I am manually executing the "predump" script on dbstore1001 on a root screen called `dumps`. At least to have a local copy of the backups until we fix the bacu...
[15:07:15] (03CR) 10Subramanya Sastry: [C: 031] Ruthenium VisualDiff: Test w/ local Parsoid instead of prod Parsoid [puppet] - 10https://gerrit.wikimedia.org/r/338950 (owner: 10Subramanya Sastry) [15:08:51] (03PS1) 10Brian Wolff: Add other WMF domains to foundationwiki CSP policy for Special:HideBanners [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341331 [15:10:08] (03CR) 10Brian Wolff: "For reference, see https://wikimediafoundation.org/wiki/Thank_You/da?country=DK&action=raw&templates=expand&ctype=text/css for the html in" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341331 (owner: 10Brian Wolff) [15:10:30] (03PS1) 10MarkTraceur: Enable 3d extension on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341332 (https://phabricator.wikimedia.org/T159717) [15:10:40] (03CR) 10jerkins-bot: [V: 04-1] Enable 3d extension on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341332 (https://phabricator.wikimedia.org/T159717) (owner: 10MarkTraceur) [15:13:56] 06Operations, 10DBA, 13Patch-For-Review: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768#3075900 (10Marostegui) >>! In T153768#3075860, @Marostegui wrote: > I am manually executing the "predump" script on dbstore1001 on a root screen called `dumps`. At least to have a lo... [15:14:26] (03CR) 10jenkins-bot: Add Cognate to labs extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341122 (https://phabricator.wikimedia.org/T156241) (owner: 10Addshore) [15:22:44] (03CR) 10DCausse: [C: 031] "lgtm, we can merge it now or just wait for the next upgrade to catch remaining errors if any." 
[software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/340947 (owner: 10Gehel) [15:35:12] (03PS1) 10Muehlenhoff: Add two additional privileged groups [puppet] - 10https://gerrit.wikimedia.org/r/341335 [15:38:00] (03PS1) 10Muehlenhoff: Remove non-existing group from jupyterhub LDAP config [puppet] - 10https://gerrit.wikimedia.org/r/341336 (https://phabricator.wikimedia.org/T129788) [15:38:22] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3076003 (10akosiaris) >>! In T157359#3075706, @Marostegui wrote: > My idea would be as follows: > > - stop labsdb1006 -> copy `/srv/postgres` to dbstore1001 -> reimage -... [15:41:19] 06Operations, 10fundraising-tech-ops, 10netops: set up firewall policies for barium, lutetium, db1025, and indium replacement servers - https://phabricator.wikimedia.org/T159336#3076013 (10Jgreen) [15:41:47] (03PS1) 10Elukey: Rework analytics-flex partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/341337 (https://phabricator.wikimedia.org/T159530) [15:43:17] 06Operations, 10fundraising-tech-ops, 10netops: set up firewall policies for barium, lutetium, db1025, and indium replacement servers - https://phabricator.wikimedia.org/T159336#3064590 (10Jgreen) Also (fundraising private repo): commit 8e403abe1e552b078d217479c9f48ed23d892380 Author: Jeff Green 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3076026 (10jcrespo) > There is also this days pg_upgrade which with --link mode which should in theory help avoid that problem, but I 've never tested it in a 9.1 => 9.4... [15:46:21] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3076028 (10akosiaris) >>! 
In T157359#3076026, @jcrespo wrote: >> There is also this days pg_upgrade which with --link mode which should in theory help avoid that problem... [15:49:17] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3076041 (10jcrespo) Hey, I am not saying it is going to work 100% sure- I am just suggesting to try it first, and then go the slow route, which is basically what you sugg... [15:51:27] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3076044 (10Marostegui) >>! In T157359#3076003, @akosiaris wrote: > > Actually I had a different plan in mind. So, labsdb1007 is a read-only slave of labsdb1006. My propo... [15:53:40] 06Operations, 10ops-codfw, 10DBA: Predictive disk failure on db2048 - https://phabricator.wikimedia.org/T159666#3076045 (10Papaul) p:05Triage>03Normal [15:54:04] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T159665#3076046 (10Papaul) p:05Triage>03Normal [15:54:56] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Decomission ms-fe2001-4 - https://phabricator.wikimedia.org/T159413#3076047 (10Papaul) p:05Triage>03Normal [15:56:42] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3076048 (10Papaul) [15:57:51] (03CR) 10jenkins-bot: Add a CSP policy to foundationwiki to prevent privacy breach [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341259 (https://phabricator.wikimedia.org/T159386) (owner: 10Brian Wolff) [15:58:00] (03CR) 10jenkins-bot: Create extension1 db cluster for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341120 (https://phabricator.wikimedia.org/T156241) (owner: 10Addshore) [15:58:30] (03CR) 10jenkins-bot: Change account creation throttle for idwiki to default (6) 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/341263 (owner: 10Brian Wolff) [15:59:01] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3076066 (10akosiaris) >>! In T157359#3076041, @jcrespo wrote: > Hey, I am not saying it is going to work 100% sure- I am just suggesting to try it first, and then go the... [15:59:32] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, 10Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3076068 (10Ottomata) Q: Does T159165 mean that we no longer need to get a new stat box with a GPU? Or is this ticket still valid? I'm ab... [16:00:52] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3076077 (10Marostegui) >>! In T157359#3076066, @akosiaris wrote: > Yeah I can do that if it makes you two happier. That would be appreciated. Not happier per se, but a... [16:01:22] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3076082 (10jcrespo) @akosiaris Do you know if labsdb1007 is actively in use? If not at all, we could start doing it now, ahead of the maintenance window... [16:05:12] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:05:12] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:05:12] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:05:19] ^ i will silence that [16:05:21] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:05:21] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:05:21] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:05:21] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:05:21] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:05:22] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:05:25] (03CR) 10Giuseppe Lavagetto: [C: 04-1] Ruthenium VisualDiff: Test w/ local Parsoid instead of prod Parsoid (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/338950 (owner: 10Subramanya Sastry) [16:05:31] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:05:31] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:05:31] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:05:31] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:05:31] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:05:32] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:06:56] silenced [16:09:11] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [16:09:11] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [16:09:11] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [16:09:11] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [16:09:11] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [16:09:12] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [16:09:12] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [16:09:13] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [16:09:13] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [16:09:21] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [16:09:21] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [16:09:21] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [16:09:21] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [16:09:21] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [16:09:22] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [16:16:31] PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:20:58] (03CR) 10Tarrow: "This change seems to break installing elasticsearch (in my case from role::toollabs::elasticsearch)." 
[puppet] - 10https://gerrit.wikimedia.org/r/341303 (owner: 10Gehel) [16:24:19] tarrow: patch coming up... I did test for our various production clusters, but not for labs... [16:24:59] gehel: thanks! I had a poke around myself trying to find a fix first but didn't succeed [16:25:01] (03PS1) 10ArielGlenn: fix typo in name of list of files/dirs to include for public rsyncs [puppet] - 10https://gerrit.wikimedia.org/r/341343 [16:25:19] (03PS1) 10Gehel: elasticsearch - plugin directory is managed outside of the elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/341344 [16:25:42] tarrow: ^ (and sorry for the pain) [16:26:39] (03CR) 10ArielGlenn: [C: 032] fix typo in name of list of files/dirs to include for public rsyncs [puppet] - 10https://gerrit.wikimedia.org/r/341343 (owner: 10ArielGlenn) [16:29:01] (03CR) 10Gehel: [C: 032] elasticsearch - plugin directory is managed outside of the elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/341344 (owner: 10Gehel) [16:29:09] (03PS2) 10Gehel: elasticsearch - plugin directory is managed outside of the elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/341344 [16:29:24] (03CR) 10Gehel: [V: 032 C: 032] elasticsearch - plugin directory is managed outside of the elasticsearch class [puppet] - 10https://gerrit.wikimedia.org/r/341344 (owner: 10Gehel) [16:31:37] tarrow: you should be good. Ping me if not [16:32:10] :D [16:37:11] 06Operations, 10Ops-Access-Requests, 06Research-and-Data: add the #acl*operations-team to the s9 analytics space for nda approvals - https://phabricator.wikimedia.org/T152718#3076193 (10RobH) 05Open>03declined I'm now going to decline this, since our access policies changed for shell requests. ALL shell... [16:38:41] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:40:11] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: puppet fail [16:41:31] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 614 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4218525 keys, up 126 days 8 hours - replication_delay is 614 [16:41:41] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 622 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4218585 keys, up 126 days 8 hours - replication_delay is 622 [16:42:40] 06Operations, 06Labs, 07kubernetes: docker-engine pulled into our repositories only keeps the latest version - https://phabricator.wikimedia.org/T153416#3076211 (10yuvipanda) There's a thread on ops-l now because docker is now docker community edition. [16:44:31] RECOVERY - puppet last run on gerrit2001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [16:45:11] RECOVERY - check_puppetrun on samarium is OK: OK: Puppet is currently enabled, last run 213 seconds ago with 0 failures [16:45:51] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, 10Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3076221 (10Halfak) I think that having a GPU in a stats machine for modeling work will be critical for the research team and any other mode... [16:55:48] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341348 [16:58:24] 06Operations: Superfluous rsyncd on ruthenium - https://phabricator.wikimedia.org/T159676#3076243 (10Dzahn) 05Open>03Resolved Yes, this was from a migration. Thanks for pointing it out. 
- stopped rsyncd - deleted config files - confirmed it doesn't come back even if somebody tries to start it [17:03:34] (03PS4) 10Volans: Ruthenium VisualDiff: Test w/ local Parsoid instead of prod Parsoid [puppet] - 10https://gerrit.wikimedia.org/r/338950 (owner: 10Subramanya Sastry) [17:04:04] 06Operations, 10ops-codfw, 10DBA: Predictive disk failure on db2048 - https://phabricator.wikimedia.org/T159666#3076250 (10Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below.... [17:04:32] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T159665#3076252 (10Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below. Your reque... [17:07:41] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [17:08:46] (03PS1) 10DatGuy: Turn off patrolling for FlaggedRevs in bswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341350 (https://phabricator.wikimedia.org/T158662) [17:16:14] (03PS5) 10Volans: Ruthenium VisualDiff: Test w/ local Parsoid instead of prod Parsoid [puppet] - 10https://gerrit.wikimedia.org/r/338950 (owner: 10Subramanya Sastry) [17:28:22] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3076294 (10Papaul) [17:32:37] (03CR) 10Volans: "Puppet compiler seems ok to me: https://puppet-compiler.wmflabs.org/5663/ruthenium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/338950 (owner: 10Subramanya Sastry) [17:38:12] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3076308 (10akosiaris) It is not. 
It is a read-only slave not really being used by anyone currently so we are free to start the process on it well ahead of the maint window. [17:38:59] 06Operations, 10Domains, 10Traffic: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3076309 (10Beetlebeard) >>! In T158638#3043708, @Reedy wrote: > https://github.com/wikimedia/operations-dns/blob/master/templates/wikimedia.ee > > If you follow "Add a record to... [17:39:12] (03PS1) 10Papaul: DNS/Decom Remove mgmt dns for mc2001-mc2016 [dns] - 10https://gerrit.wikimedia.org/r/341352 [17:41:13] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: labsdb1006/1007 (postgresql) maintenance - https://phabricator.wikimedia.org/T157359#3076313 (10jcrespo) Thanks, I will start "breaking" it tomorrow Tuesday during EU morning- do not worry, I can take care of this- you probably have more urgent things to... [17:42:19] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3076318 (10Papaul) [17:45:41] (03CR) 10Addshore: [C: 04-2] "This now doesn't require table creation, but it does require a database to be created called "cognate_wiktionary" on the beta db" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341123 (https://phabricator.wikimedia.org/T156241) (owner: 10Addshore) [17:46:09] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3076325 (10Papaul) a:05Papaul>03RobH @RobH I am done with this task you can go ahead and remove the port information on the switches. Not... 
[17:49:51] (03CR) 10Addshore: [C: 031] include blank htpasswd needed for role::toollabs::elasticsearch [labs/private] - 10https://gerrit.wikimedia.org/r/341326 (owner: 10Tarrow) [17:56:17] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Shreyas Lakhtakia (shrlak) - https://phabricator.wikimedia.org/T158978#3076331 (10Ottomata) 05Open>03Resolved a:03Ottomata No objections since Wednes... [17:57:12] (03CR) 10Jcrespo: [C: 031] puppet: Remove templatedir setting [puppet] - 10https://gerrit.wikimedia.org/r/337837 (https://phabricator.wikimedia.org/T95158) (owner: 10Jcrespo) [17:57:57] (03PS1) 10Chad: Update interwiki map, horizon -> horizonlabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341356 [17:58:26] (03PS2) 10Chad: Update interwiki map, horizon -> horizonlabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341356 (https://phabricator.wikimedia.org/T159680) [17:58:28] 06Operations, 10Wikimedia-Stream: rcstream service - gevent dependency incompatibility - https://phabricator.wikimedia.org/T153773#2890586 (10Ottomata) FYI: RCStream is planned for decommission in July this year. [17:59:05] 06Operations, 10Wikimedia-Stream: Error on RCSteam server startup for the "flash policy server" - https://phabricator.wikimedia.org/T153770#2890496 (10Ottomata) FYI, RCStream is planned to be decomissioned in July this year. [17:59:15] (03CR) 10Chad: [C: 032] Update interwiki map, horizon -> horizonlabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341356 (https://phabricator.wikimedia.org/T159680) (owner: 10Chad) [18:00:04] gehel: Dear anthropoid, the time has come. Please deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170306T1800). 
[18:00:31] (03Merged) 10jenkins-bot: Update interwiki map, horizon -> horizonlabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341356 (https://phabricator.wikimedia.org/T159680) (owner: 10Chad) [18:00:41] (03CR) 10jenkins-bot: Update interwiki map, horizon -> horizonlabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341356 (https://phabricator.wikimedia.org/T159680) (owner: 10Chad) [18:01:03] (03PS2) 10Elukey: Rework analytics-flex partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/341337 (https://phabricator.wikimedia.org/T159530) [18:01:28] (03CR) 10Dzahn: [C: 032] include blank htpasswd needed for role::toollabs::elasticsearch [labs/private] - 10https://gerrit.wikimedia.org/r/341326 (owner: 10Tarrow) [18:01:31] (03CR) 10Dzahn: [V: 032 C: 032] include blank htpasswd needed for role::toollabs::elasticsearch [labs/private] - 10https://gerrit.wikimedia.org/r/341326 (owner: 10Tarrow) [18:02:16] thanks for the interwiki map update RainbowSprinkles ;) [18:02:27] !log gehel@tin Started deploy [wdqs/wdqs@7b77735]: (no justification provided) [18:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:58] !log demon@tin Synchronized wmf-config/interwiki.php: Sync interwiki list, T159680 (duration: 00m 41s) [18:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:04] T159680: Please sync. Interwiki Map following update on Meta - https://phabricator.wikimedia.org/T159680 [18:04:13] !log gehel@tin Finished deploy [wdqs/wdqs@7b77735]: (no justification provided) (duration: 01m 46s) [18:04:16] (03CR) 10Dzahn: [C: 031] DNS/Decom Remove mgmt dns for mc2001-mc2016 [dns] - 10https://gerrit.wikimedia.org/r/341352 (owner: 10Papaul) [18:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:28] TabbyCat: You're welcome, all done [18:04:47] SMalyshev: wdqs deployed, tests are looking good! 
[18:05:11] (03CR) 10Dzahn: [C: 031] "robh, can you confirm (ports on switches still to be deactivated i think)" [dns] - 10https://gerrit.wikimedia.org/r/341352 (owner: 10Papaul) [18:05:36] RainbowSprinkles: I don't think normal users can run the dumpInterwiki.php script on our machines, right? [18:06:04] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3076398 (10Dzahn) @Robh when done can you confirm https://gerrit.wikimedia.org/r/#/c/341352/ is ready to go? [18:06:18] (03CR) 10RobH: [C: 031] "I still have to wipe the port config, but the servers have been unracked so all DNS can indeed be removed. (Just that task has to assign " [dns] - 10https://gerrit.wikimedia.org/r/341352 (owner: 10Papaul) [18:06:24] TabbyCat: Like, non shell users? No, no they can't [18:06:31] Gotta be a deployer [18:06:36] (it's config) [18:06:40] mutante: did on gerrit already [18:06:48] (03PS2) 10Dzahn: DNS/Decom Remove mgmt dns for mc2001-mc2016 [dns] - 10https://gerrit.wikimedia.org/r/341352 (owner: 10Papaul) [18:06:48] you can merge that since the systems are unracked at this time [18:06:50] Okay, that was my question indeed. [18:07:00] (03PS12) 10Madhuvishy: nfs: Snapshot backup device on secondary DC before replicating latest from remote [puppet] - 10https://gerrit.wikimedia.org/r/334692 (https://phabricator.wikimedia.org/T149870) [18:07:08] (03CR) 10Dzahn: [C: 032] "ok, thanks, merging this. ticket has been assigned to you" [dns] - 10https://gerrit.wikimedia.org/r/341352 (owner: 10Papaul) [18:07:22] robh: :) just saw, thanks [18:07:31] 06Operations, 10ops-codfw, 10hardware-requests, 13Patch-For-Review: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3076416 (10RobH) That can merge, but leave this ticket open and assigned to me until I remove the port descriptions from the switches. 
[18:07:42] gehel: thank you, all seems good! [18:07:52] I'm in the middle of a couple other task updates so will wipe switch port description shortly [18:08:04] mutante: the switch port disable is time sensitive, the switch port description wipe far less so ;D [18:08:07] but thanks for checking! [18:08:22] gehel: can I deploy db-codfw.php? [18:08:32] robh: yea, just extra careful now with the decom steps [18:09:54] marostegui: wdqs deploy completed, you're good to go! [18:10:03] thanks!! [18:10:10] (03PS2) 10Marostegui: Revert "db-codfw.php: Depool db2060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341348 [18:10:43] 06Operations, 13Patch-For-Review, 06Services (watching), 15User-mobrovac: setup/deploy scb2005 & scb2006 - https://phabricator.wikimedia.org/T159486#3076429 (10mobrovac) Kk, thnx @RobH, I'll take it from here. [18:12:23] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341348 (owner: 10Marostegui) [18:12:51] (03PS3) 10Elukey: Rework analytics-flex partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/341337 (https://phabricator.wikimedia.org/T159530) [18:13:12] mutante: any chance that (if you have time) you could review --^ and give me some hints [18:13:15] ? [18:13:35] (03CR) 10Rush: [C: 031] "(having not actually run this) it seems good, glad for the python translation.
Lends itself to a next round of some relative sanity check" [puppet] - 10https://gerrit.wikimedia.org/r/334692 (https://phabricator.wikimedia.org/T149870) (owner: 10Madhuvishy) [18:13:47] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341348 (owner: 10Marostegui) [18:14:01] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341348 (owner: 10Marostegui) [18:14:59] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2060 - T159414 (duration: 00m 44s) [18:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:06] T159414: Rampant differences in indexes and PK on s6 (frwiki, jawiki, ruwiki) for revision table - https://phabricator.wikimedia.org/T159414 [18:15:11] PROBLEM - puppet last run on mw1263 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:15:33] elukey: hmm, i don't actually know much about partman recipes, i think dc-ops people are much better reviewer for that [18:15:36] (03PS3) 10Rush: labstore: Cleanup old/unused labstore1001 nfs related puppet files [puppet] - 10https://gerrit.wikimedia.org/r/339577 (https://phabricator.wikimedia.org/T158196) (owner: 10Madhuvishy) [18:16:33] (03PS1) 10Dzahn: change MX records for wikimedia.ee from elkdata.ee to Google [dns] - 10https://gerrit.wikimedia.org/r/341359 (https://phabricator.wikimedia.org/T158638) [18:16:46] (03CR) 10jerkins-bot: [V: 04-1] change MX records for wikimedia.ee from elkdata.ee to Google [dns] - 10https://gerrit.wikimedia.org/r/341359 (https://phabricator.wikimedia.org/T158638) (owner: 10Dzahn) [18:16:54] mutante: sure thanks :) [18:16:58] (03PS2) 10Dzahn: change MX records for wikimedia.ee from elkdata.ee to Google [dns] - 10https://gerrit.wikimedia.org/r/341359 (https://phabricator.wikimedia.org/T158638) [18:17:11] (03CR) 10jerkins-bot: [V: 04-1] change MX records for 
wikimedia.ee from elkdata.ee to Google [dns] - 10https://gerrit.wikimedia.org/r/341359 (https://phabricator.wikimedia.org/T158638) (owner: 10Dzahn) [18:17:54] (03CR) 10Madhuvishy: [C: 032] nfs: Snapshot backup device on secondary DC before replicating latest from remote [puppet] - 10https://gerrit.wikimedia.org/r/334692 (https://phabricator.wikimedia.org/T149870) (owner: 10Madhuvishy) [18:19:13] (03CR) 10Rush: [C: 031] labstore: Cleanup old/unused labstore1001 nfs related puppet files [puppet] - 10https://gerrit.wikimedia.org/r/339577 (https://phabricator.wikimedia.org/T158196) (owner: 10Madhuvishy) [18:19:55] (03PS3) 10Dzahn: change MX records for wikimedia.ee from elkdata.ee to Google [dns] - 10https://gerrit.wikimedia.org/r/341359 (https://phabricator.wikimedia.org/T158638) [18:22:19] 06Operations, 10Domains, 10Traffic, 13Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3076473 (10Dzahn) @Beetlebeard how does that Gerrit link look to you? [18:22:25] !log analytics1040 has been silenced and it is not ready to work, need to fix its partman recipe [18:22:29] 06Operations, 10hardware-requests, 13Patch-For-Review: hardware request for netmon1001 replacement - https://phabricator.wikimedia.org/T156040#3076477 (10RobH) p:05Triage>03High [18:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:42] 06Operations, 10hardware-requests: hardware request for netmon1001 replacement - https://phabricator.wikimedia.org/T156040#2962228 (10RobH) [18:24:34] (03PS2) 10Muehlenhoff: Add two additional privileged groups [puppet] - 10https://gerrit.wikimedia.org/r/341335 [18:27:39] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint, and 2 others: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3076549 (10debt) 05Resolved>03Open a:05debt>03None This might be causing issues - as noted recently in T153764... 
[18:27:50] (03CR) 10Muehlenhoff: [C: 032] Add two additional privileged groups [puppet] - 10https://gerrit.wikimedia.org/r/341335 (owner: 10Muehlenhoff) [18:29:14] 06Operations, 10DBA, 05Prometheus-metrics-monitoring: Create a script to regenerate prometheus mysqld exporter listing that works with puppetdb - https://phabricator.wikimedia.org/T145072#3076558 (10fgiunchedi) Update from the monitoring meeting, this can be implemented via puppetdb queries, additionally the... [18:30:17] (03Abandoned) 10DCausse: Elastic 5.2.1 plugins [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/338756 (owner: 10DCausse) [18:32:14] 06Operations, 10Wikimedia-Apache-configuration: Create 2030.wikimedia.org redirect to Meta portal - https://phabricator.wikimedia.org/T158981#3076590 (10Dzahn) a:03Dzahn [18:32:48] 06Operations, 10Wikimedia-Apache-configuration: Create 2030.wikimedia.org redirect to Meta portal - https://phabricator.wikimedia.org/T158981#3053534 (10Dzahn) @gpaumier The name can be used. I will take this to get it up before Wednesday. [18:34:00] (03PS4) 10Madhuvishy: labstore: Cleanup old/unused labstore1001 nfs related puppet files [puppet] - 10https://gerrit.wikimedia.org/r/339577 (https://phabricator.wikimedia.org/T158196) [18:34:06] (03PS1) 10Dzahn: add 2030.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/341362 (https://phabricator.wikimedia.org/T158981) [18:34:10] (03CR) 10Madhuvishy: [V: 032 C: 032] labstore: Cleanup old/unused labstore1001 nfs related puppet files [puppet] - 10https://gerrit.wikimedia.org/r/339577 (https://phabricator.wikimedia.org/T158196) (owner: 10Madhuvishy) [18:42:05] 06Operations: Production error message points users to donate link, that is likely to also produce the same error message - https://phabricator.wikimedia.org/T154627#3076637 (10Ottomata) p:05Triage>03Low I'm not sure who to assign this to, or if it is totally an operations task. Maybe someone on the TBD Med... 
[18:42:38] 06Operations, 10Wikimedia-Stream: Error on RCSteam server startup for the "flash policy server" - https://phabricator.wikimedia.org/T153770#3076640 (10Ottomata) p:05Triage>03Low [18:42:40] (03PS1) 10Dzahn: rewrite 2030.wikimedia.org to meta page [puppet] - 10https://gerrit.wikimedia.org/r/341363 (https://phabricator.wikimedia.org/T158981) [18:42:48] 06Operations, 10Wikimedia-Stream: rcstream service - gevent dependency incompatibility - https://phabricator.wikimedia.org/T153773#3076642 (10Ottomata) p:05Triage>03Low [18:43:21] 06Operations, 10netops: Slight packet loss observed on the network starting Nov 2016 - https://phabricator.wikimedia.org/T154507#3076644 (10Ottomata) p:05Triage>03Normal [18:43:52] 06Operations: make apt.wikimedia.org HA - https://phabricator.wikimedia.org/T158022#3076646 (10Ottomata) p:05Triage>03Normal [18:43:59] (03PS6) 10Dzahn: toollabs: Preparing to move `/usr/local/bin/crontab` to labs/toollabs [puppet] - 10https://gerrit.wikimedia.org/r/336990 (https://phabricator.wikimedia.org/T156174) (owner: 10Zhuyifei1999) [18:44:11] RECOVERY - puppet last run on mw1263 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [18:45:41] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team: tin.eqiad.wmnet / partition is full - https://phabricator.wikimedia.org/T158358#3076653 (10Ottomata) 05Open>03Resolved a:03Ottomata / is no longer full, and there are other tickets to resolve the bigger problems. Resolving. 
[18:46:15] 06Operations, 10ChangeProp, 10Reading-Web-Trending-Service, 06Services (watching): Build and Install librdkafka 0.9.4 on SCB - https://phabricator.wikimedia.org/T159379#3076659 (10Ottomata) p:05Triage>03Normal a:03Ottomata [18:46:27] 06Operations, 10Analytics, 10ChangeProp, 10Reading-Web-Trending-Service, 06Services (watching): Build and Install librdkafka 0.9.4 on SCB - https://phabricator.wikimedia.org/T159379#3065770 (10Ottomata) [18:47:57] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1060 - https://phabricator.wikimedia.org/T158193#3076667 (10Ottomata) p:05Triage>03Normal a:03Marostegui Assigning, feel free to reassign. [18:49:29] 06Operations, 07Puppet: PuppetDB is auto-deactivating hosts - https://phabricator.wikimedia.org/T159163#3076672 (10Ottomata) p:05Triage>03Normal a:03Volans +1 from me too. @volans, I'm just triaging, feel free to assign un-assign this at will. [18:50:08] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1070 - https://phabricator.wikimedia.org/T158969#3076675 (10Ottomata) a:03Marostegui [18:50:15] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1070 - https://phabricator.wikimedia.org/T158969#3053239 (10Ottomata) p:05Triage>03Normal [18:51:03] 06Operations, 06Performance-Team: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3076678 (10Ottomata) a:03Krinkle @krinkle, feel free to re-assign, triage as needed. [18:52:12] 06Operations, 10Gerrit, 06Release-Engineering-Team: Decide weather to disable drafts in gerrit - https://phabricator.wikimedia.org/T158656#3076685 (10Ottomata) p:05Triage>03Low a:03demon Triaging, feel free to re-assign as needed. [18:54:28] 06Operations: Manage apt sources via puppet? - https://phabricator.wikimedia.org/T158562#3076693 (10Ottomata) p:05Triage>03Low a:03MoritzMuehlenhoff +1 in general, but yeah, I think this should basically already be happening. Perhaps manually installing our own sources.list via an ERb template would be b... 
[18:55:24] 06Operations, 10Analytics, 06Performance-Team: Update jq to v1.4.0 or higher - https://phabricator.wikimedia.org/T159392#3076697 (10Ottomata) p:05Triage>03Low a:03Ottomata I'll take this on, low priority though. Remind me about it if you get fidgety! :) [18:55:57] 06Operations, 06Performance-Team: Move coal from graphite machine(s) - https://phabricator.wikimedia.org/T159354#3076700 (10Ottomata) p:05Triage>03Normal a:03Krinkle [18:57:01] 06Operations, 10Analytics, 06Performance-Team: Update jq to v1.4.0 or higher - https://phabricator.wikimedia.org/T159392#3066309 (10MoritzMuehlenhoff) jessie has jq 1.4, so this would also be fixed once stat1002 is migrated to jessie. [18:57:38] 06Operations, 10MediaWiki-extensions-CentralNotice, 15User-Dereckson: Create /community-beacon alternative entry point - https://phabricator.wikimedia.org/T155929#2959358 (10Ottomata) @Dereckson, I'm doing some triaging. Are you sure this is an operations task? [18:57:47] 06Operations, 10Analytics, 10ChangeProp, 10Reading-Web-Trending-Service, 06Services (watching): Build and Install librdkafka 0.9.4 on SCB - https://phabricator.wikimedia.org/T159379#3076717 (10Pchelolo) After some testing of driver-librdkafka compatibility, here's the deal: 1. Currently we are using `nod... [18:58:47] 06Operations, 10DBA, 10Wikimedia-General-or-Unknown: Spurious completely empty `image` table row on commonswiki - https://phabricator.wikimedia.org/T155769#3076720 (10Ottomata) a:03matmarex Triaging, feel free to re-assign. [18:59:31] PROBLEM - puppet last run on ms-be1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:59:54] 06Operations, 10media-storage: unknown error occurred in storage backend "local-swift-codfw" - https://phabricator.wikimedia.org/T155323#3076723 (10Ottomata) 05Open>03declined Declining this for now, as it seems to have fixed itself! ;) If it happens again, please re-open. 
[19:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170306T1900). Please do the needful. [19:00:04] TabbyCat: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [19:00:22] * TabbyCat is here [19:00:25] meow [19:02:26] 06Operations, 10MediaWiki-API, 10Traffic: Varnish does not cache Action API responses when logged in - https://phabricator.wikimedia.org/T155314#3076735 (10Ottomata) p:05Triage>03Normal Ping @ema. I'm not sure how to triage this. Does something need to change on the varnish end? Or just the change @An... [19:03:57] I can SWAT [19:04:10] ^_^ [19:05:07] (03CR) 10Addshore: [C: 031] Enable Cognate for beta wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341123 (https://phabricator.wikimedia.org/T156241) (owner: 10Addshore) [19:06:25] (03PS12) 10Thcipriani: Rename 'technician' to 'interface-editor' on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308281 (https://phabricator.wikimedia.org/T144638) (owner: 10MarcoAurelio) [19:06:29] thcipriani: I just added that one to the end^^ beta only [19:06:40] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: compile number of http uses for http://www.wikidata.org/entity - https://phabricator.wikimedia.org/T154017#3076741 (10Ottomata) p:05Triage>03Low [19:07:21] addshore: ack, np [19:07:38] 06Operations, 06Discovery, 10Traffic, 06WMDE-Tech-Communication, and 3 others: announce breaking change: http > https for entities in rdf - https://phabricator.wikimedia.org/T154015#3076747 (10Ottomata) This has been placed on the Traffic board, removing operations tag. 
[19:07:43] 06Operations, 10Analytics, 06Performance-Team: Update jq to v1.4.0 or higher - https://phabricator.wikimedia.org/T159392#3076750 (10Krinkle) @MoritzMuehlenhoff Thanks. Is there a ticket for that? I've transferred my data to terbium for post-processing for the time being because the python/ua-parser package... [19:08:04] 06Operations, 10Ops-Access-Requests: Requesting access to wikimedia-operations-channel-op for Luke081515 - https://phabricator.wikimedia.org/T159473#3068704 (10Ottomata) There were no objections, but I do not have rights to do this. [19:09:22] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1060 - https://phabricator.wikimedia.org/T158193#3076755 (10Cmjohnson) Slot 7 is just offline for some reason. Changed status to online cmjohnson@db1060:~$ sudo megacli -PDOnline -PhysDrv [32:7] -a0 EnclId-32 SlotId-... [19:10:11] RECOVERY - MegaRAID on db1060 is OK: OK: optimal, 1 logical, 2 physical [19:10:35] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308281 (https://phabricator.wikimedia.org/T144638) (owner: 10MarcoAurelio) [19:10:48] 06Operations, 10Wikimedia-General-or-Unknown: Class 'Memcached' not found when running mwscript eval.php on debug servers - https://phabricator.wikimedia.org/T150912#3076759 (10Ottomata) p:05Triage>03Low Hopefully T146285 will make this not necessary. Setting to Low priority. 
[19:11:09] 06Operations, 10Ops-Access-Requests: Requesting access to wikimedia-operations-channel-op for Luke081515 - https://phabricator.wikimedia.org/T159473#3076762 (10Ottomata) a:03RobH [19:11:47] 06Operations, 10DBA, 10Monitoring: Create a check/calendar alert for MariaDB TLS certs - https://phabricator.wikimedia.org/T152427#3076765 (10Ottomata) a:03Dzahn [19:12:07] PROBLEM - mysqld processes on db1060 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [19:12:13] PROBLEM - MariaDB Slave IO: s2 on db1060 is CRITICAL: CRITICAL slave_io_state could not connect [19:12:18] PROBLEM - MariaDB Slave SQL: s2 on db1060 is CRITICAL: CRITICAL slave_sql_state could not connect [19:12:25] Sagan: You have ops right now ;] [19:12:57] you cannot just put online a disk that was offline [19:13:09] you are just breaking the server [19:13:41] (03Merged) 10jenkins-bot: Rename 'technician' to 'interface-editor' on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308281 (https://phabricator.wikimedia.org/T144638) (owner: 10MarcoAurelio) [19:13:53] (03CR) 10jenkins-bot: Rename 'technician' to 'interface-editor' on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308281 (https://phabricator.wikimedia.org/T144638) (owner: 10MarcoAurelio) [19:14:13] * TabbyCat enables x-wikimedia-debug for testing [19:14:33] 06Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Current state and next steps for RESTBase storage - https://phabricator.wikimedia.org/T152724#3076771 (10Ottomata) p:05Triage>03Normal [19:14:41] 06Operations, 10Ops-Access-Requests: Requesting access to wikimedia-operations-channel-op for Luke081515 - https://phabricator.wikimedia.org/T159473#3076772 (10RobH) 05Open>03Resolved Done and implemented. @Luke081515 can now Op himself and kick/ban trolls. 
[19:14:57] TabbyCat: the group rename for trwiki is on mwdebug1002, check please [19:15:52] thcipriani: lgtm on mwdebug1002 [19:16:03] TabbyCat: ok, going live, then I'll run the mwscript [19:16:06] but we need to move the users afterwards :) [19:16:37] thcipriani: not sure if the syntax I posted on Phab for the script is correct, do not trust it if in doubt [19:16:53] ok :) [19:17:26] jynus: need help? [19:17:29] 06Operations, 10Domains, 10Traffic, 15User-Urbanecm: Wikipedia.cz and other domains owned by WMCZ have invalid certificate - https://phabricator.wikimedia.org/T152622#3076776 (10Ottomata) p:05Triage>03Normal I'm not sure who will make this decision, but @robh often handles cert issues, so let's ask him. [19:17:45] I have to depool db1060 [19:18:01] media got corrupted for raid misconfig [19:18:08] oh no [19:18:11] I will depool it [19:18:13] no outage that I can see [19:18:47] 06Operations, 10Domains, 10Traffic, 15User-Urbanecm: Wikipedia.cz and other domains owned by WMCZ have invalid certificate - https://phabricator.wikimedia.org/T152622#2855117 (10BBlack) Probably this should be merged into T133548, unless it's altered to be about implementing some other solution outside of... [19:19:13] PROBLEM - MariaDB Slave Lag: s2 on db1060 is CRITICAL: CRITICAL slave_sql_lag could not connect [19:19:15] plan was to run mwscript migrateUserGroup.php --wiki=trwiki 'technician' 'interface-editor' [19:19:23] 06Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Option: Consider switching back to leveled compaction (LCS) - https://phabricator.wikimedia.org/T153703#3076799 (10Ottomata) p:05Triage>03Normal [19:19:30] should I be pausing SWAT for db problems? 
[19:19:45] 06Operations, 07Puppet, 10Continuous-Integration-Config: Get rid of "import realm.pp" in manifests/site.pp - https://phabricator.wikimedia.org/T154915#3076815 (10Ottomata) p:05Triage>03Normal [19:19:46] (03PS1) 10Marostegui: db-eqiad.php: depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341369 [19:19:50] thcipriani: I would need to push ^ [19:20:24] thcipriani, yes [19:20:29] this is important [19:20:31] marostegui: go for it, if you just sync file nothing else will go out [19:20:35] cheers [19:20:42] noted [19:20:43] jynus: ok [19:22:24] (03CR) 10Marostegui: [V: 032 C: 032] db-eqiad.php: depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341369 (owner: 10Marostegui) [19:22:56] 06Operations, 10Domains, 10Traffic, 15User-Urbanecm: Wikipedia.cz and other domains owned by WMCZ have invalid certificate - https://phabricator.wikimedia.org/T152622#3076819 (10RobH) Implementation of SSL on our cluster is handled by the traffic team. So if there is a problem, they would be ideal to ask.... 
[19:23:46] (03Merged) 10jenkins-bot: db-eqiad.php: depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341369 (owner: 10Marostegui) [19:23:55] (03CR) 10jenkins-bot: db-eqiad.php: depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341369 (owner: 10Marostegui) [19:24:31] 06Operations, 10ops-eqiad, 10DBA: Reimage db1060 due to pysical disk corruption (was: Degraded RAID on db1060) - https://phabricator.wikimedia.org/T158193#3076820 (10jcrespo) [19:24:56] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1060 (duration: 00m 40s) [19:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:44] at least we have offline hosts in s2 so we can copy the data easily tomorrow after reimage [19:26:52] well, it could have been worse [19:26:53] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1070 - https://phabricator.wikimedia.org/T158969#3076821 (10Marostegui) a:05Marostegui>03Cmjohnson I will assign this to @Cmjohnson so he can change the disk once it is onsite Thanks! [19:27:01] innodb detects corruption instantly [19:27:10] and shuts itself down [19:27:28] rather than serving wrong data, so that is reliability [19:27:41] haha [19:27:48] RECOVERY - puppet last run on ms-be1014 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [19:27:49] jynus: can SWAT continue then? [19:27:52] I am not joking [19:28:07] TabbyCat, one sec [19:28:11] k [19:28:31] 06Operations, 10Domains, 10Traffic, 15User-Urbanecm: Wikipedia.cz and other domains owned by WMCZ have invalid certificate - https://phabricator.wikimedia.org/T152622#3076823 (10Dzahn) @Robh yea, but wikipedia.cz is our IP and DNS servers [19:28:43] 06Operations, 10Domains, 10Traffic, 15User-Urbanecm: Wikipedia.cz and other domains owned by WMCZ have invalid certificate - https://phabricator.wikimedia.org/T152622#3076837 (10Ottomata) 05Open>03declined @Urbanecm, I'm going to decline this ticket then.
Wikimedia CZ owns this domain, so they'd have... [19:28:46] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Create a secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548#3076839 (10Ottomata) [19:29:18] 06Operations, 10Traffic: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#3076841 (10Ottomata) p:05Triage>03Normal I don't fully understand this issue, but I'm not sure how it could be fixed on our side. If there is a way, its likely to be very... [19:29:26] jynus: Going to downtime db1060 [19:30:27] (03PS1) 10Chad: WIP: Create hourly backup schedule, modeled on weekly [puppet] - 10https://gerrit.wikimedia.org/r/341371 [19:30:33] 06Operations, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Make it possible to run the mediawiki testsuite against a staging repo of apt.wikimedia.org - https://phabricator.wikimedia.org/T157038#3076843 (10Ottomata) p:05Triage>03Normal [19:31:41] 06Operations, 10Internet-Archive, 06Offline-Working-Group: Create backups of Wikimedia content in diverse geographic places - https://phabricator.wikimedia.org/T156544#3076844 (10Ottomata) p:05Triage>03Low [19:31:59] mw error count is still high? [19:32:43] (03PS1) 10EBernhardson: deployment-prep: Use elasticsearch 5.x [puppet] - 10https://gerrit.wikimedia.org/r/341372 [19:32:53] it is decreasing on logstash [19:33:09] but is it redis, not mysql?
[19:33:21] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Create a secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548#3076848 (10Ottomata) [19:33:24] 06Operations, 10Domains, 10Traffic, 15User-Urbanecm: Wikipedia.cz and other domains owned by WMCZ have invalid certificate - https://phabricator.wikimedia.org/T152622#3076846 (10Ottomata) 05declined>03Open Re-opening, I think both @robh and I read this as 'wikimedia.cz', not 'wikipedia.cz'. wikipedia.... [19:33:27] overall seems to have dropped off at 19:25 afaict [19:33:50] yeah, I can see it now [19:34:04] 06Operations, 10MediaWiki-General-or-Unknown: Investigate spike in 500s during asw-c2-eqiad replacement - https://phabricator.wikimedia.org/T156475#3076861 (10Ottomata) p:05Triage>03Normal [19:34:32] yeah, the top wiki complaining isn't on s2 [19:34:42] thcipriani, looks good to me [19:34:48] as in, mw [19:35:19] jynus: yep, same. OK, I will continue with SWAT if it won't interfere? [19:35:22] 06Operations, 10ops-eqiad, 10DBA: Reimage db1060 due to pysical disk corruption (was: Degraded RAID on db1060) - https://phabricator.wikimedia.org/T158193#3076884 (10Marostegui) Even though the server's data is now corrupted and needs to be reimaged, the RAID is on optimal status: ``` root@db1060:~# megacli... 
[19:36:17] 06Operations: use htpasswd instead of htdigest for arbcom archive passwords - https://phabricator.wikimedia.org/T157761#3076886 (10Ottomata) p:05Triage>03Normal a:03Dzahn Just triaging, feel free to re-assign [19:37:29] 06Operations, 06Performance-Team: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3076911 (10Krinkle) [19:37:39] 06Operations, 06Performance-Team: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3049036 (10Krinkle) [19:38:27] 06Operations, 06Performance-Team: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3049036 (10Krinkle) I've added brief descriptions for each of these services and how they work together with other services. This also implicitly lays out the requirements. [19:38:58] 06Operations: Puppet fails only once when restarting ferm is not successful - https://phabricator.wikimedia.org/T157972#3076927 (10Ottomata) p:05Triage>03Normal Puppet can't just ensure => 'running' on the ferm service? Or is ferm a special case, and not a puppet service resource type? [19:39:25] 06Operations, 10ops-eqiad, 06Services (watching): Degraded RAID on restbase-dev1001 - https://phabricator.wikimedia.org/T157425#3076934 (10Cmjohnson) I removed the disk and will bring it with me while I am gone. @Robh will let know if and where I need to send it for RMA. 
[19:39:38] 06Operations, 10Traffic, 07HTTPS, 05Security: $wgServer with initial https:// does not force HTTPS - https://phabricator.wikimedia.org/T156320#3076936 (10Ottomata) [19:39:52] !log gehel@tin Started deploy [wdqs/wdqs@1f2973c]: (no justification provided) [19:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:01] 06Operations: use htpasswd instead of htdigest for arbcom archive passwords - https://phabricator.wikimedia.org/T157761#3076941 (10Dzahn) a:05Dzahn>03None Yea, this will be done but it's supposed to happen not until September and i don't want to hold on to it until then. Most likely it will be me but it's fr... [19:41:12] 06Operations, 10Domains, 10Traffic, 15User-Urbanecm: Wikipedia.cz and other domains owned by WMCZ have invalid certificate - https://phabricator.wikimedia.org/T152622#3076946 (10Urbanecm) >>! In T152622#3076846, @Ottomata wrote: > Re-opening, I think both @robh and I read this as 'wikimedia.cz', not 'wikip... [19:41:15] 06Operations, 10DBA, 10Monitoring: Create a check/calendar alert for MariaDB TLS certs - https://phabricator.wikimedia.org/T152427#3076947 (10Dzahn) a:05Dzahn>03None [19:41:18] !log gehel@tin Finished deploy [wdqs/wdqs@1f2973c]: (no justification provided) (duration: 01m 25s) [19:41:19] hrm, well, I do see a few errors trickling through, but it's at no where near the rate it was. [19:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:31] TabbyCat: around to check the rename once I sync? [19:41:44] thcipriani: yep, won't move until all is done [19:41:47] :) [19:42:02] alright, I'm going live with it, then I will run the mwscript [19:42:08] k [19:42:27] 06Operations, 06DC-Ops: Lots of hosts with hyperthreading disabled - https://phabricator.wikimedia.org/T156140#2965447 (10Ottomata) p:05Triage>03Low analytics102[67] will be decomed soon. Added T159742 for analytics100[12]. 
[19:42:36] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Prepare and improve the datacenter switchover procedure - https://phabricator.wikimedia.org/T154658#3076982 (10Krinkle) [19:42:38] 06Operations, 06Performance-Team, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, and 3 others: Prepare a reasonably performant warmup tool for MediaWiki caches (memcached/apc) - https://phabricator.wikimedia.org/T156922#3076981 (10Krinkle) 05Open>03Resolved [19:42:48] 06Operations, 10Domains, 10Traffic, 15User-Urbanecm: Wikipedia.cz and other domains owned by WMCZ have invalid certificate - https://phabricator.wikimedia.org/T152622#3076984 (10Urbanecm) For clarifying: * wikimedia.cz was linked just for case when somebody want's to look what I mean by WMCZ as it links t... [19:43:12] !log thcipriani@tin Synchronized wmf-config: SWAT: [[gerrit:308281|Rename "technician" to "interface-editor" on trwiki]] T144638 (duration: 00m 46s) [19:43:12] 06Operations, 10Traffic: Hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T156033#3076986 (10Ottomata) p:05Triage>03Normal [19:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:19] T144638: Migrate 'technician' trwiki usergroup name to 'interface-editor' - https://phabricator.wikimedia.org/T144638 [19:43:19] ^ TabbyCat live now [19:43:21] 06Operations, 10Traffic: Hardware installation for Asia Cache DC - https://phabricator.wikimedia.org/T156032#3076989 (10Ottomata) p:05Triage>03Normal [19:43:24] checking [19:43:26] again [19:43:39] 06Operations, 10Traffic: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#3076990 (10Ottomata) p:05Triage>03Normal [19:43:46] !log mwscript migrateUserGroup.php --wiki=trwiki 'technician' 'interface-editor' on terbium for T159636 [19:43:49] 06Operations, 10Traffic: Select site vendor for Asia Cache Datacenter - https://phabricator.wikimedia.org/T156030#3076992 (10Ottomata) 
p:05Triage>03Normal [19:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:51] T159636: migrateUserGroup.php to finish T144638 - https://phabricator.wikimedia.org/T159636 [19:44:00] 06Operations, 10Traffic: Name Asia Cache DC site - https://phabricator.wikimedia.org/T156028#3076995 (10Ottomata) p:05Triage>03Normal [19:44:08] 06Operations, 10Traffic: Configuration for Asia Cache DC hosts - https://phabricator.wikimedia.org/T156027#3076997 (10Ottomata) p:05Triage>03Normal [19:44:32] 06Operations, 10DBA, 10Monitoring: Create a check/calendar alert for MariaDB TLS certs - https://phabricator.wikimedia.org/T152427#3077001 (10Ottomata) p:05Triage>03Normal [19:44:39] thcipriani: I guess the script is still running? [19:44:53] yes [19:44:54] 'cause no user holds the newly-renamed permission [19:45:07] ah, it's being populated [19:45:31] 06Operations, 10Traffic, 07HTTPS, 05Security: $wgServer with initial https:// does not force HTTPS - https://phabricator.wikimedia.org/T156320#3077006 (10Ottomata) p:05Triage>03Normal [19:45:48] (03CR) 10Papaul: "Are you trying to create a RAID1?" [puppet] - 10https://gerrit.wikimedia.org/r/341337 (https://phabricator.wikimedia.org/T159530) (owner: 10Elukey) [19:45:50] 06Operations: Unclean stop of jobrunner service via puppet - https://phabricator.wikimedia.org/T158288#3077013 (10Ottomata) p:05Triage>03Normal [19:46:29] TabbyCat: done! [19:46:39] thcipriani: https://tr.wikipedia.org/w/index.php?title=%C3%96zel%3AKullan%C4%B1c%C4%B1Listesi&username=&group=technician clean! 
[19:46:46] (03CR) 10Ottomata: "Yes, all of these partitions listed here should be RAID1" [puppet] - 10https://gerrit.wikimedia.org/r/341337 (https://phabricator.wikimedia.org/T159530) (owner: 10Elukey) [19:46:51] and https://tr.wikipedia.org/w/index.php?title=%C3%96zel%3AKullan%C4%B1c%C4%B1Listesi&username=&group=interface-editor&limit=50 [19:46:54] plenty of them [19:47:01] nice [19:47:21] 06Operations, 10MediaWiki-General-or-Unknown, 06Multimedia: Segmentation fault creating thumbnail - https://phabricator.wikimedia.org/T159242#3077020 (10Ottomata) p:05Triage>03Normal [19:47:32] 06Operations, 10Gerrit, 06Release-Engineering-Team: Decide weather to disable drafts in gerrit - https://phabricator.wikimedia.org/T158656#3077021 (10demon) 05Open>03Resolved Sorry, this was never an #operations issue. Question was decide whether to disable or not? I decided not to. Really, there's noth... [19:47:34] thcipriani: perfect, we can now continue [19:47:35] 06Operations, 10DBA, 13Patch-For-Review: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768#3077023 (10jcrespo) There are now 2 backups happening on dbstore1001, one on 201703061505 and another on 201703061552, one from screen, another from bacula-fd :-/ [19:47:44] TabbyCat: okie doke [19:48:03] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit: can loose data if it crashes - https://phabricator.wikimedia.org/T159743#3077029 (10Paladox) [19:48:09] (03PS2) 10Thcipriani: Add 'flow-create-board' to CommonSettings.php for global groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341193 (owner: 10MarcoAurelio) [19:48:37] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit: can loose data if it crashes - https://phabricator.wikimedia.org/T159743#3077045 (10Paladox) [19:48:42] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341193 (owner: 10MarcoAurelio) [19:49:36] 06Operations: Puppet constantly trying to stop the already
stopped puppetmaster process on Trusty - https://phabricator.wikimedia.org/T159536#3077047 (10Ottomata) p:05Triage>03Normal [19:49:38] (03Merged) 10jenkins-bot: Add 'flow-create-board' to CommonSettings.php for global groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341193 (owner: 10MarcoAurelio) [19:49:47] (03CR) 10jenkins-bot: Add 'flow-create-board' to CommonSettings.php for global groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341193 (owner: 10MarcoAurelio) [19:50:21] 06Operations: backup space is used unwisely - https://phabricator.wikimedia.org/T159524#3077050 (10Ottomata) p:05Triage>03Normal [19:50:45] 06Operations, 10MediaWiki-extensions-CentralNotice, 15User-Dereckson: Create /community-beacon alternative entry point - https://phabricator.wikimedia.org/T155929#3077051 (10Ottomata) p:05Triage>03Normal [19:51:01] 06Operations, 06Discovery, 10Traffic, 06WMDE-Tech-Communication, and 3 others: announce breaking change: http > https for entities in rdf - https://phabricator.wikimedia.org/T154015#3077052 (10Ottomata) p:05Triage>03Normal [19:51:02] TabbyCat: change is live on mwdebug1002 if there's anything to check there [19:51:12] thcipriani: yep, let me see [19:51:46] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit: can loose data if it crashes - https://phabricator.wikimedia.org/T159743#3077059 (10Ottomata) p:05Triage>03Normal [19:51:48] thcipriani: https://meta.wikimedia.org/wiki/Special:GlobalGroupPermissions/steward now has it on mwdebug [19:51:52] = okay [19:51:57] ok :) [19:52:01] going live [19:52:42] !log restarting wdqs-updater on wdqs* servers to activate GC logs - T159248 [19:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:47] T159248: collect usual GC metrics for Blazegraph JVMs - https://phabricator.wikimedia.org/T159248 [19:53:45] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:341193|Add "flow-create-board" to CommonSettings.php for global 
groups]] (duration: 00m 40s) [19:53:49] ^ TabbyCat live [19:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:55] checking again [19:55:04] 06Operations, 10Domains, 10Traffic, 13Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3077090 (10Dzahn) a:03Dzahn [19:55:05] (03PS2) 10Thcipriani: Create 'flood' flag for labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341134 (owner: 10MarcoAurelio) [19:56:23] (03PS3) 10Chad: Scap clean: abort if a branch is still in use [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340250 [19:59:06] thcipriani: CS live works finelly :) [19:59:35] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341134 (owner: 10MarcoAurelio) [19:59:40] TabbyCat: great! [20:00:04] matt_flaschen: Respected human, time to deploy Flow enable (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170306T2000). Please do the needful. 
[20:00:27] matt_flaschen: Respected human, please wait until SWAT is done :) [20:00:57] (03CR) 10Papaul: "There is no line in the recipe that is creating the RAID1 we need to have something like (d-i partman-auto-raid/recipe string \" [puppet] - 10https://gerrit.wikimedia.org/r/341337 (https://phabricator.wikimedia.org/T159530) (owner: 10Elukey) [20:00:59] (03Merged) 10jenkins-bot: Create 'flood' flag for labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341134 (owner: 10MarcoAurelio) [20:01:06] (03CR) 10jenkins-bot: Create 'flood' flag for labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341134 (owner: 10MarcoAurelio) [20:02:43] TabbyCat: since ^ is for wikitech I will just sync linve [20:02:52] live even [20:02:56] yep, because wikitech is not on silver [20:02:59] or was it [20:03:03] I never remember [20:03:30] it *is* on silver and is not available on mwdebug hosts :) [20:03:52] * TabbyCat tattooes "Wikitech is on silver" on his arm [20:04:54] 06Operations, 10MediaWiki-extensions-CentralNotice, 15User-Dereckson: Create /community-beacon alternative entry point - https://phabricator.wikimedia.org/T155929#3077147 (10Dereckson) >>! In T155929#3076714, @Ottomata wrote: > @Dereckson, I'm doing some triaging. Are you sure this is an operations task? T... 
[20:05:26] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:341134|Create "flood" flag for labswiki]] (duration: 00m 40s) [20:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:33] ^ TabbyCat live everywher [20:05:35] e [20:05:48] thcipriani: testing [20:06:11] works thcipriani [20:06:13] thanks [20:06:23] TabbyCat: cool, thanks for checking [20:07:04] I might amend the flood patch so it's not that confusing (a * appears on UserRights because the right to remove the flag is granted to the flood group itself, but I think I'll switch that to sysop and contentadmin too) [20:07:12] but that'll be tomorrow [20:07:15] (03PS2) 10Thcipriani: Enable Cognate for beta wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341123 (https://phabricator.wikimedia.org/T156241) (owner: 10Addshore) [20:07:24] addshore: your turn :) [20:07:33] 06Operations, 10Internet-Archive, 06Offline-Working-Group: Create backups of Wikimedia content in diverse geographic places - https://phabricator.wikimedia.org/T156544#3077155 (10Ariconte) >>! In T156544#2998426, @faidon wrote: > We are going to discuss it internally and update this task when we have more on... [20:07:45] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341123 (https://phabricator.wikimedia.org/T156241) (owner: 10Addshore) [20:08:37] TabbyCat: thcipriani thanks! 
[20:08:43] (03Merged) 10jenkins-bot: Enable Cognate for beta wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341123 (https://phabricator.wikimedia.org/T156241) (owner: 10Addshore) [20:08:52] (03CR) 10jenkins-bot: Enable Cognate for beta wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341123 (https://phabricator.wikimedia.org/T156241) (owner: 10Addshore) [20:11:04] !log thcipriani@tin Synchronized wmf-config: SWAT: [[gerrit:341123|Enable Cognate for beta wiktionaries]] T156241 beta-only change (duration: 00m 43s) [20:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:13] T156241: Deploy Cognate extension to beta - https://phabricator.wikimedia.org/T156241 [20:11:28] addshore: yw :) should go live with the next beta code update [20:11:41] SWAT is complete [20:12:03] matt_flaschen: sorry if I stepped on any of your deployment window :( [20:12:24] Thanks, thcipriani. It should work out. [20:12:28] PROBLEM - puppet last run on elastic1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:12:33] thanks thcipriani ! [20:15:19] yeah sorry matt_flaschen -- it's all fault of jynus and marostegui :P [20:18:45] (03PS2) 10Ottomata: Only pipe /v2/stream requests to EventStreams service, everything else can be cached by varnish [puppet] - 10https://gerrit.wikimedia.org/r/340246 (https://phabricator.wikimedia.org/T158066) [20:19:34] (03PS1) 10Addshore: Disable Cognate on beta wiktionary sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341380 (https://phabricator.wikimedia.org/T156241) [20:20:08] 06Operations, 06DC-Ops, 06Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3077244 (10madhuvishy) @Cmjohnson Moving to row B is fine, but we'd have to move both servers to row B then. We need both servers t... 
[20:22:32] (03CR) 10Ottomata: [C: 032] Only pipe /v2/stream requests to EventStreams service, everything else can be cached by varnish [puppet] - 10https://gerrit.wikimedia.org/r/340246 (https://phabricator.wikimedia.org/T158066) (owner: 10Ottomata) [20:22:37] (03CR) 10Ottomata: [C: 032] "Looks ok to me: https://puppet-compiler.wmflabs.org/5667/cp1045.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/340246 (https://phabricator.wikimedia.org/T158066) (owner: 10Ottomata) [20:22:42] matt_flaschen: If it doesn't step on your toes I would like to +2 and sync https://gerrit.wikimedia.org/r/#/c/341380/1 (beta only) undoing 1 thing from swat.. [20:23:20] addshore, that's fine, go ahead. [20:23:24] thanks! [20:23:36] (03CR) 10Addshore: [C: 032] Disable Cognate on beta wiktionary sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341380 (https://phabricator.wikimedia.org/T156241) (owner: 10Addshore) [20:24:38] (03Merged) 10jenkins-bot: Disable Cognate on beta wiktionary sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341380 (https://phabricator.wikimedia.org/T156241) (owner: 10Addshore) [20:24:52] (03CR) 10jenkins-bot: Disable Cognate on beta wiktionary sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341380 (https://phabricator.wikimedia.org/T156241) (owner: 10Addshore) [20:26:47] !log addshore@tin Synchronized wmf-config/InitialiseSettings-labs.php: [[gerrit:341380|Disable Cognate on beta wiktionary sites]] T156241 Beta Only (duration: 00m 46s) [20:26:52] matt_flaschen: thanks! 
[20:26:52] (03CR) 10Subramanya Sastry: "wfm" [puppet] - 10https://gerrit.wikimedia.org/r/341290 (owner: 10Muehlenhoff) [20:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:54] T156241: Deploy Cognate extension to beta - https://phabricator.wikimedia.org/T156241 [20:27:18] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=885.80 Read Requests/Sec=2923.00 Write Requests/Sec=22.60 KBytes Read/Sec=18280.00 KBytes_Written/Sec=386.40 [20:28:09] 06Operations, 10DBA, 13Patch-For-Review: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768#3077270 (10Marostegui) >>! In T153768#3077023, @jcrespo wrote: > There are now 2 backups happening on dbstore1001, one on 201703061505 and another on 201703061552, one from screen, a... [20:31:16] (03Draft2) 10MarcoAurelio: Modify add/remove groups for I984157d5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341382 [20:32:05] (03PS3) 10MarcoAurelio: Modify add/remove groups for I984157d5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341382 [20:32:48] I guess that deploying ^ now is not possible? [20:33:03] if not, I'll wait for the next swat [20:33:55] SWAT is over. Changes that only affect Beta can be deployed at any time (as long as they don't conflict with something ongoing). 
[20:34:09] (03CR) 10Dzahn: [C: 032] toollabs: Preparing to move `/usr/local/bin/crontab` to labs/toollabs [puppet] - 10https://gerrit.wikimedia.org/r/336990 (https://phabricator.wikimedia.org/T156174) (owner: 10Zhuyifei1999) [20:34:12] wikitech is not beta so I'll wait then [20:34:19] (03PS7) 10Dzahn: toollabs: Preparing to move `/usr/local/bin/crontab` to labs/toollabs [puppet] - 10https://gerrit.wikimedia.org/r/336990 (https://phabricator.wikimedia.org/T156174) (owner: 10Zhuyifei1999) [20:34:25] !log Ran (time mwscript extensions/Flow/maintenance/convertNamespaceFromWikitext.php --wiki=cawiki 'Viquiprojecte_Discussió') 2>&1|tee --append ~/2017-03-02_cawiki_convertNamespacesFromWikitext_Viquiprojecte_Discussió.log [20:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:39] !log For T159047 [20:34:40] !log reimport waterlines data on maps1001.eqiad.wmnet - T159631 [20:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:45] T159047: Enabling Flow in cawiki 'Viquiprojecte Discussió' namespace - https://phabricator.wikimedia.org/T159047 [20:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:50] T159631: Broken tiles at z10+ - https://phabricator.wikimedia.org/T159631 [20:40:17] (03PS4) 10Dzahn: bast3001: remove from network/constants.pp [puppet] - 10https://gerrit.wikimedia.org/r/340813 (https://phabricator.wikimedia.org/T156506) [20:40:28] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [20:41:28] RECOVERY - puppet last run on elastic1047 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [20:42:18] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=245.30 Read Requests/Sec=265.40 Write Requests/Sec=3.90 KBytes Read/Sec=5611.20 
KBytes_Written/Sec=196.80 [20:43:02] (03CR) 10Mobrovac: [C: 031] Add beta hewiktionary to restbase config [puppet] - 10https://gerrit.wikimedia.org/r/341014 (https://phabricator.wikimedia.org/T158628) (owner: 10Addshore) [20:43:05] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: collect usual GC metrics for Blazegraph JVMs - https://phabricator.wikimedia.org/T159248#3077306 (10Smalyshev) 05Open>03Resolved [20:43:13] 06Operations, 10MediaWiki-General-or-Unknown, 13Patch-For-Review: foreachwikiindblist regular cronspam - https://phabricator.wikimedia.org/T159438#3067511 (10matmarex) And? [20:45:01] MatmaRex: Guess we need a root to report cronspam failures :) [20:46:17] (03CR) 10Dzahn: [C: 031] "confirmed 8003, 8010 and 8142 are http://localhost: lines in parsoid-testing.nginx. 8011 does not appear, is open by nodejs but just on tc" [puppet] - 10https://gerrit.wikimedia.org/r/341290 (owner: 10Muehlenhoff) [20:46:30] !log removing old cdh packages from thirdparty component in apt [20:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:45] (03CR) 10Mobrovac: "LGTM, but the DNS record doesn't seem to be in place yet." [puppet] - 10https://gerrit.wikimedia.org/r/340997 (owner: 10Giuseppe Lavagetto) [20:47:53] 06Operations, 06Discovery, 10Traffic, 06WMDE-Tech-Communication, and 3 others: announce breaking change: http > https for entities in rdf - https://phabricator.wikimedia.org/T154015#3077327 (10Smalyshev) p:05Normal>03Low Note that this is not a traffic/ops question, it's wikidata modeling question, and... 
[20:48:14] (03CR) 10Dzahn: [C: 031] "actually, i'll go ahead and merge this since it will not have an effect until base::firewall will be added (and also it's testing and look" [puppet] - 10https://gerrit.wikimedia.org/r/341290 (owner: 10Muehlenhoff) [20:48:29] (03PS2) 10Dzahn: Add ferm rules for ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/341290 (owner: 10Muehlenhoff) [20:51:47] (03PS3) 10Mattflaschen: Enable Flow on 'Viquiprojecte Discussió' on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340455 (https://phabricator.wikimedia.org/T159047) [20:53:12] (03CR) 10Mattflaschen: [C: 032] Enable Flow on 'Viquiprojecte Discussió' on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340455 (https://phabricator.wikimedia.org/T159047) (owner: 10Mattflaschen) [20:54:42] (03PS3) 10Gehel: Add more metrics to Blazegraph monitoring [puppet] - 10https://gerrit.wikimedia.org/r/340695 (owner: 10Smalyshev) [20:54:45] (03CR) 10Smalyshev: [C: 031] wdqs: cleanup old GC logs [puppet] - 10https://gerrit.wikimedia.org/r/340940 (https://phabricator.wikimedia.org/T159248) (owner: 10Gehel) [20:54:47] (03Merged) 10jenkins-bot: Enable Flow on 'Viquiprojecte Discussió' on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340455 (https://phabricator.wikimedia.org/T159047) (owner: 10Mattflaschen) [20:54:58] 06Operations, 10Analytics, 10ChangeProp, 10Reading-Web-Trending-Service, 06Services (watching): Build and Install librdkafka 0.9.4 on SCB - https://phabricator.wikimedia.org/T159379#3065770 (10mobrovac) There's no need to have downtime at all for the upgrade - we have multiple hosts for these services an... [20:55:11] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review: Move cloudera packages to a separate archive section - https://phabricator.wikimedia.org/T155726#3077357 (10Ottomata) Ha! OOPS I knew I would make a mistake when cleaning up old packages! I accidentally removed almost all CDH pack... 
[20:55:14] (03CR) 10jenkins-bot: Enable Flow on 'Viquiprojecte Discussió' on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340455 (https://phabricator.wikimedia.org/T159047) (owner: 10Mattflaschen) [20:55:51] 06Operations, 10Analytics, 10ChangeProp, 10Reading-Web-Trending-Service, 06Services (watching): Build and Install librdkafka 0.9.4 on SCB - https://phabricator.wikimedia.org/T159379#3077359 (10Pchelolo) >>! In T159379#3077355, @mobrovac wrote: > There's no need to have downtime at all for the upgrade - w... [20:56:14] 06Operations, 10Analytics, 10ChangeProp, 10Reading-Web-Trending-Service, 06Services (watching): Build and Install librdkafka 0.9.4 on SCB - https://phabricator.wikimedia.org/T159379#3077373 (10mobrovac) [20:56:37] (03PS2) 10EBernhardson: deployment-prep: Use elasticsearch 5.x [puppet] - 10https://gerrit.wikimedia.org/r/341372 [20:56:39] (03PS1) 10EBernhardson: deployment-prep: Use apt experimental for elasticsearch servers [puppet] - 10https://gerrit.wikimedia.org/r/341398 [20:59:20] (03CR) 10Dzahn: [C: 032] Add ferm rules for ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/341290 (owner: 10Muehlenhoff) [21:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170306T2100). [21:00:06] 06Operations, 10Ops-Access-Requests, 06Research-and-Data: add the #acl*operations-team to the s9 analytics space for nda approvals - https://phabricator.wikimedia.org/T152718#3077384 (10leila) @RobH sounds good. @ggellerman: this is now declined. I know that you wanted to work with Dario to make it happen. [21:00:16] Nothing for ORES today? 
[21:00:31] (03CR) 10Gehel: [C: 032] Add more metrics to Blazegraph monitoring [puppet] - 10https://gerrit.wikimedia.org/r/340695 (owner: 10Smalyshev) [21:00:34] volans, no parsoid deploy today [21:00:40] (03PS5) 10Dzahn: bast3001: remove from network/constants.pp [puppet] - 10https://gerrit.wikimedia.org/r/340813 (https://phabricator.wikimedia.org/T156506) [21:00:49] (03PS4) 10Gehel: Add more metrics to Blazegraph monitoring [puppet] - 10https://gerrit.wikimedia.org/r/340695 (owner: 10Smalyshev) [21:00:50] !log mattflaschen@tin Synchronized wmf-config/InitialiseSettings.php: Enable Flow for Viquiprojecte Discussió on cawiki (duration: 00m 40s) [21:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:04] gehel: !start rebase-race :) [21:01:20] mutante: please go ahead! [21:01:24] lets gehel go first [21:01:54] subbu: ? [21:02:09] sorry. misfire. :) [21:02:13] mutante: and now we are in a politness deadlock... [21:02:22] gehel: whoever gets the V+2 should hit the submit button [21:02:29] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: No anomaly detected [21:02:38] PROBLEM - puppet last run on mw1248 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:02:43] (03CR) 10Dzahn: [C: 032] bast3001: remove from network/constants.pp [puppet] - 10https://gerrit.wikimedia.org/r/340813 (https://phabricator.wikimedia.org/T156506) (owner: 10Dzahn) [21:02:47] !log populateContentModel.php --wiki=cawiki --ns=103 run for revision, archive, page . 
T159047 complete [21:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:53] T159047: Enabling Flow in cawiki 'Viquiprojecte Discussió' namespace - https://phabricator.wikimedia.org/T159047 [21:03:07] gehel: ok, but now :) [21:03:29] (03PS1) 10Addshore: Enable Cognate on beta wiktionary sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341401 [21:03:37] (03PS1) 10Hashar: Change Zuul Gearman alarm to a simple threshold [puppet] - 10https://gerrit.wikimedia.org/r/341402 (https://phabricator.wikimedia.org/T70113) [21:04:02] (03PS2) 10Addshore: Enable Cognate on beta wiktionary sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341401 [21:04:14] because i have more, but in a few [21:04:44] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#3077452 (10Lydia_Pintscher) [21:04:48] 06Operations, 06Discovery, 10Traffic, 06WMDE-Tech-Communication, and 3 others: announce breaking change: http > https for entities in rdf - https://phabricator.wikimedia.org/T154015#3077451 (10Lydia_Pintscher) 05Open>03stalled [21:05:02] (03CR) 10EBernhardson: deployment-prep: Use apt experimental for elasticsearch servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/341398 (owner: 10EBernhardson) [21:05:35] (03PS1) 10Urbanecm: [fixup] Fix up wrongly updated sr.wikibooks and bs.wiktionary logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341403 (https://phabricator.wikimedia.org/T159542) [21:07:43] Should I schedule the fixup patch above for next EU SWAT or would it be possible to merge and deploy it now? [21:08:04] Urbanecm: I can do it [21:08:11] hashar, thank you. 
[21:08:12] assuming there is nothing going on right now [21:08:13] jouncebot: now [21:08:14] For the next 0 hour(s) and 51 minute(s): Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170306T2100) [21:09:42] (03CR) 10Hashar: [C: 032] [fixup] Fix up wrongly updated sr.wikibooks and bs.wiktionary logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341403 (https://phabricator.wikimedia.org/T159542) (owner: 10Urbanecm) [21:11:25] Urbanecm: it will merge eventually [21:12:31] ok [21:14:20] (03Merged) 10jenkins-bot: [fixup] Fix up wrongly updated sr.wikibooks and bs.wiktionary logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341403 (https://phabricator.wikimedia.org/T159542) (owner: 10Urbanecm) [21:14:37] (03CR) 10jenkins-bot: [fixup] Fix up wrongly updated sr.wikibooks and bs.wiktionary logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341403 (https://phabricator.wikimedia.org/T159542) (owner: 10Urbanecm) [21:15:06] (03PS5) 10Smalyshev: Add more metrics to Blazegraph monitoring [puppet] - 10https://gerrit.wikimedia.org/r/340695 [21:18:02] Urbanecm: ok pushing to mwdebug1001 [21:18:28] hashar, ack [21:18:50] Urbanecm: it is on mwdebug1001 if you can double check [21:18:53] hashar: https://phabricator.wikimedia.org/T159591 [21:19:34] hashar, checking [21:20:00] Krinkle: yeah I more or less seen that. It is a good idea :] [21:20:28] Krinkle: unrelated, there is a karma reporter that slightly enhance qunit console log https://integration.wikimedia.org/ci/job/mwext-qunit-jessie/8844/console :D [21:20:59] hashar, https://upload.wikimedia.org/wikipedia/commons/9/9c/Wiktionary_logo_bs-w-1x.png (source image) has different size than https://cs.wikipedia.org/static/images/project-logos/bswiktionary.png . [21:21:36] I just downloaded them, renamed them and commited them IIRC. 
[21:21:37] (03CR) 10Dzahn: [C: 032] Change Zuul Gearman alarm to a simple threshold [puppet] - 10https://gerrit.wikimedia.org/r/341402 (https://phabricator.wikimedia.org/T70113) (owner: 10Hashar) [21:21:40] Urbanecm: the one you sent in Gerrit has md5sum 931deebb9496a1938dd583af7cad5db2 [21:22:10] 06Operations: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3077541 (10RobH) [21:22:13] mutante: I should have used a threshold in the first place as you suggested. Might have to tweak the thresholds later on though [21:22:27] (03PS8) 10Dzahn: toollabs: Preparing to move `/usr/local/bin/crontab` to labs/toollabs [puppet] - 10https://gerrit.wikimedia.org/r/336990 (https://phabricator.wikimedia.org/T156174) (owner: 10Zhuyifei1999) [21:23:38] hashar: yea, some tweaking will be expected, but that's going to be easier [21:23:47] Krinkle: castor only saves ~/.npm not the node_modules dir [21:23:59] hashar: Ah, okay. That's good. [21:24:04] Krinkle: there is also a gotcha which is that for mediawiki and extensions the cache is shared for all of them [21:24:04] I was wondering about that [21:24:10] so it is definitely pilling up growing up [21:24:16] hashar, why the previous image (which was wrongly named -1x), the present image and the image on the server (I just downloaded, commited and uploaded) has different md5sums? [21:24:28] hashar: Yeah, but this means at least we are benefitting from the cache [21:24:35] I'll close this task, never mind :) [21:24:45] .npm cache is better than in the workspace [21:24:55] because of all the conflict race conditions etc. [21:24:58] Krinkle: maybe we can find a way to cache node_module though. That would speed it up [21:25:05] hashar, don't look at my question above, because I counted md5sum of different files... 
[21:25:13] the indirection of npm looking at the cache and copying it will validate everything and fallback to replacing with a new copy, which is perfect and self-correcting [21:25:48] hashar: Nah, even with a local directory, it should still contact npmjs.org to validate the current versions. It's just a matter of copying or linking to local directory, not worth it :) [21:25:53] Especially wihit all the bugs that will come along with it [21:26:07] Krinkle: in ruby world, eveyrthing ends up in ~/.gems or similar, and the equivalent of npm just tweak the PATH that is looked up [21:26:38] have you looked at yarn? The rewriting of npm by Facebook? [21:27:51] Urbanecm: should I deploy the hchange to the rest of the fleet? [21:28:08] hashar, examining it. It make me confused. [21:28:11] *makes [21:28:46] 06Operations, 10netops: netmon1002 networking setup - https://phabricator.wikimedia.org/T159757#3077581 (10RobH) [21:29:50] 06Operations: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3077541 (10RobH) [21:30:38] RECOVERY - puppet last run on mw1248 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [21:31:24] hashar, yes, please deploy it. [21:33:49] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:35:07] Urbanecm: doing [21:35:13] ack [21:35:49] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:35:56] criticial [21:36:03] mutante: and I made a typo obviously :/ [21:36:13] !log hashar@tin Synchronized static/images/project-logos: [fixup] Fix up wrongly updated sr.wikibooks and bs.wiktionary logos - T159542 T159534 (duration: 00m 42s) [21:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:19] T159542: Update bs.wiktionary logo - https://phabricator.wikimedia.org/T159542 [21:36:20] T159534: Update sr.wikibooks logo - https://phabricator.wikimedia.org/T159534 [21:37:01] (03PS1) 10Hashar: Typo in Zuul monitoring probe [puppet] - 10https://gerrit.wikimedia.org/r/341423 [21:37:07] hashar: should i check on contint? ok, got it :) [21:37:10] mutante: https://gerrit.wikimedia.org/r/341423 :( [21:38:18] aww, that took me a moment even in diff [21:38:44] (03PS2) 10Dzahn: zuul: typo in Icinga monitoring probe [puppet] - 10https://gerrit.wikimedia.org/r/341423 (owner: 10Hashar) [21:38:50] (03CR) 10Dzahn: [C: 032] zuul: typo in Icinga monitoring probe [puppet] - 10https://gerrit.wikimedia.org/r/341423 (owner: 10Hashar) [21:39:06] (03CR) 10Dzahn: [V: 032 C: 032] zuul: typo in Icinga monitoring probe [puppet] - 10https://gerrit.wikimedia.org/r/341423 (owner: 10Hashar) [21:39:12] (03PS1) 10Hashar: labstore: fix typo in snapshot-manager [puppet] - 10https://gerrit.wikimedia.org/r/341427 [21:39:38] chasemp: I found a nasty typo in labstore snapshot-manager.py . logging.criticial() (instead of critical) https://gerrit.wikimedia.org/r/341427 [21:39:43] Urbanecm: deployed! [21:40:10] (03CR) 10Hashar: "I have no idea about the impact. Just noticed that typo :]" [puppet] - 10https://gerrit.wikimedia.org/r/341427 (owner: 10Hashar) [21:40:42] hashar: is 2 enough cases for a line in "typos" file ? 
[21:40:48] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [21:40:49] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [21:40:55] hashar: ^ puppet ran :) [21:41:05] hashar, thank you [21:42:01] (03CR) 10Dzahn: [C: 031] labstore: fix typo in snapshot-manager [puppet] - 10https://gerrit.wikimedia.org/r/341427 (owner: 10Hashar) [21:42:29] (03PS9) 10Dzahn: toollabs: Preparing to move `/usr/local/bin/crontab` to labs/toollabs [puppet] - 10https://gerrit.wikimedia.org/r/336990 (https://phabricator.wikimedia.org/T156174) (owner: 10Zhuyifei1999) [21:42:32] mutante: puppet is happy. Thanks [21:43:03] mutante: typo I dont know. Yeah maybe it is worth it [21:44:25] (03PS1) 10Dzahn: typos: add 'criticial' [puppet] - 10https://gerrit.wikimedia.org/r/341434 [21:48:14] !log bast3001 - schedule downtime for host and all services in Icinga, remove from puppet, salt .. (T159480) [21:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:20] T159480: Decommission bast3001 - https://phabricator.wikimedia.org/T159480 [21:48:28] PROBLEM - puppet last run on oresrdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:48:59] hashar: i'm just waiting for jenkins-bot to confirm there are no other cases of it in the repo [21:50:37] hashar: mutante good find thanks [21:51:16] !log bast3001 - powerdown (T159480), decom in progress [21:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:29] chasemp: :) [21:53:10] (03CR) 10jerkins-bot: [V: 04-1] typos: add 'criticial' [puppet] - 10https://gerrit.wikimedia.org/r/341434 (owner: 10Dzahn) [21:53:28] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 3 below the confidence bounds [21:53:30] 06Operations, 10Internet-Archive, 06Offline-Working-Group: Create backups of Wikimedia content in diverse geographic places - https://phabricator.wikimedia.org/T156544#3077677 (10Pine) Hi @Ottomata, I too am not understanding why this would be low priority. Can you please explain? [21:53:34] heh @ jenkins-bot [21:53:55] (03PS1) 10Dzahn: site.pp: remove bast3001 (decom) [puppet] - 10https://gerrit.wikimedia.org/r/341441 (https://phabricator.wikimedia.org/T159480) [21:54:05] 06Operations, 10Traffic: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#3077681 (10JKatzWMF) @Ottomata Thanks for voicing your uncertainty. I am also uncertain/confused about the cause (T87276), or the solution as it is a bit out of my technical d... 
[21:54:08] it finds the one in snapshot-manager, so far so good [21:54:53] (03CR) 10Dzahn: [C: 031] "after that we can rebase https://gerrit.wikimedia.org/r/#/c/341434/ and see if it gets V+2 then" [puppet] - 10https://gerrit.wikimedia.org/r/341427 (owner: 10Hashar) [21:56:39] ACKNOWLEDGEMENT - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 3 below the confidence bounds amusso It is broken and being replaced. [21:57:24] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] [22:00:04] dapatrick, bawolff, and Reedy: Respected human, time to deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170306T2200). Please do the needful. [22:05:10] (03CR) 10Dzahn: [C: 032] site.pp: remove bast3001 (decom) [puppet] - 10https://gerrit.wikimedia.org/r/341441 (https://phabricator.wikimedia.org/T159480) (owner: 10Dzahn) [22:07:10] 06Operations, 10hardware-requests, 13Patch-For-Review: Decommission bast3001 - https://phabricator.wikimedia.org/T159480#3077712 (10Dzahn) [22:08:04] (03CR) 10jerkins-bot: [V: 04-1] site.pp: remove bast3001 (decom) [puppet] - 10https://gerrit.wikimedia.org/r/341441 (https://phabricator.wikimedia.org/T159480) (owner: 10Dzahn) [22:08:18] jerkins-bot? [22:08:47] there are 2 unaccepted salt keys, "lead" and "potassium" [22:08:49] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit: can loose data if it crashes - https://phabricator.wikimedia.org/T159743#3077029 (10greg) I don't see why we need this task. There's always tons of changes that happen in an upstream that fix minor along with major issues. We don't need a task for t... 
[22:09:00] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit: can lose data if it crashes - https://phabricator.wikimedia.org/T159743#3077716 (10greg) [22:09:06] if you know if they should be accepted or removed.. let me know [22:09:35] the other ones are already cleaned up [22:10:03] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit: can lose data if it crashes - https://phabricator.wikimedia.org/T159743#3077029 (10greg) 05Open>03Invalid Per T159744 also invaliding. [22:11:00] (I created T159759 about jerkins-bot) [22:11:01] T159759: "jerkins-bot" - https://phabricator.wikimedia.org/T159759 [22:11:24] (03PS2) 10Dzahn: site.pp: remove bast3001 (decom) [puppet] - 10https://gerrit.wikimedia.org/r/341441 (https://phabricator.wikimedia.org/T159480) [22:12:26] 06Operations, 10Wikibugs: "jerkins-bot" - https://phabricator.wikimedia.org/T159759#3077768 (10greg) I noticed this recently, too. I presume it's a hack to make #operations team feel better about their -1s :) Only happens in their channel, afaict. [22:13:16] (03PS1) 10Dzahn: bast3001: remove production IPs, keep mgmt [dns] - 10https://gerrit.wikimedia.org/r/341448 (https://phabricator.wikimedia.org/T159480) [22:13:36] 06Operations, 10Wikibugs: "jerkins-bot" - https://phabricator.wikimedia.org/T159759#3077729 (10yuvipanda) I built the first version of this, and it didn't work. Someone fixed it later :D [22:14:17] 06Operations, 10Wikibugs: "jerkins-bot" - https://phabricator.wikimedia.org/T159759#3077729 (10Dzahn) i assumed jerkins-bot is just the alternative nickname it falls back to when jenkins-bot is taken, for example by its own ghost in case of netsplits or so.. [22:15:31] 06Operations, 10Wikibugs: "jerkins-bot" - https://phabricator.wikimedia.org/T159759#3077801 (10Luke081515) Hm, but I think "jerkins-bot" in this case means the gerrit-user, not the IRC-Nick? The IRC nick of the bot is currently wikibugs_, so I don't think this is related to nick-conflicts at IRC...? 
[22:15:50] 06Operations, 10Wikibugs: "jerkins-bot" - https://phabricator.wikimedia.org/T159759#3077802 (10greg) >>! In T159759#3077778, @yuvipanda wrote: > I built the first version of this, and it didn't work. Someone fixed it later :D where is "it"? [22:16:23] 06Operations, 10Wikibugs: "jerkins-bot" - https://phabricator.wikimedia.org/T159759#3077804 (10Dzahn) Yea, it's not related. Ignore that comment (deleted). [22:16:24] RECOVERY - puppet last run on oresrdb1001 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [22:16:35] PROBLEM - OCG health on ocg1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:17:14] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=526.50 Read Requests/Sec=548.20 Write Requests/Sec=28.70 KBytes Read/Sec=35142.80 KBytes_Written/Sec=1446.40 [22:17:58] 06Operations, 10Wikibugs: "jerkins-bot" - https://phabricator.wikimedia.org/T159759#3077810 (10greg) >>! In T159759#3077802, @greg wrote: >>>! In T159759#3077778, @yuvipanda wrote: >> I built the first version of this, and it didn't work. Someone fixed it later :D > > where is "it"? nvm, see it: https://phab... [22:18:18] ACKNOWLEDGEMENT - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=2633.80 Read Requests/Sec=914.70 Write Requests/Sec=1.00 KBytes Read/Sec=35779.60 KBytes_Written/Sec=33.20 daniel_zahn spikes are normal [22:18:38] 06Operations, 10Wikibugs: "jerkins-bot" - https://phabricator.wikimedia.org/T159759#3077813 (10greg) 05Open>03declined On purpose. Humor is good. Moving along. 
:) [22:21:02] (03PS3) 10Dzahn: site.pp: remove bast3001 (decom) [puppet] - 10https://gerrit.wikimedia.org/r/341441 (https://phabricator.wikimedia.org/T159480) [22:21:09] (03CR) 10Dzahn: [V: 032 C: 032] site.pp: remove bast3001 (decom) [puppet] - 10https://gerrit.wikimedia.org/r/341441 (https://phabricator.wikimedia.org/T159480) (owner: 10Dzahn) [22:22:24] 06Operations, 10MediaWiki-API, 10Traffic: Varnish does not cache Action API responses when logged in - https://phabricator.wikimedia.org/T155314#3077838 (10Tgr) >>! In T155314#2945672, @Anomie wrote: > A rough idea might be to add code (probably in SessionBackend and `User::loadFromSession()`?) to set some f... [22:22:49] (03CR) 10Dzahn: [C: 032] bast3001: remove production IPs, keep mgmt [dns] - 10https://gerrit.wikimedia.org/r/341448 (https://phabricator.wikimedia.org/T159480) (owner: 10Dzahn) [22:25:00] 06Operations, 10hardware-requests, 13Patch-For-Review: Decommission bast3001 - https://phabricator.wikimedia.org/T159480#3077841 (10Dzahn) [22:25:47] 06Operations, 10ops-esams: Degraded RAID on bast3001 - https://phabricator.wikimedia.org/T154603#3077842 (10Dzahn) 05Open>03declined declined. we shut down bast3001 and replaced it with bast3002 and this hardware will be removed eventually. [22:26:20] 06Operations, 10hardware-requests, 13Patch-For-Review: Decommission bast3001 - https://phabricator.wikimedia.org/T159480#3068928 (10Dzahn) [22:27:14] (03PS2) 10Dzahn: Add beta hewiktionary to restbase config [puppet] - 10https://gerrit.wikimedia.org/r/341014 (https://phabricator.wikimedia.org/T158628) (owner: 10Addshore) [22:27:52] 06Operations, 10hardware-requests, 13Patch-For-Review: Decommission bast3001 - https://phabricator.wikimedia.org/T159480#3077862 (10Dzahn) a:05Dzahn>03RobH @Robh see checkboxes above, i did all that i could except the switch ports, can you do these and then forward the ticket? 
thank you [22:28:36] Operations, hardware-requests: Decommission bast3001 - https://phabricator.wikimedia.org/T159480#3077867 (Dzahn) [22:31:38] ottomata: i looked at servermon to check something else and noticed "analytics1040" is reported as not having talked to puppet master in a while [22:31:54] (PS1) Dzahn: delete unused bastionhost::migration class [puppet] - https://gerrit.wikimedia.org/r/341451 (https://phabricator.wikimedia.org/T156506) [22:32:10] mutante: ja [22:32:11] ... [22:32:13] (CR) Dzahn: [C: 2] Add beta hewiktionary to restbase config [puppet] - https://gerrit.wikimedia.org/r/341014 (https://phabricator.wikimedia.org/T158628) (owner: Addshore) [22:32:23] https://phabricator.wikimedia.org/T159530 [22:32:41] elukey: is having partman woes [22:32:47] and is practicing reimaging that host [22:32:56] ottomata: aah, that one! gotcha, all cool [22:32:58] but he quit for the day, and didn't want puppet to spawn some daemons [22:33:00] i think he logged it [22:33:03] not sure [22:34:32] *nod*, thank you [22:35:27] (PS2) Dzahn: delete unused bastionhost::migration class [puppet] - https://gerrit.wikimedia.org/r/341451 (https://phabricator.wikimedia.org/T156506) [22:36:14] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=131.60 Read Requests/Sec=183.20 Write Requests/Sec=63.00 KBytes Read/Sec=4321.20 KBytes_Written/Sec=371.20 [22:36:45] (CR) Dzahn: [C: 2] delete unused bastionhost::migration class [puppet] - https://gerrit.wikimedia.org/r/341451 (https://phabricator.wikimedia.org/T156506) (owner: Dzahn) [22:37:13] jouncebot: now [22:37:13] For the next 1 hour(s) and 22 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170306T2200) [22:39:42] Operations, hardware-requests: Decommission bast3001 - https://phabricator.wikimedia.org/T159480#3077920 (RobH) [22:40:50] (PS3) Dzahn: Add beta hewiktionary to restbase config [puppet]
- https://gerrit.wikimedia.org/r/341014 (https://phabricator.wikimedia.org/T158628) (owner: Addshore) [22:41:30] Operations, hardware-requests: Decommission bast3001 - https://phabricator.wikimedia.org/T159480#3068928 (RobH) a:RobH>mark I've disabled the switch port, and updated the task with the port assignment. Assigned to @mark for onsite wipe of disks and unracking. [22:41:44] PROBLEM - puppet last run on elastic1044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:43:51] (PS4) Dzahn: add netmon1002 to site [puppet] - https://gerrit.wikimedia.org/r/333780 (https://phabricator.wikimedia.org/T159756) [22:44:43] (CR) Dzahn: "i didn't make this today, this was just sitting here as stalled. the idea would be to activate the roles one-by-one" [puppet] - https://gerrit.wikimedia.org/r/333780 (https://phabricator.wikimedia.org/T159756) (owner: Dzahn) [22:47:03] Reedy: i think here (https://gerrit.wikimedia.org/r/#/c/337248/) maybe just add python3-pil but don't remove python-imaging [22:47:12] per comments.. but hmm [22:48:50] Operations, hardware-requests: Decommission bast3001 - https://phabricator.wikimedia.org/T159480#3077946 (Dzahn) p:Triage>Normal [22:50:03] Operations, ops-esams, hardware-requests: Decommission bast3001 - https://phabricator.wikimedia.org/T159480#3077948 (RobH) [22:53:29] Operations, Analytics, ChangeProp, Reading-Web-Trending-Service, Services (watching): Build and Install librdkafka 0.9.4 on SCB - https://phabricator.wikimedia.org/T159379#3077953 (Ottomata) By adding a new version of librdkafka to our apt repo, it has the chance that it might also be install...
[22:55:28] Operations, ops-codfw, netops: codfw: ms-be2028-ms-be2039/switch port configuration - https://phabricator.wikimedia.org/T159765#3077959 (Papaul) [22:56:55] (PS2) Volans: Add support for batch processing [software/cumin] - https://gerrit.wikimedia.org/r/341310 [23:02:50] Operations, Education-Program-Dashboard, Programs-and-Events-Dashboard-Sprint 2, Spike: Spike: What do we have to package to run the Programs and Events dashboard on production? - https://phabricator.wikimedia.org/T126295#3078016 (Ragesoss) The initial work related to this is captured here: https... [23:07:54] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4197051 keys, up 126 days 14 hours - replication_delay is 48 [23:09:44] RECOVERY - puppet last run on elastic1044 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [23:12:37] (PS1) Papaul: Partman: Add ms-be20[2-3][0-9] [puppet] - https://gerrit.wikimedia.org/r/341460 [23:14:54] Operations, ops-codfw, hardware-requests, Patch-For-Review: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3078075 (RobH) [23:15:13] Operations, ops-codfw, hardware-requests, Patch-For-Review: Reclaim/Decommission old codfw mc2001->mc2016 hosts - https://phabricator.wikimedia.org/T157675#3012720 (RobH) Open>Resolved removed descriptions from disabled switch ports, resolving task.
[23:17:54] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 648 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4197055 keys, up 126 days 14 hours - replication_delay is 648 [23:19:18] Operations, Epic, Interactive-Sprint, Maps (Maps-data): Epic: backup vector tiles - https://phabricator.wikimedia.org/T159770#3078079 (MaxSem) [23:19:48] Operations, ops-codfw, netops: codfw: ms-be2028-ms-be2039/switch port configuration - https://phabricator.wikimedia.org/T159765#3078094 (RobH) [23:21:13] papaul: did someone partially setup those switch configs already? [23:21:18] cuz the row a ones are already there [23:21:35] also seems odd that its linked to a ram check on an unrelated system? [23:22:22] Operations, Epic, Interactive-Sprint, Maps (Maps-data): Epic: backup vector tiles - https://phabricator.wikimedia.org/T159770#3078079 (Pnorman) I'm not aware of anyone that does this with either vector or raster tiles. For schema changes they generally have an ability to roll back to a previous v... [23:23:14] Operations, ops-codfw, netops: codfw: ms-be2028-ms-be2039/switch port configuration - https://phabricator.wikimedia.org/T159765#3078101 (RobH) Papaul: This seems to be a duplicate of T158714, but they have different info for some of the ports. Also this links to a task about wtp2019 currently, whic... [23:23:46] Operations, ops-codfw, netops: codfw: ms-be2028-ms-be2039/switch port configuration - https://phabricator.wikimedia.org/T159765#3078104 (RobH) a:RobH>Papaul Basically we'll need to know if the details in this task are correct, or if previously done T158714 is correct.
[23:23:52] Operations, Interactive-Sprint: import_waterlines is broken - https://phabricator.wikimedia.org/T159771#3078107 (MaxSem) [23:29:37] robh: some are using the mc servers ports that we decom [23:29:58] mc2001-mc2016 [23:30:05] papaul: that wasnt my question [23:30:12] you created a new task today about ms-be switch ports [23:30:18] that appears to be a duplicate of another task [23:30:25] see https://phabricator.wikimedia.org/T159765#3078104 =] [23:30:49] Operations, ops-codfw, netops: codfw:ms-be2028-ms-be2039 switch port configuration - https://phabricator.wikimedia.org/T158714#3078145 (RobH) [23:31:35] papaul: so now there are two different tasks for me to setup the switch ports on the exact same servers. One that I did weeks ago https://phabricator.wikimedia.org/T158714 and one that you created today https://phabricator.wikimedia.org/T159765 [23:31:55] and some of the port info is the same, and some is different. [23:32:18] I havent touched the setup of your new mc system ports yet. [23:32:26] (i was getting to that and found this other one) [23:32:43] (make sense?) [23:33:25] robh: looking give me a minute [23:33:40] no worries [23:35:24] robh: the correct one is https://phabricator.wikimedia.org/T159765#3078104 [23:36:17] robh: the other one was done before mc server were decom [23:36:32] ok, next time you should just leave them in place since the assignments were done [23:36:35] or open the old task [23:36:43] this new task is linked as a sub task to some unrelated task as well [23:36:50] so its a bit confusing, there was no explanation on the tasks. [23:37:03] So all the changes on previously done https://phabricator.wikimedia.org/T158714 have to be undone [23:37:13] and the changes on request on https://phabricator.wikimedia.org/T159765 done [23:37:26] that sound right?
[23:37:36] robh: was my mistake sorry about that [23:38:08] just dont change them in the future like that [23:38:11] its not ideal [23:38:15] once they are set, leave them in place [23:38:32] ill do this now, since its confusing, and will get to the mc stuff later [23:39:23] robh: if they are already set no need to change anything [23:39:37] i will just rewire them tomorrow [23:40:04] and leave the other ports like it was [23:40:11] the only differences are.... [23:40:35] papaul: the differences arent that great [23:40:44] i'll just redo but the confusion was my only concern [23:40:59] its also linked to some random task so i'll unlink and link to the setup task for ms-be systems [23:41:11] Operations, Interactive-Sprint: import_waterlines is broken - https://phabricator.wikimedia.org/T159771#3078107 (Pnorman) From https://phabricator.wikimedia.org/T159631#3078163, this job should - Produce logging to see what happened in the past - Report errors - Have monitoring on the results to see tha... [23:41:16] papaul: actually [23:41:35] sorry, that was a misping! [23:41:39] robh: just ignore the task I create today [23:41:43] arrowed up into a mid sentence... [23:42:15] robh: i will delete it we can keep the first one [23:42:40] well, just reject it then and comment is fine [23:42:40] robh: what i have to do is just move cables around to match the first task [23:42:44] but i can do the differences [23:42:45] robh: ok [23:43:06] robh: will just reject the task [23:43:10] also i dont see an mc port task [23:43:13] no need to do the differences [23:43:25] and like you said it is not linked to the right task as well [23:43:31] ok, make sure you move them back to whats on https://phabricator.wikimedia.org/T158714 [23:43:38] or else they wont match config [23:43:40] yes i will [23:44:38] Im looking for a task to setup your mc systems, i recall you saying i needed to do that? [23:44:40] but i dont see one. [23:44:47] (the switch ports i mean.)
[23:45:30] If its not in yet, you dont need to rush to put it in now! [23:45:35] I just wanted to make sure I wasn't blocking you. [23:46:17] robh: no you are not [23:46:40] robh: waiting for the DAC cables [23:47:11] Operations, ops-codfw: wtp2019 - hardware (RAM) check - https://phabricator.wikimedia.org/T146113#3078192 (Papaul) [23:47:15] Operations, ops-codfw, netops: codfw: ms-be2028-ms-be2039/switch port configuration - https://phabricator.wikimedia.org/T159765#3078190 (Papaul) Open>declined [23:47:48] oh yeah, those are ordered! [23:53:24] PROBLEM - puppet last run on db1070 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues