[02:18:35] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 9.241 second response time
[02:22:17] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:11:42] Operations, Domains, Traffic: Redirecting incoming queries to non-existent subpages - https://phabricator.wikimedia.org/T212914 (Thomas_Shafee) We're currently looking at moving off godaddy. Eventually we hope to have the Wikimedia take over hosting of the domain name ([[https://meta.wikimedia.org/wi...
[03:37:33] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 976.29 seconds
[04:09:27] Error
[04:09:27] Our servers are currently under maintenance or experiencing a technical problem. Please try again in a few minutes."
[04:09:31] :o
[04:10:06] 'PHP fatal error: entire web request took longer than 60 seconds and timed out
[04:10:08] '
[04:42:17] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 272.21 seconds
[05:39:31] (PS1) BryanDavis: toolforge: Add python3-venv to exec nodes [puppet] - https://gerrit.wikimedia.org/r/482578
[05:40:47] PROBLEM - HHVM rendering on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:43:05] RECOVERY - HHVM rendering on mw1278 is OK: HTTP OK: HTTP/1.1 200 OK - 80726 bytes in 0.134 second response time
[05:56:39] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 4.664 second response time
[06:00:23] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:29:49] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ImageMagick-6/policy.xml]
[06:30:07] PROBLEM - puppet last run on mw1278 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats]
[06:37:23] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.575 second response time
[06:41:07] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:55:53] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:56:11] RECOVERY - puppet last run on mw1278 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[07:10:25] HI
[07:10:28] I have question
[07:11:02] How I can limit moving pages at category namespaces for users with autopatrol and bigger rights on project?
[07:11:27] (disable moving categories for users without autopatrol right)
[07:20:17] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 1.787 second response time
[07:23:59] PROBLEM - pdfrender on scb1002 is CRITICAL: HTTP CRITICAL - No data received from host
[07:24:12] !log restart pdfrender on scb1002
[07:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:13] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[07:27:28] (PS1) Elukey: Decommission analytics103[6,7] from Analytics Hadoop [puppet] - https://gerrit.wikimedia.org/r/482582 (https://phabricator.wikimedia.org/T209929)
[07:28:42] (CR) Elukey: [C: +2] Decommission analytics103[6,7] from Analytics Hadoop [puppet] - https://gerrit.wikimedia.org/r/482582 (https://phabricator.wikimedia.org/T209929) (owner: Elukey)
[07:41:11] (PS1) Zoranzoki21: Restrict moving categories for users at srwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/482583
[07:45:02] (CR) Urbanecm: [C: -1] Restrict moving categories for users at srwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/482583 (owner: Zoranzoki21)
[07:46:49] (PS2) Zoranzoki21:
Restrict moving categories for users at srwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/482583
[07:47:01] (CR) Zoranzoki21: Restrict moving categories for users at srwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/482583 (owner: Zoranzoki21)
[07:48:00] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/482583 (owner: Zoranzoki21)
[07:50:43] (CR) Elukey: "Adding more people to get feedback, from pcc it seems that the code does what it is supposed to do but I am not sure about if the puppet c" [puppet] - https://gerrit.wikimedia.org/r/482275 (https://phabricator.wikimedia.org/T212949) (owner: Elukey)
[08:02:49] (PS1) Urbanecm: New throttle rule for University of Southern California editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/482586 (https://phabricator.wikimedia.org/T212917)
[08:03:38] (CR) jerkins-bot: [V: -1] New throttle rule for University of Southern California editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/482586 (https://phabricator.wikimedia.org/T212917) (owner: Urbanecm)
[08:04:55] PROBLEM - Memory correctable errors -EDAC- on kafka1014 is CRITICAL: 4 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=kafka1014&var-datasource=eqiad%2520prometheus%252Fops
[08:05:47] the link seems broken --^
[08:06:43] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/481943 (https://phabricator.wikimedia.org/T212597) (owner: MarcoAurelio)
[08:13:41] (CR) Muehlenhoff: wmcs::nfs::misc - Refactor into profile/role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527) (owner: GTirloni)
[08:16:54] (CR) Muehlenhoff: wmcs::nfs::misc - Refactor into profile/role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527) (owner: GTirloni)
[08:23:26] elukey:
indeed, mind filing a task?
[08:25:01] godog: will do!
[08:26:40] godog: I am checking base::monitoring::host, it seems that we explicitly add %20 rather than &, maybe it is just a matter to file a code change?
[08:27:45] ah no that is a - probably, or something similar
[08:27:49] yes will file a task
[08:29:57] elukey: yeah looks like too much url quoting
[08:30:41] !log rolling restart of swift backend servers to pick up OpenSSL security update
[08:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:20] Operations: base::monitoring::host's alarm dashboard links are broken - https://phabricator.wikimedia.org/T213052 (elukey)
[08:33:34] Operations: base::monitoring::host's alarm dashboard links are broken - https://phabricator.wikimedia.org/T213052 (elukey)
[08:39:40] (PS3) Zoranzoki21: Restrict moving categories for users at srwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/482583 (https://phabricator.wikimedia.org/T213050)
[08:43:47] (PS1) Zoranzoki21: Enable signature button in toolbar for the "Arbitration" namespace in ruwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/482591 (https://phabricator.wikimedia.org/T213049)
[08:57:04] (PS1) Muehlenhoff: Swift: Drop support for older distros [puppet] - https://gerrit.wikimedia.org/r/482593
[09:03:26] (CR) Muehlenhoff: "https://puppet-compiler.wmflabs.org/compiler1002/14175/" [puppet] - https://gerrit.wikimedia.org/r/482593 (owner: Muehlenhoff)
[09:10:01] Operations, Thumbor, Wikimedia-Logstash, serviceops, User-jijiki: Stream Thumbor logs to logstash - https://phabricator.wikimedia.org/T212946 (fgiunchedi) Also all but the chattier thumbor logs are already sent to logstash (cfr T150734), this will it'll be a good occasion to move thumbor to t...
[09:12:33] Operations, Icinga, monitoring, Patch-For-Review: re-create script for manual paging - https://phabricator.wikimedia.org/T82937 (Peachey88)
[09:13:02] Operations, Icinga, monitoring: Icinga monitoring for Yubikey components - https://phabricator.wikimedia.org/T151048 (Peachey88)
[09:13:31] PROBLEM - puppet last run on analytics1054 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/hadoop/data/c/yarn/logs]
[09:14:16] Operations, Puppet: Cleanup debconf handling in mailman puppet setup - https://phabricator.wikimedia.org/T144933 (Peachey88)
[09:14:32] Operations, Puppet: Decrease the amount of IRC spam in case of widespread puppet failures - https://phabricator.wikimedia.org/T188602 (Peachey88)
[09:15:04] Operations, monitoring: base::monitoring::host's alarm dashboard links are broken - https://phabricator.wikimedia.org/T213052 (fgiunchedi)
[09:15:57] Operations: SSL address space separation - https://phabricator.wikimedia.org/T83736 (Peachey88)
[09:17:34] can someone disable https://phabricator.wikimedia.org/p/Emma6969/ please - spam
[09:22:45] (CR) Filippo Giunchedi: [C: +1] "LGTM! deployment-prep is still on jessie though that doesn't affect this review." [puppet] - https://gerrit.wikimedia.org/r/482593 (owner: Muehlenhoff)
[09:28:24] (CR) Filippo Giunchedi: [C: +1] "LGTM (see typo in commit message)" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/482327 (https://phabricator.wikimedia.org/T212915) (owner: Elukey)
[09:32:51] Operations, ops-codfw, DBA, Patch-For-Review: es2019 is not responsive - https://phabricator.wikimedia.org/T212833 (Banyek) The comparison finished, and the data is OK.
[09:36:28] !log depooling db1079 for schema change - T85757
[09:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:32] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757
[09:36:41] Operations, monitoring: base::monitoring::host's alarm dashboard links are broken - https://phabricator.wikimedia.org/T213052 (Peachey88)
[09:36:57] Operations, Icinga, monitoring: base::monitoring::host's alarm dashboard links are broken - https://phabricator.wikimedia.org/T213052 (Peachey88)
[09:37:02] (CR) Banyek: [C: +2] mariadb: depool db1079 [mediawiki-config] - https://gerrit.wikimedia.org/r/481840 (https://phabricator.wikimedia.org/T85757) (owner: Banyek)
[09:40:19] (CR) Elukey: systemd::syslog: allow to add the 'stop' rule when needed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/482327 (https://phabricator.wikimedia.org/T212915) (owner: Elukey)
[09:40:47] (PS2) Elukey: systemd::syslog: allow to add the 'stop' rule when needed [puppet] - https://gerrit.wikimedia.org/r/482327 (https://phabricator.wikimedia.org/T212915)
[09:40:54] (PS3) Elukey: systemd::syslog: allow to add the 'stop' rule when needed [puppet] - https://gerrit.wikimedia.org/r/482327 (https://phabricator.wikimedia.org/T212915)
[09:41:23] (CR) Thiemo Kreuz (WMDE): [C: +1] "In MediaWiki core it says" [mediawiki-config] - https://gerrit.wikimedia.org/r/482501 (owner: Zoranzoki21)
[09:42:13] (PS3) Banyek: mariadb: depool db1079 [mediawiki-config] - https://gerrit.wikimedia.org/r/481840 (https://phabricator.wikimedia.org/T85757)
[09:45:19] (PS2) Zoranzoki21: .gitignore: Add Visual Studio Code in editors [mediawiki-config] - https://gerrit.wikimedia.org/r/482501
[09:45:40] (CR) Zoranzoki21: "> In MediaWiki core it says" [mediawiki-config] - https://gerrit.wikimedia.org/r/482501 (owner: Zoranzoki21)
[09:46:55] (CR) Zoranzoki21: [C: -1] New throttle rule
for University of Southern California editathon (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/482586 (https://phabricator.wikimedia.org/T212917) (owner: Urbanecm)
[09:47:42] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: depool db1079 for schema change - T85757 (duration: 01m 02s)
[09:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:44] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757
[09:47:52] Operations, Traffic: HTTP/2 requests fail with too-long URLs - https://phabricator.wikimedia.org/T209590 (Ricordisamoa) Thank you all for investigating. What would you recommend as the best practice w.r.t. URL length in client code? E.g. POST requests, hard-coded limits, retrying on 414-coded responses...
[09:49:30] (PS1) Elukey: role::analytics_cluster::refinery: unused, deleting it. [puppet] - https://gerrit.wikimedia.org/r/482608
[09:50:24] (CR) Elukey: [C: +2] role::analytics_cluster::refinery: unused, deleting it.
[puppet] - https://gerrit.wikimedia.org/r/482608 (owner: Elukey)
[09:53:45] !log stopping replication on db1079 - T85757
[09:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:47] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757
[09:53:47] (CR) jenkins-bot: mariadb: depool db1079 [mediawiki-config] - https://gerrit.wikimedia.org/r/481840 (https://phabricator.wikimedia.org/T85757) (owner: Banyek)
[09:55:40] !log executing schema change on db1079 with replication enabled - T85757
[09:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:37] (PS1) Zoranzoki21: Restrict moving categories for users at srwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/482609 (https://phabricator.wikimedia.org/T213055)
[09:57:12] (PS2) Zoranzoki21: Restrict moving categories for users at srwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/482609 (https://phabricator.wikimedia.org/T213055)
[10:08:35] (CR) Gehel: [C: +1] "I only spot checked some of the classes, but it looks trivial enough (and a very good idea)." [puppet] - https://gerrit.wikimedia.org/r/481818 (owner: Giuseppe Lavagetto)
[10:16:32] (CR) DCausse: [C: -1] Elasticsearch failed shard allocation check (6 comments) [puppet] - https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: Mathew.onipe)
[10:19:56] (PS2) Filippo Giunchedi: Onboard group1 to new logging infrastructure [mediawiki-config] - https://gerrit.wikimedia.org/r/481825 (https://phabricator.wikimedia.org/T211124)
[10:21:03] (CR) Volans: "REplies inline, I'll wait an agreement to send the follow up PS."
(6 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/482018 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[10:21:20] PROBLEM - MariaDB Slave Lag: s7 on db1079 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1592.11 seconds
[10:21:25] PROBLEM - MariaDB Slave Lag: s7 on db1125 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1597.64 seconds
[10:21:43] (PS3) Zoranzoki21: Update groupOverrides for srwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/482609 (https://phabricator.wikimedia.org/T213055)
[10:21:44] jynus: this is the welcome back from the cluster ;)
[10:22:10] lol
[10:22:22] it was missing his father
[10:22:24] hah! known/expected?
[10:22:35] neither for me, banyek ?
[10:22:54] unless I'm blind seems already recovered on 1079/1125 and moved to the lasbsdb behind them
[10:23:07] It was me
[10:23:09] *labsdb(s)
[10:23:19] it was muted, but it seems for too small period of time
[10:23:24] I gave 1800 seconds :/
[10:23:27] ah ha
[10:23:33] so a schema change?
[10:23:40] yes
[10:23:43] Operations, Cloud-Services, Wikibase-Quality, Wikibase-Quality-Constraints, and 4 others: Flood of WDQS requests from wbqc - https://phabricator.wikimedia.org/T204267 (Lydia_Pintscher)
[10:23:53] and they were depooled?
[10:24:06] yes
[10:24:07] ACKNOWLEDGEMENT - MariaDB Slave Lag: s7 on db1079 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1739.48 seconds Banyek T85757
[10:24:07] ACKNOWLEDGEMENT - MariaDB Slave Lag: s7 on db1125 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1743.56 seconds Banyek T85757
[10:24:20] ok, no harm, then
[10:24:42] yep, everything is under control :)
[10:24:47] (CR) Gehel: [C: -1] "see comments inline."
(4 comments) [puppet] - https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: Mathew.onipe)
[10:24:58] (PS4) GTirloni: Limit manifest starts (max 10) [software/tools-manifest] - https://gerrit.wikimedia.org/r/479181 (https://phabricator.wikimedia.org/T107878)
[10:25:39] (CR) jerkins-bot: [V: -1] Limit manifest starts (max 10) [software/tools-manifest] - https://gerrit.wikimedia.org/r/479181 (https://phabricator.wikimedia.org/T107878) (owner: GTirloni)
[10:25:41] (PS1) Hashar: admin: test for absent users [puppet] - https://gerrit.wikimedia.org/r/482611
[10:26:09] (CR) jerkins-bot: [V: -1] admin: test for absent users [puppet] - https://gerrit.wikimedia.org/r/482611 (owner: Hashar)
[10:26:30] I'll go ahead with moving group1 to new logging infra shortly
[10:27:23] !log restarting replication on db1079 - T85757
[10:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:26] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757
[10:28:53] (CR) Filippo Giunchedi: [C: +2] Onboard group1 to new logging infrastructure [mediawiki-config] - https://gerrit.wikimedia.org/r/481825 (https://phabricator.wikimedia.org/T211124) (owner: Filippo Giunchedi)
[10:29:13] godog: good luck :)
[10:29:21] hashar: thanks!
[10:29:56] RECOVERY - MariaDB Slave Lag: s7 on db1079 is OK: OK slave_sql_lag Replication lag: 28.51 seconds
[10:29:59] RECOVERY - MariaDB Slave Lag: s7 on db1125 is OK: OK slave_sql_lag Replication lag: 0.39 seconds
[10:30:04] jan_drewniak: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190107T1030).
[10:30:49] !log repooling db1079 after schema change - T85757
[10:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:07] !log filippo@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Move group1 to new logging infrastructure - T211124 (duration: 00m 45s)
[10:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:10] T211124: Move mediawiki to new logging infrastructure - https://phabricator.wikimedia.org/T211124
[10:31:15] (PS1) Banyek: Revert "mariadb: depool db1079" [mediawiki-config] - https://gerrit.wikimedia.org/r/482612
[10:31:36] (PS1) Elukey: profile::hive::client: add jdbc parameters [puppet] - https://gerrit.wikimedia.org/r/482613 (https://phabricator.wikimedia.org/T212256)
[10:32:51] (CR) jenkins-bot: Onboard group1 to new logging infrastructure [mediawiki-config] - https://gerrit.wikimedia.org/r/481825 (https://phabricator.wikimedia.org/T211124) (owner: Filippo Giunchedi)
[10:32:54] (CR) Banyek: [C: +2] Revert "mariadb: depool db1079" [mediawiki-config] - https://gerrit.wikimedia.org/r/482612 (owner: Banyek)
[10:34:02] (Merged) jenkins-bot: Revert "mariadb: depool db1079" [mediawiki-config] - https://gerrit.wikimedia.org/r/482612 (owner: Banyek)
[10:35:03] (CR) Elukey: [C: +2] profile::hive::client: add jdbc parameters [puppet] - https://gerrit.wikimedia.org/r/482613 (https://phabricator.wikimedia.org/T212256) (owner: Elukey)
[10:36:31] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: repool db1079 after schema change - T85757 (duration: 00m 44s)
[10:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:36] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757
[10:37:20] (PS16) GTirloni: wmcs::nfs::misc - Refactor into profile/role [puppet] - https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527)
[10:43:41]
(CR) Volans: "Thanks for the reviews, replies inline. I'll wait for an agreement on the few open questions before sending a followup PS." (7 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/482299 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[10:46:01] (CR) jenkins-bot: Revert "mariadb: depool db1079" [mediawiki-config] - https://gerrit.wikimedia.org/r/482612 (owner: Banyek)
[10:49:30] (CR) Filippo Giunchedi: [C: +1] systemd::syslog: allow to add the 'stop' rule when needed [puppet] - https://gerrit.wikimedia.org/r/482327 (https://phabricator.wikimedia.org/T212915) (owner: Elukey)
[10:51:31] (PS17) GTirloni: wmcs::nfs::misc - Refactor into profile/role [puppet] - https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527)
[10:52:36] (CR) GTirloni: wmcs::nfs::misc - Refactor into profile/role (2 comments) [puppet] - https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527) (owner: GTirloni)
[10:54:22] (PS1) Elukey: profile::oozie::server: add jdbc parameter [puppet] - https://gerrit.wikimedia.org/r/482617 (https://phabricator.wikimedia.org/T212256)
[10:57:59] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/14177/" [puppet] - https://gerrit.wikimedia.org/r/482617 (https://phabricator.wikimedia.org/T212256) (owner: Elukey)
[11:00:27] (CR) Gehel: [C: -1] phabricator: add phabricator module (5 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/482018 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[11:00:57] Operations, ops-eqiad, DBA: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (jcrespo) I would like to insist on this issue now that the holiday is over- while the service (parsercache) is not at the time affected, we are in a no-hw redundancy mode on eqiad, and after all...
[11:04:27] (CR) Volans: "recheck" [software/cumin] - https://gerrit.wikimedia.org/r/481914 (owner: Volans)
[11:06:32] Operations, Traffic: HTTP/2 requests fail with too-long URLs - https://phabricator.wikimedia.org/T209590 (Vgutierrez) As @ema suggests we should set the URL length limit as close to the client as possible, right now the nginx running in the cache nodes is that place. Usually retrying on a 4xx response co...
[11:07:27] Operations, Release Pipeline, Core Platform Team Backlog (Watching / External), Release-Engineering-Team (Watching / External), Services (watching): Revisit the logging work done on Q1 2017-2018 for the standard pod setup - https://phabricator.wikimedia.org/T207200 (akosiaris)
[11:11:10] (PS1) Elukey: profile::hue: allow password definition via hiera [puppet] - https://gerrit.wikimedia.org/r/482618 (https://phabricator.wikimedia.org/T212256)
[11:19:45] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/14178/" [puppet] - https://gerrit.wikimedia.org/r/482618 (https://phabricator.wikimedia.org/T212256) (owner: Elukey)
[11:21:12] Operations, Release Pipeline, Core Platform Team Backlog (Watching / External), Release-Engineering-Team (Watching / External), Services (watching): Revisit the logging work done on Q1 2017-2018 for the standard pod setup - https://phabricator.wikimedia.org/T207200 (akosiaris) >>! In T207200#...
[11:26:09] Operations, Traffic: HTTP/2 requests fail with too-long URLs - https://phabricator.wikimedia.org/T209590 (Ricordisamoa) @Vgutierrez my bad, I meant retrying with a shorter URL.
[11:30:54] (PS3) Filippo Giunchedi: Revert "Whitelist X-MediaWiki-Patrol-Status header in Swift" [puppet] - https://gerrit.wikimedia.org/r/482262 (https://phabricator.wikimedia.org/T167400) (owner: Gergő Tisza)
[11:31:59] (CR) Filippo Giunchedi: [C: +2] Revert "Whitelist X-MediaWiki-Patrol-Status header in Swift" [puppet] - https://gerrit.wikimedia.org/r/482262 (https://phabricator.wikimedia.org/T167400) (owner: Gergő Tisza)
[11:35:15] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:37:41] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:40:40] (CR) Alexandros Kosiaris: [C: +2] Add Google Translate MT config [puppet] - https://gerrit.wikimedia.org/r/471698 (https://phabricator.wikimedia.org/T90208) (owner: KartikMistry)
[11:40:49] (PS3) Alexandros Kosiaris: Add Google Translate MT config [puppet] - https://gerrit.wikimedia.org/r/471698 (https://phabricator.wikimedia.org/T90208) (owner: KartikMistry)
[11:40:52] (CR) Alexandros Kosiaris: [V: +2 C: +2] Add Google Translate MT config [puppet] - https://gerrit.wikimedia.org/r/471698 (https://phabricator.wikimedia.org/T90208) (owner: KartikMistry)
[11:46:21] RECOVERY - ensure kvm processes are running on cloudvirt1024 is OK: PROCS OK: 2 processes with regex args qemu-system-x86_64
[11:48:49] PROBLEM - ensure kvm processes are running on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64
[11:49:42] mmm
[11:53:47] (PS3) Bmansurov: Enable reader trust survey [mediawiki-config] - https://gerrit.wikimedia.org/r/476368 (https://phabricator.wikimedia.org/T209882)
[11:54:53] Operations,
Wikimedia-Logstash, Patch-For-Review: Migrate >=90% of existing Logstash traffic to the logging pipeline - https://phabricator.wikimedia.org/T205851 (fgiunchedi) Looks like group1 is done now, I checked kibana mediawiki dashboards and everything seems in order (i.e. no regressions or log...
[11:57:43] (CR) Alexandros Kosiaris: [C: -1] toolforge: Add missing php packages (2 comments) [puppet] - https://gerrit.wikimedia.org/r/482481 (owner: BryanDavis)
[12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do European Mid-day SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190107T1200).
[12:00:04] Tulsi, TBhagat, bmansurov, Jayprakash12345, Zoranzoki21, and onimisionipe: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[12:00:11] o/
[12:00:12] here
[12:00:14] I can swat today
[12:00:24] here o/
[12:00:30] I am ready.
[12:00:35] bmansurov: go ahead if you are a deployer and deploy your patch, while I get ready
[12:00:45] I can SWAT today, by the way :)
[12:00:50] zeljkof: o/ I'm not a deployer
[12:00:57] here
[12:01:08] bmansurov: I'll get to your patches then soon :)
[12:01:16] zeljkof: thanks!
[12:01:20] (PS1) GTirloni: cloudvirt1024: disable notifications [puppet] - https://gerrit.wikimedia.org/r/482623 (https://phabricator.wikimedia.org/T213071)
[12:01:40] Tulsi, TBhagat, bmansurov, Jayprakash12345, Zoranzoki21, and onimisionipe: there are more patches than time, is any patch urgent, or should I just follow the calendar?
[12:02:05] zeljkof: From my point of view, you can follow the calendar
[12:02:06] (CR) GTirloni: [C: +2] cloudvirt1024: disable notifications [puppet] - https://gerrit.wikimedia.org/r/482623 (https://phabricator.wikimedia.org/T213071) (owner: GTirloni)
[12:02:20] I'm not sure about other patches, but mine is time sensitive.
[12:02:25] !log kartik@deploy1001 Started deploy [cxserver/deploy@2d54a64]: Deploy Google Translation (T90208)
[12:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:28] T90208: Create Google translate backend for cxserver - https://phabricator.wikimedia.org/T90208
[12:02:28] following the calendar sounds good to me
[12:03:11] +1 bmansurov & Zoranzoki21
[12:03:16] bmansurov: time sensitive meaning it should be deployed during this swat?
[12:03:23] zeljkof: yes
[12:03:40] bmansurov: ok, I'll make sure it gets deployed
[12:03:46] thanks
[12:03:51] zeljkof: mine is not time sensitive
[12:04:21] TBhagat: please stand by, I'll let you know when the first patch is ready for testing at mwdebug1002, let me know if you need help with testing there
[12:04:54] @zeljkof: Sure.
[12:05:29] (CR) Zfilipin: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/481538 (https://phabricator.wikimedia.org/T212662) (owner: Tulsi Bhagat)
[12:06:53] (CR) Zfilipin: Enable 'extendedmover' user group at en.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/481538 (https://phabricator.wikimedia.org/T212662) (owner: Tulsi Bhagat)
[12:06:59] (PS3) Zfilipin: Enable 'extendedmover' user group at en.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/481538 (https://phabricator.wikimedia.org/T212662) (owner: Tulsi Bhagat)
[12:07:07] (CR) Zfilipin: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/481538 (https://phabricator.wikimedia.org/T212662) (owner: Tulsi Bhagat)
[12:07:32] !log kartik@deploy1001 Finished deploy [cxserver/deploy@2d54a64]: Deploy Google Translation (T90208) (duration: 05m 07s)
[12:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:34] T90208: Create Google translate backend for cxserver - https://phabricator.wikimedia.org/T90208
[12:08:11] (Merged) jenkins-bot: Enable 'extendedmover' user group at en.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/481538 (https://phabricator.wikimedia.org/T212662) (owner: Tulsi Bhagat)
[12:09:19] TBhagat: 481538 is at mwdebug1002, please test and let me know if I can deploy it
[12:09:27] LGTM
[12:09:34] ok, deploying
[12:09:49] Ok
[12:11:04] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:481538| Enable extendedmover user group at en.wiktionary (T212662)]] (duration: 00m 46s)
[12:11:06] !log disabled notifications for cloudvirt0124 (T212360)
[12:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:07] T212662: Request extendedmover user right at en.wiktionary - https://phabricator.wikimedia.org/T212662
[12:11:09] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:10] T212360: Create hostnames for old and new Toolforge bastions that make sense - https://phabricator.wikimedia.org/T212360
[12:11:16] TBhagat: it's deployed, please test at production
[12:11:46] It's fine.
[12:11:47] (PS3) Zfilipin: Add 'suppressredirect' user right to editor user group at pl.wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/481513 (https://phabricator.wikimedia.org/T212655) (owner: Tulsi Bhagat)
[12:12:24] (CR) Zfilipin: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/481513 (https://phabricator.wikimedia.org/T212655) (owner: Tulsi Bhagat)
[12:13:27] (Merged) jenkins-bot: Add 'suppressredirect' user right to editor user group at pl.wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/481513 (https://phabricator.wikimedia.org/T212655) (owner: Tulsi Bhagat)
[12:13:48] (PS3) Zfilipin: Enable Quiz extension on ru.wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/481512 (https://phabricator.wikimedia.org/T212622) (owner: Tulsi Bhagat)
[12:14:27] TBhagat: 481513 is at mwdebug, please test and let me know if I can deploy it
[12:14:43] LGTM, Please deploy
[12:14:49] ok, deploying
[12:15:51] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:481513|Add suppressredirect user right to editor user group at pl.wikisource (T212655)]] (duration: 00m 44s)
[12:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:54] T212655: Give editors in plwikisource the suppressredirect right - https://phabricator.wikimedia.org/T212655
[12:15:56] zeljkof: Hello, Good Evening
[12:16:06] TBhagat: it's deployed, please test
[12:16:14] Jayprakash12345: hi and good afternoon :)
[12:16:16] (CR) Gilles: [C: +1] Disable Navigation Timing on closed/private/fishbowl wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/481212 (owner: Krinkle)
[12:16:26]
Looks good. [12:16:32] (03CR) 10jenkins-bot: Enable 'extendedmover' user group at en.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481538 (https://phabricator.wikimedia.org/T212662) (owner: 10Tulsi Bhagat) [12:16:34] (03CR) 10jenkins-bot: Add 'suppressredirect' user right to editor user group at pl.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481513 (https://phabricator.wikimedia.org/T212655) (owner: 10Tulsi Bhagat) [12:17:20] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481512 (https://phabricator.wikimedia.org/T212622) (owner: 10Tulsi Bhagat) [12:18:24] (03Merged) 10jenkins-bot: Enable Quiz extension on ru.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481512 (https://phabricator.wikimedia.org/T212622) (owner: 10Tulsi Bhagat) [12:20:06] TBhagat: 481512 is at mwdebug, please test [12:20:23] LGTM [12:20:28] Please deploy [12:20:30] ok, deploying [12:21:25] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:481512|Enable Quiz extension on ru.wikibooks (T212622)]] (duration: 00m 45s) [12:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:28] T212622: Install Quiz on Russian Wikibooks - https://phabricator.wikimedia.org/T212622 [12:21:45] TBhagat: it's deployed, please test and thanks for deploying with #releng :) [12:22:08] bmansurov: please stand by, you are next [12:22:19] zeljkof: ok, I'm here [12:22:20] Jayprakash12345: please stand by, you are next soon :) [12:22:36] Thank you! It's fine. [12:22:59] zeljkof: i am ready But we can't test 482263. 
[12:23:11] (03PS4) 10Zfilipin: Enable reader trust survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476368 (https://phabricator.wikimedia.org/T209882) (owner: 10Bmansurov) [12:23:42] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476368 (https://phabricator.wikimedia.org/T209882) (owner: 10Bmansurov) [12:24:43] (03PS1) 10Zoranzoki21: Cleanup old throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482628 [12:24:47] (03Merged) 10jenkins-bot: Enable reader trust survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476368 (https://phabricator.wikimedia.org/T209882) (owner: 10Bmansurov) [12:25:12] (03PS2) 10Zoranzoki21: Cleanup old throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482628 [12:26:19] zeljkof: Jay is not here, I will take his patch, but it can be deployed directly because it is throttle rule [12:26:32] bmansurov: 476368 is at mwdebug1002, please test and let me know if I can deploy it [12:26:50] zeljkof: ok testing [12:27:00] Zoranzoki21: I am here :) [12:27:07] Jayprakash12345: ah, I see now that your patch is throttle rule :) cc Zoranzoki21 [12:27:30] Jayprakash12345: Oh, I didn`t see you [12:28:24] zeljkof: looks good, please go ahead and deploy [12:28:53] bmansurov: ok, deploying [12:29:37] (03CR) 10jenkins-bot: Enable Quiz extension on ru.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481512 (https://phabricator.wikimedia.org/T212622) (owner: 10Tulsi Bhagat) [12:29:39] (03CR) 10jenkins-bot: Enable reader trust survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476368 (https://phabricator.wikimedia.org/T209882) (owner: 10Bmansurov) [12:29:52] (03CR) 10Gehel: [C: 04-1] debmonitor: add debmonitor module (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/482299 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [12:29:55] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 
[[gerrit:476368|Enable reader trust survey (T209882)]] (duration: 00m 45s) [12:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:58] T209882: Quicksurvey for reader trust - https://phabricator.wikimedia.org/T209882 [12:30:12] bmansurov: it's deployed, please test and thanks for deploying with #releng :) [12:30:14] !log tools.zoranzoki21wiki Archived https://www.mediawiki.org/w/index.php?title=Extension:Woopra (https://www.wikidata.org/wiki/Q21679347) - T212994 [12:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:17] T212994: Archive all pages/items on wikidata.org related to MediaWiki SVN extensions/skins (tracking) - https://phabricator.wikimedia.org/T212994 [12:31:00] zeljkof: thanks! [12:31:16] (03CR) 10Gehel: [C: 04-1] debmonitor: add debmonitor module (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/482299 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [12:31:20] wrong channel :O [12:32:09] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482263 (https://phabricator.wikimedia.org/T212921) (owner: 10Jayprakash12345) [12:33:09] Zoranzoki21: happens :) [12:33:13] (03Merged) 10jenkins-bot: To lift a cap on account creation from IP for mrwiki community [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482263 (https://phabricator.wikimedia.org/T212921) (owner: 10Jayprakash12345) [12:33:30] zeljkof: Ok is everything, Jay is here :) But you can deploy this directly [12:34:38] !log zfilipin@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:482263|To lift a cap on account creation from IP for mrwiki community (T212921)]] (duration: 00m 43s) [12:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:41] T212921: To lift a cap on account creation from IP on 2019-01-11 - https://phabricator.wikimedia.org/T212921 [12:35:09] Jayprakash12345: 482263 is deployed, thanks 
for deploying with #releng :) cc Zoranzoki21 [12:35:10] (03PS3) 10Zoranzoki21: Cleanup old throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482628 [12:36:06] Zoranzoki21: please stand by, you are next [12:36:19] Oh [12:36:23] Wrong button pressed :P [12:36:32] Zoranzoki21: please stand by, you are next :) [12:36:42] Zoranzoki21: OK [12:36:47] zeljkof: OK [12:36:49] :lol: [12:37:08] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482628 (owner: 10Zoranzoki21) [12:38:11] (03Merged) 10jenkins-bot: Cleanup old throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482628 (owner: 10Zoranzoki21) [12:38:26] zeljkof: And this directly [12:38:28] (03Abandoned) 10MarcoAurelio: Clear expired throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481940 (owner: 10MarcoAurelio) [12:40:16] !log zfilipin@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:482628|Cleanup old throttle rules]] (duration: 00m 44s) [12:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:27] Zoranzoki21: 482628 is deployed [12:40:51] (03PS4) 10Zfilipin: Restrict moving categories for users at srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482583 (https://phabricator.wikimedia.org/T213050) (owner: 10Zoranzoki21) [12:41:34] zeljkof: OK [12:42:50] (03CR) 10jenkins-bot: To lift a cap on account creation from IP for mrwiki community [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482263 (https://phabricator.wikimedia.org/T212921) (owner: 10Jayprakash12345) [12:42:52] (03CR) 10jenkins-bot: Cleanup old throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482628 (owner: 10Zoranzoki21) [12:43:24] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482583 (https://phabricator.wikimedia.org/T213050) (owner: 10Zoranzoki21) [12:44:28] (03Merged) 10jenkins-bot: Restrict moving categories for users at srwiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/482583 (https://phabricator.wikimedia.org/T213050) (owner: 10Zoranzoki21) [12:45:21] Zoranzoki21: 482583 is at mwdebug1002, please test [12:45:28] zeljkof: Testing [12:46:44] zeljkof: LGTM [12:46:58] Zoranzoki21: ok, deploying [12:47:58] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:482583|Restrict moving categories for users at srwiki (T213050)]] (duration: 00m 44s) [12:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:02] T213050: Disable moving categories for users without bot, autopatrol, patrol, rollback, sysop and bureaucrat rights on srwiki - https://phabricator.wikimedia.org/T213050 [12:48:09] Zoranzoki21: it's deployed, please test [12:48:22] zeljkof: OK is [12:53:54] (03CR) 10Mobrovac: [C: 03+1] "LGTM. Side thought: it would probably be useful to have this in etcd so that it can be also used in cases where an app start logging too m" [puppet] - 10https://gerrit.wikimedia.org/r/482327 (https://phabricator.wikimedia.org/T212915) (owner: 10Elukey) [12:54:49] onimisionipe: around for swat? [12:55:00] 10Operations, 10Wikimedia-Site-requests, 10Wikimedia-maintenance-script-run: Drop FlaggedRevs rights from users at srwikinews - https://phabricator.wikimedia.org/T212058 (10Zoranzoki21) Can anyone confirm to this is correct: `mwscript emptyUserGroup.php autoreview --wiki=srwikinews` `mwscript emptyUserGroup... [12:55:08] zeljkof: yes! [12:55:35] (03CR) 10jenkins-bot: Restrict moving categories for users at srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482583 (https://phabricator.wikimedia.org/T213050) (owner: 10Zoranzoki21) [12:55:58] (03PS11) 10Zfilipin: cirrus: increase number of shards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480829 (https://phabricator.wikimedia.org/T212224) (owner: 10Mathew.onipe) [12:56:18] onimisionipe: is 480829 testable at mwdebug1002? 
[12:56:19] (03PS24) 10Gehel: elasticsearch: allow cross cluster communication [puppet] - 10https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434) (owner: 10Mathew.onipe) [12:56:32] (03PS5) 10GTirloni: Limit manifest starts (max 10) [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/479181 (https://phabricator.wikimedia.org/T107878) [12:56:53] (it's not there yet, just checking) [12:57:15] zeljkof: Honestly.. I don't think so. But I'm not sure. [12:57:23] dcausse: ^ [12:57:55] onimisionipe: no it's not testable, the only thing you can test is running mwscript eval.php --wiki enwiki and printing the config var value [12:58:02] (03CR) 10GTirloni: "It looks like CI isn't happy with `unstable` but `stretch` works fine." [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/479181 (https://phabricator.wikimedia.org/T107878) (owner: 10GTirloni) [12:58:15] (03CR) 10Gehel: [C: 03+2] elasticsearch: allow cross cluster communication [puppet] - 10https://gerrit.wikimedia.org/r/481125 (https://phabricator.wikimedia.org/T212434) (owner: 10Mathew.onipe) [12:58:23] onimisionipe, dcausse: should I just deploy it then and be ready to revert in case of trouble? :) [12:58:51] zeljkof: yes, this var is only read by maint script so nothing should go wrong [12:58:52] zeljkof: yes. I doubt there would be any trouble :) [12:59:15] if the value is not right we'll make a followup patch [12:59:16] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480829 (https://phabricator.wikimedia.org/T212224) (owner: 10Mathew.onipe) [12:59:21] thanks :) [12:59:25] zeljkof: Thanks! 
[12:59:42] onimisionipe, dcausse: ok, I'll let you know when it's deployed, in a minute or two [13:00:07] Ok [13:00:20] (03Merged) 10jenkins-bot: cirrus: increase number of shards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480829 (https://phabricator.wikimedia.org/T212224) (owner: 10Mathew.onipe) [13:01:21] dcausse, onimisionipe: new ferm rules for elastic merged, I ran puppet on elastic1034 without issue, I'll wait for puppet to do its thing on its own for the others [13:01:35] gehel: great! [13:01:41] gehel: Thanks! [13:01:47] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:480829|cirrus: increase number of shards (T212224)]] (duration: 00m 44s) [13:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:50] T212224: Reshard enwiki_general - https://phabricator.wikimedia.org/T212224 [13:02:04] onimisionipe, dcausse: it's deployed, please test and thanks for deploying with #releng :) [13:02:11] !log EU SWAT finished [13:02:11] zeljkof: thanks! 
:) [13:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:25] (03CR) 10jenkins-bot: cirrus: increase number of shards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480829 (https://phabricator.wikimedia.org/T212224) (owner: 10Mathew.onipe) [13:15:54] (03CR) 10GTirloni: [C: 03+1] ircecho: Migrate from OptionParser to ArgumentParser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/480760 (owner: 10Paladox) [13:17:05] (03PS7) 10Paladox: ircecho: Migrate from OptionParser to ArgumentParser [puppet] - 10https://gerrit.wikimedia.org/r/480760 [13:17:20] (03CR) 10Paladox: ircecho: Migrate from OptionParser to ArgumentParser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/480760 (owner: 10Paladox) [13:17:29] (03PS8) 10Paladox: ircecho: Migrate from OptionParser to ArgumentParser [puppet] - 10https://gerrit.wikimedia.org/r/480760 [13:17:38] (03PS5) 10Paladox: ircecho: Drop sysvinit support [puppet] - 10https://gerrit.wikimedia.org/r/480789 [13:46:03] (03PS1) 10GTirloni: Revert "cloudvirt1024: disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/482637 (https://phabricator.wikimedia.org/T213071) [13:46:50] (03CR) 10GTirloni: [C: 03+2] Revert "cloudvirt1024: disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/482637 (https://phabricator.wikimedia.org/T213071) (owner: 10GTirloni) [13:52:44] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:53:19] (03PS2) 10ArielGlenn: make misc dumps logging use console or console and file [dumps] - 10https://gerrit.wikimedia.org/r/481483 (https://phabricator.wikimedia.org/T212349) [13:53:50] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 80079 bytes in 0.136 second response time [13:54:59] (03CR) 10ArielGlenn: [C: 03+2] make misc dumps logging use console or console and file [dumps] - 10https://gerrit.wikimedia.org/r/481483 (https://phabricator.wikimedia.org/T212349) (owner: 
10ArielGlenn) [13:56:31] !log ariel@deploy1001 Started deploy [dumps/dumps@acd9bca]: logging and quiet mode for adds-changes and other dumps [13:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:36] !log ariel@deploy1001 Finished deploy [dumps/dumps@acd9bca]: logging and quiet mode for adds-changes and other dumps (duration: 00m 05s) [13:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:15] (03PS1) 10ArielGlenn: make addschanges dumps quieter [puppet] - 10https://gerrit.wikimedia.org/r/482639 (https://phabricator.wikimedia.org/T212349) [14:00:06] (03PS1) 10Hashar: Add new WMCS IP range to $wgRateLimitsExcludedIps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482640 (https://phabricator.wikimedia.org/T167432) [14:01:04] (03CR) 10ArielGlenn: [C: 03+2] make addschanges dumps quieter [puppet] - 10https://gerrit.wikimedia.org/r/482639 (https://phabricator.wikimedia.org/T212349) (owner: 10ArielGlenn) [14:01:57] (03PS4) 10Elukey: systemd::syslog: allow to add the 'stop' rule when needed [puppet] - 10https://gerrit.wikimedia.org/r/482327 (https://phabricator.wikimedia.org/T212915) [14:04:50] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: logstash HTTP Basic Auth prompt says "WMF Labs" - https://phabricator.wikimedia.org/T207178 (10fgiunchedi) >>! In T207178#4848376, @Tgr wrote: > Note that Chrome does not display a basic auth prompt. Maybe the same message could be put in the error show... 
[14:04:56] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: logstash HTTP Basic Auth prompt says "WMF Labs" - https://phabricator.wikimedia.org/T207178 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi [14:05:18] (03Abandoned) 10ArielGlenn: make adds-changes dump quieter [puppet] - 10https://gerrit.wikimedia.org/r/481484 (https://phabricator.wikimedia.org/T212349) (owner: 10ArielGlenn) [14:09:16] (03CR) 10Elukey: [C: 03+2] systemd::syslog: allow to add the 'stop' rule when needed [puppet] - 10https://gerrit.wikimedia.org/r/482327 (https://phabricator.wikimedia.org/T212915) (owner: 10Elukey) [14:10:03] (03PS3) 10ArielGlenn: pylint and pep8 for scripts related to media tarball creation [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/280108 [14:10:14] (03CR) 10Hashar: "Note we have some rspec test for rsyslog.conf.erb in modules/systemd/spec/defines/systemd_syslog_spec.rb :)" [puppet] - 10https://gerrit.wikimedia.org/r/482327 (https://phabricator.wikimedia.org/T212915) (owner: 10Elukey) [14:10:26] (03CR) 10jerkins-bot: [V: 04-1] pylint and pep8 for scripts related to media tarball creation [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/280108 (owner: 10ArielGlenn) [14:15:21] 10Operations: Onboarding John Bond - https://phabricator.wikimedia.org/T213079 (10MoritzMuehlenhoff) [14:18:03] (03CR) 10Elukey: [C: 03+2] "> Note we have some rspec test for rsyslog.conf.erb in" [puppet] - 10https://gerrit.wikimedia.org/r/482327 (https://phabricator.wikimedia.org/T212915) (owner: 10Elukey) [14:18:13] thanks hashar! 
didn't see those [14:18:40] (03PS1) 10Elukey: profile::analytics::systemd_timer: force syslog stop [puppet] - 10https://gerrit.wikimedia.org/r/482642 (https://phabricator.wikimedia.org/T212915) [14:19:28] !log added jbond to WMF-LDAP group in Phabricator (T213079) [14:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:31] T213079: Onboarding John Bond - https://phabricator.wikimedia.org/T213079 [14:20:34] (03CR) 10GTirloni: [C: 03+1] toolforge: profile::toolforge::toolviews::mysql_password [labs/private] - 10https://gerrit.wikimedia.org/r/482238 (https://phabricator.wikimedia.org/T87001) (owner: 10BryanDavis) [14:21:38] 10Operations: Onboarding John Bond - https://phabricator.wikimedia.org/T213079 (10jbond) [14:22:58] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14180/an-coord1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/482642 (https://phabricator.wikimedia.org/T212915) (owner: 10Elukey) [14:23:33] (03CR) 10GTirloni: [V: 03+2 C: 03+1] toolforge: profile::toolforge::toolviews::mysql_password [labs/private] - 10https://gerrit.wikimedia.org/r/482238 (https://phabricator.wikimedia.org/T87001) (owner: 10BryanDavis) [14:32:43] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: an-coord1001 almost out of disk - https://phabricator.wikimedia.org/T212915 (10elukey) All right so from now on the analytics systemd timers by default will not log into syslog/daemon.log, this should help preventing this issue again. Good... [14:40:16] (03CR) 10Zfilipin: [C: 03+1] "Should I deploy this tomorrow (Tuesday) during EU SWAT?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/482640 (https://phabricator.wikimedia.org/T167432) (owner: 10Hashar) [14:40:26] 10Operations: Onboarding John Bond - https://phabricator.wikimedia.org/T213079 (10MoritzMuehlenhoff) [14:41:56] 10Operations: Onboarding John Bond - https://phabricator.wikimedia.org/T213079 (10jbond) [14:48:30] 10Operations: Onboarding John Bond - https://phabricator.wikimedia.org/T213079 (10jbond) [14:51:21] (03PS1) 10Elukey: Introduce role::analytics_test_cluster [puppet] - 10https://gerrit.wikimedia.org/r/482645 (https://phabricator.wikimedia.org/T212256) [14:57:50] (03CR) 10Herron: "LGTM. However, a 3 day waiting period applies here. Barring any objections will move forward with this on Weds." [puppet] - 10https://gerrit.wikimedia.org/r/482483 (https://phabricator.wikimedia.org/T213015) (owner: 10Krinkle) [14:59:27] 10Operations, 10Continuous-Integration-Infrastructure, 10SRE-Access-Requests, 10Patch-For-Review, 10Performance-Team (Radar): Add krinkle to contint-docker group - https://phabricator.wikimedia.org/T213015 (10herron) Patch looks good to me. However, a 3 day waiting period applies here. Barring any obje... [15:00:58] (03PS2) 10Elukey: Introduce role::analytics_test_cluster [puppet] - 10https://gerrit.wikimedia.org/r/482645 (https://phabricator.wikimedia.org/T212256) [15:02:22] PROBLEM - puppet last run on ms-be1044 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:08:32] 10Operations, 10Recommendation-API, 10Research, 10SRE-Access-Requests, and 3 others: Add Baha as a deployer for Recommendation API - https://phabricator.wikimedia.org/T212945 (10herron) [15:12:22] (03PS1) 10MSantos: Restore settings for maps1004 [puppet] - 10https://gerrit.wikimedia.org/r/482648 (https://phabricator.wikimedia.org/T205462) [15:13:18] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - free space: / 847 MB (1% inode=56%) [15:14:15] (03PS1) 10Elukey: Decommission analytics10[39-41] from Hadoop Analytics [puppet] - 10https://gerrit.wikimedia.org/r/482649 (https://phabricator.wikimedia.org/T209929) [15:14:32] RECOVERY - Disk space on contint1001 is OK: DISK OK [15:18:44] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: an-coord1001 almost out of disk - https://phabricator.wikimedia.org/T212915 (10herron) 05Open→03Resolved a:03herron Good to close! (can always re-open if we need to follow up) Thanks @Elukey! [15:20:28] 10Operations, 10Availability, 10User-Elukey, 10Wikimedia-Incident: Nutcracker needs to automatically recover from MC failure - rebalancing issues - https://phabricator.wikimedia.org/T88730 (10elukey) 05Open→03Resolved a:03elukey Closing this since nutcracker has been replaced by mcrouter. Please re-o... 
[15:20:59] 10Operations, 10ops-eqiad, 10Analytics, 10decommission, and 2 others: Decommission analytics100[1,2] - https://phabricator.wikimedia.org/T205507 (10elukey) a:05elukey→03RobH [15:21:12] PROBLEM - HHVM rendering on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:22:16] RECOVERY - HHVM rendering on mw1282 is OK: HTTP OK: HTTP/1.1 200 OK - 80074 bytes in 0.817 second response time [15:24:11] (03PS2) 10Herron: logstash::collector add input identifier tags [puppet] - 10https://gerrit.wikimedia.org/r/480791 (https://phabricator.wikimedia.org/T205849) [15:25:49] (03CR) 10Herron: logstash::collector add input identifier tags (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/480791 (https://phabricator.wikimedia.org/T205849) (owner: 10Herron) [15:25:56] 10Operations, 10ops-eqiad, 10Analytics, 10decommission, and 2 others: Decommission analytics100[1,2] - https://phabricator.wikimedia.org/T205507 (10elukey) 05Open→03Stalled a:05RobH→03elukey Didn't realize that the task was still assigned to me, apologies :) This is a good thing though since Analy... 
[15:26:02] (03CR) 10Ottomata: [C: 03+1] [WIP] admin: allow users to be deployed without ssh keys configured (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/482275 (https://phabricator.wikimedia.org/T212949) (owner: 10Elukey) [15:26:19] (03PS7) 10Volans: phabricator: add phabricator module [software/spicerack] - 10https://gerrit.wikimedia.org/r/482018 (https://phabricator.wikimedia.org/T205884) [15:26:21] (03PS2) 10Volans: debmonitor: add debmonitor module [software/spicerack] - 10https://gerrit.wikimedia.org/r/482299 (https://phabricator.wikimedia.org/T205884) [15:26:29] (03CR) 10Ottomata: [WIP] admin: allow users to be deployed without ssh keys configured (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/482275 (https://phabricator.wikimedia.org/T212949) (owner: 10Elukey) [15:26:46] (03CR) 10Volans: phabricator: add phabricator module (036 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/482018 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [15:26:53] 10Operations, 10Continuous-Integration-Infrastructure, 10SRE-Access-Requests, 10Patch-For-Review, 10Performance-Team (Radar): Add krinkle to contint-docker group - https://phabricator.wikimedia.org/T213015 (10herron) p:05Triage→03Normal [15:26:55] (03CR) 10Volans: debmonitor: add debmonitor module (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/482299 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [15:27:10] 10Operations, 10ops-eqiad: Degraded RAID on analytics1054 - https://phabricator.wikimedia.org/T213038 (10herron) p:05Triage→03High [15:28:14] 10Operations, 10Phabricator, 10Release-Engineering-Team, 10Patch-For-Review: Convert Phabricator mail config to use cluster.mailers - https://phabricator.wikimedia.org/T212989 (10herron) p:05Triage→03Normal [15:28:28] RECOVERY - puppet last run on ms-be1044 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:30:36] 10Operations, 10serviceops, 10vm-requests, 
10Release-Engineering-Team (Watching / External): Increase mwdebugXXXX hosts CPU and memory(?) - https://phabricator.wikimedia.org/T212955 (10herron) [15:30:45] the Disk space on contint1001 error 15 minutes ago was me [15:30:45] 10Operations, 10serviceops, 10vm-requests, 10Release-Engineering-Team (Watching / External): Increase mwdebugXXXX hosts CPU and memory(?) - https://phabricator.wikimedia.org/T212955 (10herron) p:05Triage→03High [15:32:36] (03CR) 10Elukey: "Thanks a lot for the review! I've kept the module as close as possible to the current version to avoid messing it up too much, and I'd als" [puppet] - 10https://gerrit.wikimedia.org/r/482275 (https://phabricator.wikimedia.org/T212949) (owner: 10Elukey) [15:34:23] (03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM, but pep8 needs some handling first" [puppet] - 10https://gerrit.wikimedia.org/r/482611 (owner: 10Hashar) [15:38:54] (03PS2) 10Bstorm: toolforge: Add python3-venv to exec nodes [puppet] - 10https://gerrit.wikimedia.org/r/482578 (owner: 10BryanDavis) [15:39:28] 10Operations, 10Performance-Team (Radar), 10User-Elukey: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963 (10elukey) 05Open→03Resolved a:03elukey Going to close this task to open another one that tracks the upgrade to buster or stretch, this one is full of... 
[15:40:14] (03CR) 10Bstorm: [C: 03+2] toolforge: Add python3-venv to exec nodes [puppet] - 10https://gerrit.wikimedia.org/r/482578 (owner: 10BryanDavis) [15:41:36] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 30.27 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:43:11] (03PS3) 10Daimona Eaytoy: Rename globals and rights in AbuseFilter config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480074 [15:46:59] !log replacing bad fuse on the PDU rack A2 eqiad [15:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:05] (03CR) 10Filippo Giunchedi: "LGTM, though looks like at least one input doesn't have 'tags' parameter:" [puppet] - 10https://gerrit.wikimedia.org/r/480791 (https://phabricator.wikimedia.org/T205849) (owner: 10Herron) [15:47:44] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: (C)60 le (W)70 le 74.91 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:49:01] 10Operations, 10ops-eqiad, 10Analytics, 10decommission, and 2 others: Decommission analytics100[1,2] - https://phabricator.wikimedia.org/T205507 (10RobH) >>! In T205507#4859289, @elukey wrote: > Didn't realize that the task was still assigned to me, apologies :) > > Would it be feasible to keep these two h... 
[15:49:31] (03CR) 10Herron: "> LGTM, though looks like at least one input doesn't have 'tags'" [puppet] - 10https://gerrit.wikimedia.org/r/480791 (https://phabricator.wikimedia.org/T205849) (owner: 10Herron) [15:49:58] PROBLEM - Host ms-be1044 is DOWN: PING CRITICAL - Packet loss = 100% [15:50:04] PROBLEM - Host an-worker1079 is DOWN: PING CRITICAL - Packet loss = 100% [15:50:11] 10Operations, 10Performance-Team, 10User-Elukey: Upgrade memcached for Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10elukey) p:05Triage→03Normal [15:50:20] PROBLEM - Host kafka1013 is DOWN: PING CRITICAL - Packet loss = 100% [15:50:27] ouch --^ [15:50:28] PROBLEM - Host an-worker1078 is DOWN: PING CRITICAL - Packet loss = 100% [15:50:35] this is surely the last PDU that died [15:50:36] sigh [15:50:40] godog: --^ [15:50:48] OR cmjohnson1 is working on it [15:50:58] hahaha yes didn't see SAL [15:51:01] thanks Chris! [15:51:04] PROBLEM - Host ms-be1045 is DOWN: PING CRITICAL - Packet loss = 100% [15:51:04] PROBLEM - Host db1082 is DOWN: PING CRITICAL - Packet loss = 100% [15:51:20] PROBLEM - Host db1107 is DOWN: PING CRITICAL - Packet loss = 100% [15:51:28] but why the hosts are down? aren't we doing one at a time? [15:51:36] are the hosts suppose to go down in the process though? [15:51:44] exactly [15:51:55] we have 8 DB there (cc jynus too) [15:51:55] yeah I wasn't expecting any impact too [15:51:58] RECOVERY - Host db1107 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [15:52:08] RECOVERY - Host an-worker1078 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [15:52:12] RECOVERY - Host an-worker1079 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [15:52:44] is it all recovered now? [15:52:53] not sure.. 
[15:53:07] if it was power or network at this point [15:53:19] 10Operations, 10ops-eqiad, 10Analytics, 10decommission, and 2 others: Decommission analytics100[1,2] - https://phabricator.wikimedia.org/T205507 (10elukey) 05Stalled→03Open a:05elukey→03RobH Nevermind then, I can easily use only analytics1028->41, we are good to decom. Thanks! [15:53:19] still cannot ssh also on the recovered one [15:53:20] I was able to briefly login into db1107, and it rebooted [15:53:24] now I can't anymore [15:53:25] canntt be power [15:53:35] mysql do not start automatically [15:53:39] this is the rack in question, for reference: https://netbox.wikimedia.org/dcim/racks/2/ [15:53:40] PROBLEM - haproxy failover on dbproxy1004 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [15:53:44] context for this is T212861 [15:53:44] T212861: Rack A2's hosts alarm for PSU broken - https://phabricator.wikimedia.org/T212861 [15:53:48] mmm, that's bad [15:54:00] well, dbproxy is not in use [15:54:03] I replaced both fuses but now the problem still exists and some servers went down that should not have gone down [15:54:04] 1004 I mean [15:54:12] PROBLEM - Host an-worker1078 is DOWN: PING CRITICAL - Packet loss = 100% [15:54:22] PROBLEM - Host an-worker1079 is DOWN: PING CRITICAL - Packet loss = 100% [15:54:24] PROBLEM - haproxy failover on dbproxy1009 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [15:54:49] we can live without ms-be hosts for a while, do we need to depool db hosts ? 
[15:55:00] an-worker offline also I'm assuming is fine for a while [15:55:03] db1075 is S3 master [15:55:04] RECOVERY - Host an-worker1079 is UP: PING WARNING - Packet loss = 54%, RTA = 0.27 ms [15:55:04] PROBLEM - Host db1107 is DOWN: PING CRITICAL - Packet loss = 100% [15:55:06] RECOVERY - Host an-worker1078 is UP: PING WARNING - Packet loss = 64%, RTA = 0.37 ms [15:55:09] yep fine fo an-worker [15:55:22] RECOVERY - Host ms-be1044 is UP: PING WARNING - Packet loss = 44%, RTA = 0.20 ms [15:55:23] we also have m4 single host and all the others are slaves [15:55:23] I am going to depool db1082 [15:55:26] RECOVERY - Host ms-be1045 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [15:55:40] in theory, with the last mediawiki patch we should be ok [15:55:47] except for the ongoing connections [15:55:57] but better evaluate after the fact [15:56:09] yuck, its a dual wide pdu too [15:56:10] RECOVERY - Host db1082 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [15:56:13] which is non ideal for swap [15:56:18] RECOVERY - Host kafka1013 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [15:56:25] I still cannot connect to db1107 for example [15:56:28] (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1 on the technical side of this and thanks for working on it. But the comment about first needing to get bastions off trusty holds" [puppet] - 10https://gerrit.wikimedia.org/r/482118 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [15:56:31] probably rebooting still [15:56:41] db1107 is a backup host I think [15:56:50] so it should not affect live services? [15:57:22] PROBLEM - MariaDB Slave IO: s5 on db1124 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1082.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1082.eqiad.wmnet (110 Connection timed out) [15:57:36] (03CR) 10Alexandros Kosiaris: "Indeed. Managed to create a mess I see. 
I 'll post a new patch" [puppet] - 10https://gerrit.wikimedia.org/r/482046 (owner: 10Fsero) [15:57:44] (03PS1) 10Alexandros Kosiaris: Revert "Revert "Narrow down ferm etcd allow_from"" [puppet] - 10https://gerrit.wikimedia.org/r/482655 [15:58:27] can't see an increase in 5xx no [15:58:37] PROBLEM - mysqld processes on db1082 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [15:58:52] (03PS1) 10Jcrespo: mariadb: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482656 [15:59:02] I'm testing all of them with cumin [15:59:02] ok, then godog that means load balancer is working as intended :-) [15:59:05] PROBLEM - MariaDB Slave IO: s5 on db1082 is CRITICAL: CRITICAL slave_io_state could not connect [15:59:08] db1107.eqiad.wmnet still off [15:59:23] but please volans double check my patch ^anyway [15:59:25] ouch that's the el-master, didn't see it [15:59:32] jynus: sure [15:59:33] PROBLEM - MariaDB Slave SQL: s5 on db1082 is CRITICAL: CRITICAL slave_sql_state could not connect [15:59:52] PROBLEM - MariaDB read only s5 on db1082 is CRITICAL: Could not connect to localhost:3306 [15:59:55] I want mediawiki db stable first [16:00:06] then I can check the rest of the services [16:00:09] (03CR) 10Volans: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482656 (owner: 10Jcrespo) [16:00:36] (03CR) 10Jcrespo: [C: 03+2] mariadb: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482656 (owner: 10Jcrespo) [16:00:45] by pure chance I had the repo updated [16:00:50] I was just doing it [16:01:25] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] mariadb: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482656 (owner: 10Jcrespo) [16:01:26] those are the hosts that were rebooted: [16:01:27] an-worker1078.eqiad.wmnet | an-worker1079.eqiad.wmnet | cloudelastic1001.wikimedia.org | db1082.eqiad.wmnet | kafka1013.eqiad.wmnet | ms-be1044.eqiad.wmnet | ms-be1045.eqiad.wmnet [16:01:44] I stopped EL writes to db1107 
[16:01:47] and I guess db1107.eqiad.wmnet too as it's down [16:02:20] oh, is it affected? [16:02:22] PROBLEM - MariaDB Slave Lag: s5 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 796.00 seconds [16:02:26] I thought it was only the proxy [16:02:47] db1124 is not an issue at the moment, it is the sanitarium [16:02:51] it seems that the other DBs didn't reboot [16:02:52] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1082 (duration: 00m 45s) [16:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:36] I think connection errors went down (trials) [16:03:44] https://logstash.wikimedia.org/goto/f300376138ffac7526519d2ba8badd03 [16:03:53] !log stop eventlogging mysql consumers on eventlog1002 and eventlogging replication on db1108 due to issues with db1107 [16:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:57] thanks elukey [16:04:03] I was about to suggest that [16:04:12] you were fast [16:04:18] PROBLEM - Check status of defined EventLogging jobs on eventlog1002 is CRITICAL: CRITICAL: Stopped EventLogging jobs: eventlogging-consumer@mysql-m4-master-00 eventlogging-consumer@mysql-eventbus [16:04:29] 10Operations, 10Prod-Kubernetes, 10Release Pipeline, 10Release-Engineering-Team: Kubernetes Production Deployment Pipeline Documentation - https://phabricator.wikimedia.org/T213090 (10Ottomata) [16:04:37] ah yes downtime! [16:04:38] ok, I think mediawiki web is stable, even without load balancer complains [16:05:02] let me put aside EL for a second and check misc [16:05:13] (03PS2) 10Alexandros Kosiaris: Narrow down ferm etcd allow_from. 
Take #2 [puppet] - 10https://gerrit.wikimedia.org/r/482655 [16:05:40] I see db1117 up and running [16:05:43] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Needs some more work to include nicely the nodes" [puppet] - 10https://gerrit.wikimedia.org/r/482655 (owner: 10Alexandros Kosiaris) [16:06:03] oh, wrong host, I confused that with db1107 [16:06:06] jynus: it's 1107, not 17 ;) [16:06:13] I'm attached to the console [16:06:16] black screen so far [16:06:22] ok, hard reboot [16:06:23] I can exit if you want to try [16:06:27] I trust you [16:06:48] meanwhile let me downtime db1082 [16:07:07] ack [16:07:20] please tell me any update you have [16:07:23] PROBLEM - MariaDB Slave Lag: s5 on db1082 is CRITICAL: CRITICAL slave_sql_lag could not connect [16:07:25] @volans [16:07:43] !log powercycle db1107 [16:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:56] :( [16:08:00] jynus: ^^^ [16:08:00] * elukey trusts volans as well [16:08:11] monitoring the console, while reboots [16:08:17] thanks a lot [16:08:45] I will take over when os starts [16:08:48] (03CR) 10jenkins-bot: mariadb: Depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482656 (owner: 10Jcrespo) [16:09:27] kernel booting [16:09:45] jynus: at login prompt, all yours [16:09:54] RECOVERY - Host db1107 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [16:09:55] many fs errors? 
[16:09:56] (03PS16) 10Paladox: gerrit: Update PolyGerrit theme plugin to customise the header either more [puppet] - 10https://gerrit.wikimedia.org/r/482379 [16:10:44] (03PS17) 10Paladox: gerrit: Add colour to PolyGerrit header and update the theme slightly [puppet] - 10https://gerrit.wikimedia.org/r/482379 [16:10:52] jynus: not enough to notice them while it was booting, checking them now [16:11:29] I may need your help elukey for the service [16:11:44] to decide to keep the server or to failover [16:11:56] I will boot the server and see [16:12:13] XFS just logger starting/ending recovery, I don't see orphan inodes or stuff like that [16:12:24] !log starting inplace reindexing for enwiki - T212224 [16:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:27] T212224: Reshard enwiki_general - https://phabricator.wikimedia.org/T212224 [16:12:29] volans: cool [16:12:43] PROBLEM - mysqld processes on db1107 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [16:13:08] jynus: so atm nothing is pushing to db1107, I'd prefer to resurrect it rather than failing over [16:13:11] thanks :) [16:13:14] Indeed [16:13:27] just pinging you should the worst case scenario [16:13:40] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2047 - https://phabricator.wikimedia.org/T212966 (10Papaul) a:05Papaul→03Marostegui Disk replacement complete [16:13:46] in theory the raid cache should make it trivial [16:13:50] nothing horrible can happen if Riccardo is around [16:13:51] :D [16:13:53] btw mysql was logging failed event due to missing user, just FYI [16:13:55] but that is only a theory [16:13:55] lol [16:14:31] a few SATA link down on boot :-/ [16:15:03] 10Operations, 10ops-eqiad, 10Analytics: Rack A2's hosts alarm for PSU broken - https://phabricator.wikimedia.org/T212861 (10Cmjohnson) I replaced the fuse on the wrong side initially and caused an outage. 
I then replaced the fuses on the correct phase and the power was not restored, I tried replacing them... [16:15:40] !log starting mariadb on db1107 [16:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:09] innodb booted nicely [16:16:14] but we will see toku [16:16:27] RECOVERY - mysqld processes on db1107 is OK: PROCS OK: 1 process with command name mysqld [16:17:06] I am doing a table check [16:17:27] no errors or recovery, which is good but strange [16:18:06] elukey: log into the mysql service and see if you see something strange, other than what volans commented [16:18:20] ack, thanks a lot [16:20:18] (checking quickly the an-workers) [16:20:48] Checking db1124 on s5 [16:21:00] banyek: that is db1082 being down [16:21:10] db1124 should be fine [16:21:28] oh, ok [16:21:34] lag, but that is the least of the concerns [16:21:47] shouldn't even page [16:21:48] (03CR) 10Gehel: [C: 03+1] phabricator: add phabricator module (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/482018 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [16:22:42] 10Operations, 10Prod-Kubernetes, 10Release Pipeline, 10Release-Engineering-Team, 10Documentation: Kubernetes Production Deployment Pipeline Documentation - https://phabricator.wikimedia.org/T213090 (10Aklapper) [16:22:49] elukey: from my side, things are ok [16:23:04] leaving to you the decision of reenabling the writes [16:23:08] (03CR) 10Thcipriani: "change looks good with the new UI. In the old UI this creates a white logo on a white background." [puppet] - 10https://gerrit.wikimedia.org/r/482379 (owner: 10Paladox) [16:23:31] I would however do a reboot [16:23:32] let's make sure that any follow up work on those PDUs is synchronized with the service owners of the hosts [16:23:35] to upgrade [16:23:46] and make sure it boots cleanly again [16:23:49] RECOVERY - Long running screen/tmux on an-coord1001 is OK: OK: No SCREEN or tmux processes detected. 
[16:23:51] +1 [16:23:59] elukey: let me know when I can do that [16:23:59] (for the reboot idea) [16:24:04] anytime jynus [16:24:09] ok, doing [16:24:27] !log shutting down mariadb again and rebooting db1107 [16:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:32] (03CR) 10Paladox: "@Thicipraini, did you also copy the GerritSite.css change too?" [puppet] - 10https://gerrit.wikimedia.org/r/482379 (owner: 10Paladox) [16:24:49] (03PS1) 10Vgutierrez: site: Keep lvs2010 as a spare system while T203194 is not solved [puppet] - 10https://gerrit.wikimedia.org/r/482658 (https://phabricator.wikimedia.org/T203194) [16:24:51] (03CR) 10Jforrester: [C: 03+1] "Oh, thanks, I had this locally but hadn't pushed it yet. Good to land whenever." [puppet] - 10https://gerrit.wikimedia.org/r/482492 (https://phabricator.wikimedia.org/T187716) (owner: 10MaxSem) [16:26:10] (03CR) 10Vgutierrez: [C: 03+2] site: Keep lvs2010 as a spare system while T203194 is not solved [puppet] - 10https://gerrit.wikimedia.org/r/482658 (https://phabricator.wikimedia.org/T203194) (owner: 10Vgutierrez) [16:27:13] James_F: Got any more info on the s4 == commons issues? [16:27:14] 10Operations, 10Prod-Kubernetes, 10Release Pipeline, 10Release-Engineering-Team, 10Documentation: Kubernetes Production Deployment Pipeline Documentation - https://phabricator.wikimedia.org/T213090 (10thcipriani) [16:27:19] We should probably document/task them [16:27:36] (03CR) 10Gehel: [C: 03+1] debmonitor: add debmonitor module (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/482299 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [16:27:41] Reedy: Yeah, will file a task. 
[16:27:45] 10Operations, 10Prod-Kubernetes, 10Release Pipeline, 10Release-Engineering-Team, 10Documentation: Kubernetes Production Deployment Pipeline Documentation - https://phabricator.wikimedia.org/T213090 (10thcipriani) p:05Triage→03Normal [16:27:46] Cheers [16:27:54] PROBLEM - mysqld processes on db1107 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [16:28:16] ops [16:28:24] ^see log [16:28:26] 10Operations, 10Prod-Kubernetes, 10Release Pipeline, 10Release-Engineering-Team, 10Documentation: TEC3:O6:O:6.1:Q3: Deployment Pipeline Documentation - https://phabricator.wikimedia.org/T213090 (10akosiaris) [16:28:41] (03CR) 10Paladox: "I see svg referenced, also it can't find the png file." [puppet] - 10https://gerrit.wikimedia.org/r/482379 (owner: 10Paladox) [16:29:02] 10Operations, 10ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T212990 (10Cmjohnson) [16:29:24] (03CR) 10Paladox: "Oh, meh, I'm using the master branch so gwtui is no longer available." [puppet] - 10https://gerrit.wikimedia.org/r/482379 (owner: 10Paladox) [16:29:37] PROBLEM - Juniper alarms on asw-c-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [16:30:06] 10Operations, 10ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T212990 (10Cmjohnson) helium is out of warranty, I created a procurement task to purchase a replacement disk. [16:30:25] 10Operations, 10Prod-Kubernetes, 10Release Pipeline, 10Release-Engineering-Team, 10Documentation: TEC3:O6:O:6.1:Q3: Deployment Pipeline Documentation - https://phabricator.wikimedia.org/T213090 (10akosiaris) Yeah it's part of the TEC3 goal to do so (it's Outcome 6/Output 6.1 under https://www.mediawiki.... [16:31:09] PROBLEM - Check systemd state on lvs2010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[16:31:10] 10Operations, 10Icinga, 10monitoring: base::monitoring::host's alarm dashboard links are broken - https://phabricator.wikimedia.org/T213052 (10Volans) a:03Volans [16:31:17] lvs2010 is me [16:31:18] so I mixed db1107 and db1117 that is why I wanted to attend misc first [16:31:22] jynus: I try to check what's on db1082 but `journalctl -u mariadb` has nothing [16:31:34] that can wait [16:31:52] it is depooled and load balancer worked ok [16:32:01] so not worried about that [16:32:07] db1107? [16:32:14] being a replica, maybe we can just reclone it [16:32:15] RECOVERY - Check systemd state on lvs2010 is OK: OK - running: The system is fully operational [16:32:18] (about db1082) [16:32:40] I am worried about db1107 as it was the primary host [16:32:50] and it doesn't use Innodb [16:33:21] ok, db1107 is rebooting [16:33:41] anything else down, aside from 82 and its related replication? [16:34:06] dbproxy1004/9 [16:34:10] which are not in use [16:34:25] so can be reloaded on db1107 boot [16:35:12] 10Operations, 10Traffic, 10netops, 10Goal: Increase network capacity (2018-19 Q2 Goal) - https://phabricator.wikimedia.org/T207668 (10ayounsi) 05Open→03Resolved All done [16:35:22] I just realized the "SATA link down" was on different ports, so that is not a concern [16:37:18] elukey: everything should be ok and upgraded now [16:37:50] I propose 2 later actions- check the permissions that were failing that volans mentioned and send a notification to analytics [16:38:03] do you want to try to restart the writes? [16:38:50] 10Operations, 10serviceops: SRE FY2019 Q3 goal: Increase reach of deployment pipeline - https://phabricator.wikimedia.org/T212935 (10fselles) [16:38:55] jynus: ack, going to restart writes in a sec.. 
I quickly checked the perm error message and it looks like something that can be dropped, I'll open a task [16:39:13] yes, just as a followup I mean, not now [16:39:39] I kinda give you the controls now and will be around if you need me [16:39:59] RECOVERY - Check status of defined EventLogging jobs on eventlog1002 is OK: OK: All defined EventLogging jobs are runnning. [16:40:13] I will reenable the alerts on db1107 [16:40:28] sure :) just restarted writes, waiting for log entries that confirm that all is good [16:40:52] in theory a transactional system should not lose a single event but [16:41:11] we cannot account for app bad logic on bad state [16:41:13] yeah we pull from kafka and insert to mysql, we should be good [16:41:28] or tokudb, which was not *the best* in the past [16:42:06] or even the hardware cache having issues- like when the RAID controller crashes [16:42:24] banyek: wannt reload the 4 and 9 proxies? [16:42:28] *wanna [16:42:48] yep, I do it [16:42:48] (they don't roll back automatically) [16:43:17] I will ack db1124, maybe failover it to codfw later [16:43:37] * banyek does the reloading of proxies [16:43:47] thanks! [16:44:35] RECOVERY - haproxy failover on dbproxy1009 is OK: OK check_failover servers up 2 down 0 [16:44:43] RECOVERY - haproxy failover on dbproxy1004 is OK: OK check_failover servers up 2 down 0 [16:45:15] thank you elukey and volans for the help [16:46:51] jynus: yw, I hoped you could have had a more uneventful welcome back from vacations... [16:47:07] actually I am happy [16:48:23] (03CR) 10Bstorm: [C: 03+1] "> Patch Set 5:" [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/479181 (https://phabricator.wikimedia.org/T107878) (owner: 10GTirloni) [16:48:24] because the work of anomie, TimS (someone in platform) ? I am not sure who to thank, mediawiki users were not affected at all [16:48:40] that's true indeed! [16:48:44] so that is really good news [16:49:11] all good from my side! 
[16:49:31] I just realized that some el logs are broken [16:49:38] oh [16:49:46] I was tailing one without realizing that the last entry was in october [16:49:48] is is because of this, or existing issue? [16:49:49] sigh [16:49:59] ah [16:49:59] no no probably because one of the recent merges/refactoring [16:50:05] lol [16:50:06] I bet my fault :D [16:50:13] I can help with alerting [16:50:13] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Degraded RAID on analytics1054 - https://phabricator.wikimedia.org/T213038 (10Cmjohnson) The disk at slot 1 is failed, the server is out of warranty but I do have a spare 4TB SATA. cmjohnson@analytics1054:~$ sudo megacli -PDList -aALL |grep "Firmware... [16:50:36] feel free to create a ticket and I can show you some things we have for that that you can reuse [16:50:43] 10Operations, 10serviceops: SRE FY2019 Q3 goal: Increase reach of deployment pipeline - https://phabricator.wikimedia.org/T212935 (10fselles) [16:50:47] (03CR) 10Dzahn: [C: 04-2] "alright, thanks everybody, i understand. i will keep this in waiting/stalled for now to wait for the bastions" [puppet] - 10https://gerrit.wikimedia.org/r/482118 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [16:50:52] (not to fix it, to alert on missing logs) [16:51:26] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Degraded RAID on analytics1054 - https://phabricator.wikimedia.org/T213038 (10Cmjohnson) @elukey the disk still shows failed do you have to manually add it back? [16:51:35] jynus: ack! 
[16:53:53] (03PS8) 10Volans: phabricator: add phabricator module [software/spicerack] - 10https://gerrit.wikimedia.org/r/482018 (https://phabricator.wikimedia.org/T205884) [16:53:55] 10Operations, 10ops-eqiad: frdb1001 RAID controller battery failure - https://phabricator.wikimedia.org/T212556 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson This was resolved over the holiday break 12/27/2018 [16:53:56] (03PS3) 10Volans: debmonitor: add debmonitor module [software/spicerack] - 10https://gerrit.wikimedia.org/r/482299 (https://phabricator.wikimedia.org/T205884) [16:54:10] (03CR) 10Volans: phabricator: add phabricator module (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/482018 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [16:54:11] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Allow the deployment of users without SSH access - https://phabricator.wikimedia.org/T212949 (10Milimetric) p:05Normal→03High [16:55:00] (03CR) 10Bstorm: "So this is stretch only, basically? Just checking my understanding of the intent." 
[puppet] - 10https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [16:55:06] 10Operations, 10ops-eqiad, 10netops: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348 (10fgiunchedi) [16:55:37] 10Operations, 10monitoring, 10Graphite, 10MW-1.27-release (WMF-deploy-2016-04-12_(1.27.0-wmf.21)), 10MW-1.27-release-notes: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141 (10Milimetric) [16:55:39] 10Operations, 10Analytics, 10EventBus, 10Core Platform Team Backlog (Watching / External), 10Services (watching): eventbus should send statsd in batches - https://phabricator.wikimedia.org/T141524 (10Milimetric) 05Open→03Declined won't fix this because we're working on a new implementation [16:56:14] cmjohnson1: o/ - sorry I didn't get the question about analytics1054 [16:56:25] 10Operations, 10Analytics, 10Analytics-Cluster, 10DBA: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (10Milimetric) @Dzahn wikimetrics is going to be sunset this quarter, so you won't have to worry about that any more. [16:56:56] 10Operations, 10fundraising-tech-ops, 10netops: Refresh Minfraud IP list - https://phabricator.wikimedia.org/T213100 (10cwdent) [16:57:07] 10Operations, 10Analytics, 10EventBus, 10Core Platform Team Backlog (Watching / External), 10Services (watching): eventbus should send statsd in batches - https://phabricator.wikimedia.org/T141524 (10Pchelolo) And the new implementation is based on #service-runner which batch stats by default. 
[16:57:12] (03CR) 10GTirloni: "> Patch Set 17:" [puppet] - 10https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [16:57:18] 10Operations, 10Analytics, 10Analytics-Cluster, 10DBA: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (10Ottomata) OK FINE I'LL DO IT [16:57:29] 10Operations, 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban, 10DBA: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (10Milimetric) a:03Ottomata [16:57:37] 10Operations, 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban, 10DBA: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (10Milimetric) p:05Normal→03High [16:58:02] (03PS1) 10Vgutierrez: tlsproxy: Set http2_max_field_size to 8k [puppet] - 10https://gerrit.wikimedia.org/r/482666 (https://phabricator.wikimedia.org/T209590) [16:58:19] 10Operations, 10ops-eqiad, 10netops: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348 (10fgiunchedi) re: ms-fe hosts, if possible please do not co-locate onto the same rack in row A. Less stringent but the more ms-be hosts are spread out the better. [16:58:57] RECOVERY - Device not healthy -SMART- on db2047 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2047&var-datasource=codfw%2520prometheus%252Fops [17:00:29] PROBLEM - HHVM rendering on mw1317 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [17:00:29] PROBLEM - Nginx local proxy to apache on mw1317 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.008 second response time [17:00:51] PROBLEM - Apache HTTP on mw1317 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time [17:01:37] RECOVERY - HHVM rendering on mw1317 is OK: HTTP OK: HTTP/1.1 200 OK - 80195 bytes in 0.194 second response time [17:01:37] RECOVERY - Nginx local proxy to apache on mw1317 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.071 second response time [17:01:59] RECOVERY - Apache HTTP on mw1317 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.053 second response time [17:03:31] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Degraded RAID on analytics1054 - https://phabricator.wikimedia.org/T213038 (10elukey) >>! In T213038#4859708, @Cmjohnson wrote: > @elukey the disk still shows failed do you have to manually add it back? Sorry Chris didn't get the question - do you mea... 
[17:08:01] (03PS2) 10Gehel: Restore settings for maps1004 [puppet] - 10https://gerrit.wikimedia.org/r/482648 (https://phabricator.wikimedia.org/T205462) (owner: 10MSantos) [17:09:47] 10Operations, 10fundraising-tech-ops, 10netops: Refresh Minfraud IP list - https://phabricator.wikimedia.org/T213100 (10cwdent) iptables change e6acebc new minfraud ip range deployed [17:10:23] (03CR) 10Gehel: [C: 03+2] Restore settings for maps1004 [puppet] - 10https://gerrit.wikimedia.org/r/482648 (https://phabricator.wikimedia.org/T205462) (owner: 10MSantos) [17:11:46] mateusbs17|brb: ^ [17:12:47] gehel:👍 [17:14:37] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Degraded RAID on analytics1054 - https://phabricator.wikimedia.org/T213038 (10elukey) a:03Cmjohnson [17:15:26] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Degraded RAID on analytics1054 - https://phabricator.wikimedia.org/T213038 (10Cmjohnson) @elukey sorry, i replaced the disk and it is still showing failed, I don't know if the disk needs to be manually added back to the array? [17:16:21] (03CR) 10Gehel: "I'm not against merging this if it is an issue in the short term. But cleaning up cumin itself seems like a better solution. There is alre" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/481858 (https://phabricator.wikimedia.org/T212783) (owner: 10Volans) [17:19:05] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:19:27] !log kartik@deploy1001 Started deploy [cxserver/deploy@594420b]: Update cxserver to 7632c43 [17:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:01] 10Operations, 10ops-eqiad, 10Analytics: Rack A2's hosts alarm for PSU broken - https://phabricator.wikimedia.org/T212861 (10RobH) >>! In T212861#4859499, @Cmjohnson wrote: > 1. Do we want to leave these servers with non-redundant power until we can replace the PDU with a new one that should be ordered soon?... 
[17:22:06] (03PS1) 10Jforrester: Move testcommonswiki from s4 to s3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482670 [17:22:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Update label and switch to rename labvirt1013 to cloudvirt1013 - https://phabricator.wikimedia.org/T212522 (10Cmjohnson) 05Open→03Resolved updated [17:22:56] (03CR) 10jerkins-bot: [V: 04-1] Move testcommonswiki from s4 to s3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482670 (owner: 10Jforrester) [17:23:33] !log kartik@deploy1001 Finished deploy [cxserver/deploy@594420b]: Update cxserver to 7632c43 (duration: 04m 06s) [17:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:47] 10Operations, 10Prod-Kubernetes, 10Release Pipeline, 10Release-Engineering-Team, 10Documentation: TEC3:O6:O:6.1:Q3: Deployment Pipeline Documentation - https://phabricator.wikimedia.org/T213090 (10Ottomata) Awesome, those are good starts, I'll check those out thank you! [17:26:14] 10Operations, 10ops-eqiad, 10DC-Ops: Update label and switch to rename labvirt1014 to cloudvirt1014 - https://phabricator.wikimedia.org/T210927 (10Cmjohnson) 05Open→03Resolved Done [17:28:47] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 2.924 second response time [17:29:12] (03CR) 10Reedy: "Jaime said on IRC that there's probably no point bothering doing this (but he will hold us to deleting/removing/cleaning up the wiki when " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482670 (owner: 10Jforrester) [17:29:32] Reedy: S4 is hard-sharded, so we probably need to do it anyway. [17:29:52] Reedy: Try to make an edit. 
[17:29:57] [16:20:33] as I said, is it has an expery date, I prefer to not touch it [17:29:57] [16:20:39] *as [17:29:57] [16:22:30] I prefer to actually make sure it is deleted at a later time, rather than make you work more on that [17:30:55] Error: 1146 Table 'testcommonswiki.blobs_cluster25' doesn't exist (10.64.32.65) [17:31:08] Just need to create those tables [17:31:25] I guess it's just fallout from the replag errors we experienced [17:31:41] Reedy: _clusterNN changes depending on the page you try to edit. [17:32:26] Probably.. But if the script had ran properly... Would have all the required tables on the relevant clusters have been created? [17:32:37] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:34:30] Reedy: S4 is likely manually coded, whereas S3 Just Works™. [17:35:36] !log restart pdfrender [17:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:41] !log restart pdfrender on scb1004 [17:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:07] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [17:36:45] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T212010 (10Cmjohnson) Sodium does not have any failed disks. One of the disks is listed as a hotspare. cmjohnson@sodium:~$ sudo megacli -PDList -aALL |grep "Firmware state" Firmware state: Online, Spun Up Firmware state: Ho... [17:37:09] James_F: Both give the same two values for $wgDefaultExternalStore [17:37:26] (same as testwiki too! :o) [17:38:02] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Cmjohnson) @jcrespo An email was sent to Dell requesting a new board. I have not received a response [17:38:49] Reedy: Hmm. 
[17:42:13] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack and configure frav1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T213104 (10Jgreen) [17:50:21] 10Puppet, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: ORES services should bind to ores config files - https://phabricator.wikimedia.org/T210719 (10Halfak) @akosiaris what do you think about this strategy? [17:52:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudelastic1004: SMART/disk error - https://phabricator.wikimedia.org/T209029 (10Cmjohnson) a ticket has been opened with Dell You have successfully submitted request SR984761946. [17:53:43] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Degraded RAID on analytics1054 - https://phabricator.wikimedia.org/T213038 (10elukey) @Cmjohnson so I got a different than usual output from: ` elukey@analytics1054:~$ sudo megacli -PDList -aAll | grep Firm Firmware state: Online, Spun Up Device Firmw... [17:55:36] 10Operations, 10CirrusSearch, 10Discovery-Search, 10serviceops: Find an alternative to HHVM curl connection pooling for PHP 7 - https://phabricator.wikimedia.org/T210717 (10EBernhardson) Even more generally, it we install a reverse proxy for local TLS connection pooling on the application servers, does med... [17:56:21] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack and cable frav1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T213104 (10Jgreen) [17:56:27] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack and cable frav1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T213104 (10RobH) @jgreen: Can you advise what network ports need to be attached, and to what switch? [17:57:37] RECOVERY - Device not healthy -SMART- on helium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=helium&var-datasource=eqiad%2520prometheus%252Fops [18:00:04] gehel and onimisionipe: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikidata Query Service weekly deploy . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190107T1800). [18:00:04] Smalyshev: A patch you scheduled for Wikidata Query Service weekly deploy is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:50] here here [18:01:03] cool [18:01:09] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack and cable frav1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T213104 (10Jgreen) [18:01:15] is gehel around for https://gerrit.wikimedia.org/r/c/operations/puppet/+/480894 maybe? [18:02:01] we can deploy without it for now if he's not [18:02:48] (03CR) 10Alexandros Kosiaris: [C: 04-1] ores: Notify ores services when the config changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/476850 (https://phabricator.wikimedia.org/T210719) (owner: 10Ladsgroup) [18:05:17] !log deactivate bgp sessions to Zayo on T212791 [18:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:20] T212791: Interface errors on cr1-eqiad:xe-3/3/1 - https://phabricator.wikimedia.org/T212791 [18:06:58] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@d8f911c]: (no justification provided) [18:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:02] !log onimisionipe@deploy1001 deploy aborted: (no justification provided) (duration: 00m 04s) [18:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:02] 10Puppet, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: ORES services should bind to ores config files - https://phabricator.wikimedia.org/T210719 (10akosiaris) I 've 
left comments in the change, but to answer the question of the strategy, keep in mind this moves orchestratio... [18:10:03] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@d8f911c]: new GUI, Updater & Blazegraph build [18:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:12] !log manually creating tables on es1015, es1017 with replication for testcommonswiki [18:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:56] (03CR) 10Gergő Tisza: "There is" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479570 (https://phabricator.wikimedia.org/T208246) (owner: 10Jforrester) [18:12:59] James_F: ^ es blob tables created [18:13:43] PROBLEM - MariaDB Slave SQL: es2 on es2014 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1005, Errmsg: Error Cant create table testcommonswiki.blobs_cluster24 (errno: 140 Wrong create options) on query. Default database: testcommonswiki. [Query snipped] [18:14:08] oh, that is bad [18:14:24] at least it is only codfw [18:15:04] (I hope) [18:15:26] ooh [18:15:41] 10Operations, 10ops-eqiad, 10netops: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348 (10Cmjohnson) I will need to create space in the 10G racks to make this work and some juggling will be required. I need to move several servers out of rack A4 to make room for (3)ms-be servers, c... 
[18:17:06] interesting [18:17:10] that is an infra bug [18:17:19] it has a key block size [18:17:34] maybe from the time it used to be tokudb or something [18:17:37] and the creation failed [18:18:18] at least it was caught without many issues [18:18:40] !log activate bgp sessions to Zayo on cr1-eqiad - T212791 [18:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:43] T212791: Interface errors on cr1-eqiad:xe-3/3/1 - https://phabricator.wikimedia.org/T212791 [18:19:24] SMalyshev: I'm here, looking [18:19:32] Reedy: fixed [18:19:40] gehel: thanks [18:19:51] RECOVERY - MariaDB Slave SQL: es2 on es2014 is OK: OK slave_sql_state Slave_SQL_Running: Yes [18:19:53] I should have just run the create table rather than a create table like [18:19:54] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T212010 (10Dzahn) Though the nagios plugin calls this "degraded". @Volans Is this maybe a bug in the check script? ` [sodium:~] $ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli === RaidStatus (does not inclu... [18:19:59] * gehel had to check on the kids [18:20:17] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@d8f911c]: new GUI, Updater & Blazegraph build (duration: 10m 13s) [18:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:05] 10Operations, 10ops-eqiad: Interface errors on cr1-eqiad:xe-3/3/1 - https://phabricator.wikimedia.org/T212791 (10Cmjohnson) Replaced the optics @ayounsi please resolve once confirmed all is well. [18:22:07] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6461/IPv4: Connect, AS6461/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:22:55] 10Operations, 10ops-eqiad, 10Analytics: Rack A2's hosts alarm for PSU broken - https://phabricator.wikimedia.org/T212861 (10jcrespo) I am creating a subtask to fix db1082, which may have to be reimaged because of the power loss. 
[18:23:38] 10Operations, 10ops-eqiad, 10Analytics: Rack A2's hosts alarm for PSU broken - https://phabricator.wikimedia.org/T212861 (10jcrespo) ^CC @Marostegui so you know why db1082 + db1124 + labsdb replication (s5) are broken or stopped [18:24:15] (03Abandoned) 10Jforrester: Move testcommonswiki from s4 to s3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482670 (owner: 10Jforrester) [18:24:19] SMalyshev, onimisionipe: I'm running puppet compiler on that change and I'll deploy [18:24:28] onimisionipe: where are you with the code deploy? [18:25:01] gehel: Alright. I just restarted wdqs-updater now. Its completed [18:26:07] (03PS4) 10Gehel: Add kafka reporting topic to Puppet config [puppet] - 10https://gerrit.wikimedia.org/r/480894 (owner: 10Smalyshev) [18:27:07] Reedy: BTW, see my q on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/482139 [18:27:55] (03CR) 10Reedy: "From https://phabricator.wikimedia.org/T197616" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482139 (https://phabricator.wikimedia.org/T197616) (owner: 10Reedy) [18:28:06] James_F: original task desc is apparently massively out of date [18:28:10] (03CR) 10Bstorm: "> Patch Set 17:" [puppet] - 10https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527) (owner: 10GTirloni) [18:28:38] Reedy: Oh, yes, sorry. [18:28:42] heh [18:28:53] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T212010 (10Volans) @Dzahn it's reported as degraded by megacli: `lang=bash $ sudo /usr/sbin/megacli -LdPdInfo -aAll -NoLog Adapter #0 Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name : RAID Le... 
[18:28:54] (03Abandoned) 10Jforrester: Set $wgMultiContentRevisionSchemaMigrationStage = SCHEMA_COMPAT_WRITE_BOTH | SCHEMA_COMPAT_READ_OLD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482139 (https://phabricator.wikimedia.org/T197616) (owner: 10Reedy) [18:30:56] 10Operations, 10Kubernetes: set up a test node with new version, Redis as cache, a new Swift container and export metrics over graphana - https://phabricator.wikimedia.org/T210076 (10fselles) [18:31:12] (03PS1) 10Fsero: Initial docker::registry::ha puppetization. [puppet] - 10https://gerrit.wikimedia.org/r/482675 (https://phabricator.wikimedia.org/T210076) [18:32:00] (03CR) 10jerkins-bot: [V: 04-1] Initial docker::registry::ha puppetization. [puppet] - 10https://gerrit.wikimedia.org/r/482675 (https://phabricator.wikimedia.org/T210076) (owner: 10Fsero) [18:32:38] (03CR) 10Gehel: [C: 03+2] Add kafka reporting topic to Puppet config [puppet] - 10https://gerrit.wikimedia.org/r/480894 (owner: 10Smalyshev) [18:33:10] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Allow the deployment of users without SSH access - https://phabricator.wikimedia.org/T212949 (10Ottomata) [18:33:21] SMalyshev, onimisionipe ^ [18:33:36] gehel: thanks! [18:33:47] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Allow the deployment of users without SSH access - https://phabricator.wikimedia.org/T212949 (10Ottomata) [18:35:59] cmjohnson1: o/ [18:36:10] 10Operations, 10DBA, 10Data-Services: db1082 power loss resulted on mysql crash - https://phabricator.wikimedia.org/T213108 (10jcrespo) p:05Triage→03High [18:36:10] do you have a min? [18:36:37] elukey I may if it's something simple [18:36:47] as you can imagine I have a backlog of tasks [18:37:06] 10Operations, 10DBA, 10Data-Services: db1082 power loss resulted on mysql crash - https://phabricator.wikimedia.org/T213108 (10jcrespo) a:05Cmjohnson→03jcrespo I plan to take care of this tomorrow morning. 
[18:37:53] cmjohnson1: ah yes sure - just wanted to say that the disk on an1054 shows up with Firmware state: failed, not sure if you saw it in the past (maybe disk broken or similar) [18:38:03] otherwise I can try something else [18:39:51] Reedy: / James_F this is weird https://test-commons.wikimedia.org/w/index.php?title=User_talk:MarcoAurelio&action=history -- I'm still logged-out and the bot already welcomed me... [18:40:45] Hauskatze: You're in https://test-commons.wikimedia.org/wiki/Special:ListUsers [18:40:49] So something "registered" you [18:41:05] centralauth, but should've logged me in too [18:41:12] I'm up [18:41:13] @elukey okay! Thanks, we will need to order 4TB SATA disks . I don't have any more spares [18:41:29] Hauskatze: Did you refresh? [18:41:36] several times [18:41:42] Odd. [18:42:25] My session is now working after I hit the log-in link, which made the page reload [18:43:04] cmjohnson1: ack thank you! [18:46:26] (03CR) 10Jdlrobson: [C: 03+1] Turn off main page special casing for svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482516 (https://phabricator.wikimedia.org/T213018) (owner: 10Zoranzoki21) [18:48:26] (03PS1) 10Jforrester: TestCommons: Add +importupload for sysops for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482677 [18:48:53] (03CR) 10Dzahn: "i am more than happy to merge design changes that have consensus / agreement from service owners and users, but i can't discuss the design" [puppet] - 10https://gerrit.wikimedia.org/r/482379 (owner: 10Paladox) [18:48:55] and disable newusermessage please [18:49:25] Hauskatze: Firefox? [18:49:33] me? no [18:49:42] Hmm. Very surprising then. 
[18:50:00] James_F: it's also surprising to see locked accounts in the new user log [18:50:07] 10Operations, 10ORES, 10Scoring-platform-team, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Backlog): The continuous release pipeline should support more than one service per repo - https://phabricator.wikimedia.org/T210267 (10thcipriani) [18:50:09] and users that have been inactive for ages [18:50:12] (03PS1) 10ArielGlenn: fix tox for obsolete branch at python 2.7 [dumps] (obsolete) - 10https://gerrit.wikimedia.org/r/482678 [18:50:14] like Mike.lifeguard [18:50:28] (03CR) 10jerkins-bot: [V: 04-1] fix tox for obsolete branch at python 2.7 [dumps] (obsolete) - 10https://gerrit.wikimedia.org/r/482678 (owner: 10ArielGlenn) [18:50:59] Hauskatze: It's a test wiki. I'm importing content. What do you think happens? :-) [18:51:15] it is importing users as well? [18:51:22] Importing is… messy. [18:51:37] that explains [18:51:51] !log re-deactivate bgp sessions to Zayo on cr1-eqiad - T212791 [18:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:54] T212791: Interface errors on cr1-eqiad:xe-3/3/1 - https://phabricator.wikimedia.org/T212791 [18:52:20] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Degraded RAID on analytics1054 - https://phabricator.wikimedia.org/T213038 (10elukey) ` Enclosure Device ID: 32 Slot Number: 1 Drive's position: DiskGroup: 2, Span: 0, Arm: 0 Enclosure position: 1 Device Id: 1 WWN: 500003964b700233 Sequence Number: 3 M... 
[18:52:45] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 30, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:54:31] (03PS2) 10Jforrester: Enable WBMI on test Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482048 [18:56:19] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:57:31] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:58:59] (03PS1) 10MarcoAurelio: Disable NewUserMessage on testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482679 [19:00:04] Deploy window Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190107T1900) [19:00:04] takidelfin, Zoranzoki21, tgr, and RoanKattouw: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:23] I'm here, and I can do the SWAT [19:00:53] I'm moving my patch to the security window [19:01:07] it needs a lot of testing and the SWAT window is quite full [19:01:20] (03CR) 10Jforrester: [C: 04-1] "We want it to be as close to prod. Commons as possible. Why do you think we should do this?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/482679 (owner: 10MarcoAurelio) [19:01:24] in that case I am adding a namespace change patch [19:01:47] PROBLEM - HHVM rendering on mw1288 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time [19:01:55] PROBLEM - Nginx local proxy to apache on mw1288 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.010 second response time [19:01:57] PROBLEM - Apache HTTP on mw1288 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time [19:02:46] (03CR) 10MarcoAurelio: "Because it is an annoying feature that is spamming broken templates in a test wiki for no good reason :) Disabling this has no effects on " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482679 (owner: 10MarcoAurelio) [19:02:56] 10Operations, 10ops-eqiad: Interface errors on cr1-eqiad:xe-3/3/1 - https://phabricator.wikimedia.org/T212791 (10Cmjohnson) After the initial optics swap, the link was still not working. I proceeded to swap the optics again (no change) I replaced the patch cable again (no change) replaced the optics one more... 
[19:02:59] RECOVERY - HHVM rendering on mw1288 is OK: HTTP OK: HTTP/1.1 200 OK - 80101 bytes in 0.278 second response time [19:03:09] RECOVERY - Nginx local proxy to apache on mw1288 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.049 second response time [19:03:09] RECOVERY - Apache HTTP on mw1288 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.044 second response time [19:03:11] (03PS2) 10Catrope: Enable Flow beta feature on viwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482376 (https://phabricator.wikimedia.org/T212929) [19:03:16] (03CR) 10Catrope: [C: 03+2] Enable Flow beta feature on viwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482376 (https://phabricator.wikimedia.org/T212929) (owner: 10Catrope) [19:03:18] (03PS2) 10MarcoAurelio: Define $wgMetaNamespace for be.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481520 (https://phabricator.wikimedia.org/T212665) [19:03:39] (03CR) 10VolkerE: [C: 04-1] "I don't think that's a good general direction, even though the colors within the header work well together:" [puppet] - 10https://gerrit.wikimedia.org/r/482379 (owner: 10Paladox) [19:04:27] (03Merged) 10jenkins-bot: Enable Flow beta feature on viwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482376 (https://phabricator.wikimedia.org/T212929) (owner: 10Catrope) [19:04:31] (03CR) 10Jforrester: [C: 04-1] "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482679 (owner: 10MarcoAurelio) [19:04:59] (03PS2) 10ArielGlenn: fix tox for obsolete branch at python 2.7 [dumps] (obsolete) - 10https://gerrit.wikimedia.org/r/482678 [19:05:07] RoanKattouw: Can you do mine? The bot didn't notice them. [19:05:28] Will do, after mine [19:05:32] Ta. [19:06:19] (03CR) 10Paladox: "@Voker we use a different logo to the movement." [puppet] - 10https://gerrit.wikimedia.org/r/482379 (owner: 10Paladox) [19:06:48] I've added a patch after tgr's removal of his. 
[19:07:18] 10Operations, 10ORES, 10Scoring-platform-team, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Backlog): The continuous release pipeline should support more than one service per repo - https://phabricator.wikimedia.org/T210267 (10Ottomata) I'll need multiple service deployments for the same rep... [19:07:35] (03PS1) 10Volans: icinga: fix URLs to dashboard links [puppet] - 10https://gerrit.wikimedia.org/r/482681 (https://phabricator.wikimedia.org/T213052) [19:07:56] (03PS1) 10Catrope: Also add viwikisource to flow.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482682 (https://phabricator.wikimedia.org/T212929) [19:08:04] (03CR) 10ArielGlenn: [C: 03+2] fix tox for obsolete branch at python 2.7 [dumps] (obsolete) - 10https://gerrit.wikimedia.org/r/482678 (owner: 10ArielGlenn) [19:08:17] (03PS2) 10Catrope: Also add viwikisource to flow.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482682 (https://phabricator.wikimedia.org/T212929) [19:08:28] (03CR) 10Catrope: [C: 03+2] Also add viwikisource to flow.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482682 (https://phabricator.wikimedia.org/T212929) (owner: 10Catrope) [19:08:59] (03PS1) 10Awight: Add fu-berlin.de networks to our poolcounter whitelist [puppet] - 10https://gerrit.wikimedia.org/r/482683 (https://phabricator.wikimedia.org/T210103) [19:09:16] (03CR) 10Awight: [C: 04-1] "DNM until code support is deployed" [puppet] - 10https://gerrit.wikimedia.org/r/482683 (https://phabricator.wikimedia.org/T210103) (owner: 10Awight) [19:09:27] (03Merged) 10jenkins-bot: Also add viwikisource to flow.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482682 (https://phabricator.wikimedia.org/T212929) (owner: 10Catrope) [19:09:29] (03CR) 10jerkins-bot: [V: 04-1] Add fu-berlin.de networks to our poolcounter whitelist [puppet] - 10https://gerrit.wikimedia.org/r/482683 (https://phabricator.wikimedia.org/T210103) (owner: 10Awight) [19:09:58] (03CR) 
10MarcoAurelio: "> > Disabling this has no effects on the behaviour of commons." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482679 (owner: 10MarcoAurelio) [19:10:59] !log Ran emptyUserGroup.php for autoreview, reviewer and editor groups on srwikinews (T212058) [19:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:02] T212058: Drop FlaggedRevs rights from users at srwikinews - https://phabricator.wikimedia.org/T212058 [19:11:07] Hauskatze: You'd be surprised how random it can get. :-( [19:11:13] 10Operations, 10ops-eqiad: Interface errors on cr1-eqiad:xe-3/3/1 - https://phabricator.wikimedia.org/T212791 (10ayounsi) Email sent to Zayo NOC. [19:11:19] 10Operations, 10Wikimedia-Site-requests, 10Wikimedia-maintenance-script-run: Drop FlaggedRevs rights from users at srwikinews - https://phabricator.wikimedia.org/T212058 (10Catrope) 05Open→03Resolved a:03Catrope [19:11:26] James_F: I guess, I keep learning [19:11:46] wait what? [19:11:47] Hauskatze: I'll merge your patch after some testing, just not this morning. :-) [19:11:59] If that patch effects real commons, we have bigger problems :) [19:12:24] bawolff: Given the interaction between totally random systems that are what we call "stable" with Commons… [19:13:11] I think a better argument for not doing the patch is if you want testcommons to be as identical as possible [19:13:21] Yes, indeed. [19:13:34] That is not "a better argument", it's my argument. [19:13:35] !log catrope@deploy1001 Synchronized dblists/flow.dblist: Enable Flow on viwikisource (T212929) (duration: 00m 45s) [19:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:38] T212929: Enable StructuredDiscussions on Vietnamese Wikisource - https://phabricator.wikimedia.org/T212929 [19:13:48] James_F: It's just a proposal, and you know what you want the wiki for. Feel free not to merge if you want to keep the wiki as exact as "stable" commons is. 
[19:13:55] + :) [19:13:59] * James_F grins. [19:14:06] I appreciate that it can be annoying, however. [19:14:43] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable Flow beta feature on viwikisource (T212929) (duration: 00m 45s) [19:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:14] (03CR) 10jenkins-bot: Enable Flow beta feature on viwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482376 (https://phabricator.wikimedia.org/T212929) (owner: 10Catrope) [19:15:16] (03CR) 10jenkins-bot: Also add viwikisource to flow.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482682 (https://phabricator.wikimedia.org/T212929) (owner: 10Catrope) [19:15:27] (03CR) 10Reedy: "We can create a really annoying Template:Welcome" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482679 (owner: 10MarcoAurelio) [19:16:49] NewUserMessage exercises a couple code paths that are otherwise not used much [19:16:58] it's definitely good to have as a canary [19:16:58] (03CR) 10Awight: [C: 04-1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/482683 (https://phabricator.wikimedia.org/T210103) (owner: 10Awight) [19:17:15] Given that this is in group0, it'll get tested early (until we delete the wiki in July). [19:17:22] (03CR) 10jerkins-bot: [V: 04-1] Add fu-berlin.de networks to our poolcounter whitelist [puppet] - 10https://gerrit.wikimedia.org/r/482683 (https://phabricator.wikimedia.org/T210103) (owner: 10Awight) [19:17:33] But it only gets tested when accounts get created locally, which won't happen often. [19:17:38] So it's not that good a canary. :-( [19:18:29] you can invite spambots to register there :P [19:18:33] :lol: [19:18:39] You first. 
[19:18:43] (03PS2) 10Catrope: TestCommons: Add +importupload for sysops for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482677 (owner: 10Jforrester) [19:18:43] 10Operations, 10Performance-Team, 10monitoring, 10Graphite: Graphite generates a lot of 502 in Grafana - https://phabricator.wikimedia.org/T211747 (10Peter) Sorry back from vacation today. Yep, when I went through the dashboards today, I saw that for some of them there where only some metrics showing. For... [19:18:49] (03CR) 10Catrope: [C: 03+2] TestCommons: Add +importupload for sysops for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482677 (owner: 10Jforrester) [19:19:55] (03Merged) 10jenkins-bot: TestCommons: Add +importupload for sysops for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482677 (owner: 10Jforrester) [19:21:42] James_F: tell that to kowiki :P [19:22:04] Indeed. [19:24:21] James_F: OK, your first patch is on mwdebug1002, please test [19:25:55] RoanKattouw: LGTM. [19:28:05] (03CR) 10Catrope: [C: 03+2] Enable WBMI on test Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482048 (owner: 10Jforrester) [19:28:11] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add importupload to sysops on testcommons (duration: 00m 45s) [19:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:35] (03CR) 10jenkins-bot: TestCommons: Add +importupload for sysops for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482677 (owner: 10Jforrester) [19:29:02] (03CR) 10Gergő Tisza: "The JS changes are nice, should probably be a separate patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/482379 (owner: 10Paladox) [19:30:53] (03PS3) 10Catrope: Enable WBMI on test Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482048 (owner: 10Jforrester) [19:30:59] (03CR) 10Catrope: Enable WBMI on test Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482048 (owner: 10Jforrester) [19:31:02] (03CR) 10Catrope: [C: 03+2] Enable WBMI on test Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482048 (owner: 10Jforrester) [19:32:29] (03Merged) 10jenkins-bot: Enable WBMI on test Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482048 (owner: 10Jforrester) [19:32:53] (03PS2) 10BryanDavis: toolforge: Add missing php packages [puppet] - 10https://gerrit.wikimedia.org/r/482481 [19:33:04] (03CR) 10BryanDavis: toolforge: Add missing php packages (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/482481 (owner: 10BryanDavis) [19:33:25] RoanKattouw: Oh, hmm. Merge order for that is probably the dblist, then IS, then Wikibase.php. Sorry. [19:34:03] 10Operations, 10Prod-Kubernetes, 10Release Pipeline, 10Release-Engineering-Team, 10Documentation: TEC3:O6:O:6.1:Q3: Deployment Pipeline Documentation - https://phabricator.wikimedia.org/T213090 (10Ottomata) FYI I added some more notes for the EventGate deployment based on some IRC convos today: https://... [19:34:12] Not Wikibase before IS? [19:34:12] I'm getting CI failures but don't understand the issue: https://integration.wikimedia.org/ci/job/operations-puppet-tests-stretch-docker/3816/console [19:34:21] That's what I would have expected [19:34:31] RoanKattouw: Oh, maybe. It's just a test wiki. [19:34:49] (ah--never mind my question) [19:35:06] In any case, the whole thing is on mwdebug1002 now, please test [19:35:27] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T212010 (10Dzahn) I don't know what the intended configuration is but it says RAID5 which just needs 3 disks as a minimum. 
The thing is that no human changed anything as far as we know yet this turned into "degraded" state o... [19:36:10] (03CR) 10Cwhite: [C: 03+1] icinga: fix URLs to dashboard links [puppet] - 10https://gerrit.wikimedia.org/r/482681 (https://phabricator.wikimedia.org/T213052) (owner: 10Volans) [19:37:19] RoanKattouw: LGTM. [19:38:28] (03PS2) 10Awight: Add fu-berlin.de networks to our poolcounter whitelist [puppet] - 10https://gerrit.wikimedia.org/r/482683 (https://phabricator.wikimedia.org/T210103) [19:38:36] !log catrope@deploy1001 Synchronized dblists/wikidata.dblist: Enable Wikidata on testcommonswiki (duration: 00m 44s) [19:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:12] 10Operations, 10Wikimedia-Site-requests, 10Wikimedia-maintenance-script-run: Drop FlaggedRevs rights from users at srwikinews - https://phabricator.wikimedia.org/T212058 (10Zoranzoki21) Thanks @Catrope! Everything is ok! [19:40:39] !log catrope@deploy1001 Synchronized wmf-config/Wikibase.php: Set empty clientDbList for testcommonswiki (duration: 00m 44s) [19:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:36] (03CR) 10jenkins-bot: Enable WBMI on test Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482048 (owner: 10Jforrester) [19:41:48] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T212010 (10awight) Isn't this RAID "50" and therefore needs 6 disks minimum? 
[19:42:08] (03PS1) 10Jforrester: TestCommons: Throw the 'enable' WBMI switch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482687 [19:42:51] (03PS1) 10BryanDavis: toolforge: remove duplicate stretch python packages [puppet] - 10https://gerrit.wikimedia.org/r/482688 [19:42:56] !log push firewall change to pfw3-codfw/eqiad - T211712 [19:42:58] (03PS3) 10Catrope: Define $wgMetaNamespace for be.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481520 (https://phabricator.wikimedia.org/T212665) (owner: 10MarcoAurelio) [19:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:59] T211712: moving from krypton to grafana1001 broke fundraising dashboards - https://phabricator.wikimedia.org/T211712 [19:43:02] (03CR) 10Catrope: [C: 03+2] Define $wgMetaNamespace for be.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481520 (https://phabricator.wikimedia.org/T212665) (owner: 10MarcoAurelio) [19:43:32] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable WikibaseRepo and WikibaseMediaInfo on testcommonswiki (duration: 00m 44s) [19:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:53] RoanKattouw: Thank you. 
[19:44:13] (03Merged) 10jenkins-bot: Define $wgMetaNamespace for be.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481520 (https://phabricator.wikimedia.org/T212665) (owner: 10MarcoAurelio) [19:45:24] (03PS4) 10Ottomata: Allow pull based rsync between stat & notebook boxes only [puppet] - 10https://gerrit.wikimedia.org/r/476920 (https://phabricator.wikimedia.org/T205157) [19:45:42] (03PS1) 10ArielGlenn: get rid of subdirs named 'obsolete' [dumps] (obsolete) - 10https://gerrit.wikimedia.org/r/482689 [19:45:58] (03CR) 10jerkins-bot: [V: 04-1] get rid of subdirs named 'obsolete' [dumps] (obsolete) - 10https://gerrit.wikimedia.org/r/482689 (owner: 10ArielGlenn) [19:46:44] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Set $wgMetaNamespace for bewikibooks (T212665) (duration: 00m 45s) [19:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:47] T212665: Rename be.wikibooks namespace from "Wikibooks:XXX" to "Вікікнігі:XXX" - https://phabricator.wikimedia.org/T212665 [19:47:23] RoanKattouw: I'm sure you read it but please don't forget namespaceDupes :) [19:47:26] (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14184/" [puppet] - 10https://gerrit.wikimedia.org/r/476920 (https://phabricator.wikimedia.org/T205157) (owner: 10Ottomata) [19:48:37] (03PS5) 10Catrope: InitialiseSettings: Remove redundant namespace talks definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474372 (https://phabricator.wikimedia.org/T206952) (owner: 10Takidelfin) [19:49:04] (03CR) 10Catrope: [C: 03+2] InitialiseSettings: Remove redundant namespace talks definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474372 (https://phabricator.wikimedia.org/T206952) (owner: 10Takidelfin) [19:50:08] (03Merged) 10jenkins-bot: InitialiseSettings: Remove redundant namespace talks definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474372 
(https://phabricator.wikimedia.org/T206952) (owner: 10Takidelfin) [19:51:12] 10Operations, 10ExternalGuidance, 10Traffic: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (10dr0ptp4kt) Paraphrasing a dialogue with @BBlack immediate edge side HTTP redirects based on header/regex might be feasible without fragmenting caches/backends.... [19:51:15] (03PS2) 10ArielGlenn: get rid of subdirs named 'obsolete' [dumps] (obsolete) - 10https://gerrit.wikimedia.org/r/482689 [19:51:57] (03CR) 10ArielGlenn: [C: 03+2] get rid of subdirs named 'obsolete' [dumps] (obsolete) - 10https://gerrit.wikimedia.org/r/482689 (owner: 10ArielGlenn) [19:51:59] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Remove redundant namespace talk definitions (T206952) (duration: 00m 44s) [19:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:02] T206952: Remove redundant wgMetaNamespaceTalk definitions - https://phabricator.wikimedia.org/T206952 [19:52:11] (03Merged) 10jenkins-bot: get rid of subdirs named 'obsolete' [dumps] (obsolete) - 10https://gerrit.wikimedia.org/r/482689 (owner: 10ArielGlenn) [19:53:53] 10Operations: Onboarding John Bond - https://phabricator.wikimedia.org/T213079 (10Peachey88) [19:54:47] (03CR) 10jenkins-bot: Define $wgMetaNamespace for be.wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481520 (https://phabricator.wikimedia.org/T212665) (owner: 10MarcoAurelio) [19:54:49] (03CR) 10jenkins-bot: InitialiseSettings: Remove redundant namespace talks definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474372 (https://phabricator.wikimedia.org/T206952) (owner: 10Takidelfin) [19:57:24] (Bond, John Bond, Wikimedia operations :) ) [20:00:58] (03PS1) 10ArielGlenn: remove some one-off scripts and flow rev retriever from obsolete [dumps] (obsolete) - 10https://gerrit.wikimedia.org/r/482690 [20:03:57] (03CR) 10ArielGlenn: [C: 03+2] remove some one-off 
scripts and flow rev retriever from obsolete [dumps] (obsolete) - 10https://gerrit.wikimedia.org/r/482690 (owner: 10ArielGlenn) [20:06:54] (03PS1) 10Ottomata: Refactor mysql::config::client to mariadb::config::client [puppet] - 10https://gerrit.wikimedia.org/r/482693 (https://phabricator.wikimedia.org/T162070) [20:08:44] RoanKattouw: Are you done? [20:09:02] (03PS2) 10Jforrester: TestCommons: Throw the 'enable' WBMI switch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482687 [20:09:05] Yes [20:09:10] (03CR) 10Jforrester: [C: 03+2] TestCommons: Throw the 'enable' WBMI switch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482687 (owner: 10Jforrester) [20:09:14] Cool. I have the conch. [20:09:21] Hauskatze: yeah I ran it, just forgot to log it [20:10:19] (03Merged) 10jenkins-bot: TestCommons: Throw the 'enable' WBMI switch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482687 (owner: 10Jforrester) [20:10:31] (03PS2) 10Ottomata: Refactor mysql::config::client to mariadb::config::client [puppet] - 10https://gerrit.wikimedia.org/r/482693 (https://phabricator.wikimedia.org/T162070) [20:11:53] (03PS1) 10ArielGlenn: mwbzutils is in master and shouldn't be in the obsolete branch [dumps] (obsolete) - 10https://gerrit.wikimedia.org/r/482694 [20:12:37] (03PS3) 10Ottomata: Refactor mysql::config::client to mariadb::config::client [puppet] - 10https://gerrit.wikimedia.org/r/482693 (https://phabricator.wikimedia.org/T162070) [20:14:27] (03CR) 10ArielGlenn: [C: 03+2] mwbzutils is in master and shouldn't be in the obsolete branch [dumps] (obsolete) - 10https://gerrit.wikimedia.org/r/482694 (owner: 10ArielGlenn) [20:15:10] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1002/14187/stat1007.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/482693 (https://phabricator.wikimedia.org/T162070) (owner: 10Ottomata) [20:16:10] 10Operations, 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban, and 2 others: Cleanup or remove mysql puppet module; 
repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (Ottomata) @Dzahn https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/482693/ [20:19:47] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: TestCommons: Final go-switch for WBMI Ie52b8af006ba (duration: 00m 45s) [20:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:52] (PS1) ArielGlenn: in 'tools', inwhichfiles might conceivably be useful, rescue it [dumps] (obsolete) - https://gerrit.wikimedia.org/r/482696 [20:21:10] (CR) jenkins-bot: TestCommons: Throw the 'enable' WBMI switch [mediawiki-config] - https://gerrit.wikimedia.org/r/482687 (owner: Jforrester) [20:21:32] (CR) ArielGlenn: [C: +2] in 'tools', inwhichfiles might conceivably be useful, rescue it [dumps] (obsolete) - https://gerrit.wikimedia.org/r/482696 (owner: ArielGlenn) [20:26:32] (PS1) ArielGlenn: rescue the bz2multistream 'reader' [dumps] (obsolete) - https://gerrit.wikimedia.org/r/482699 [20:27:10] (CR) ArielGlenn: [C: +2] rescue the bz2multistream 'reader' [dumps] (obsolete) - https://gerrit.wikimedia.org/r/482699 (owner: ArielGlenn) [20:28:14] (PS1) ArielGlenn: remove the redirection file from xmlfileutils [dumps] (obsolete) - https://gerrit.wikimedia.org/r/482700 [20:28:55] (CR) ArielGlenn: [C: +2] remove the redirection file from xmlfileutils [dumps] (obsolete) - https://gerrit.wikimedia.org/r/482700 (owner: ArielGlenn) [20:53:16] Operations, Performance-Team (Radar), User-Elukey: Upgrade memcached for Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (Imarlier) [20:54:46] Operations, ORES, Scoring-platform-team, Release Pipeline (Blubber), Release-Engineering-Team (Backlog): The continuous release pipeline should support more than one service per repo - https://phabricator.wikimedia.org/T210267 (Ottomata) Q: would blubber's variants be enough to 
support the ws... [21:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190107T2100). [21:00:14] ORES is coming [21:00:19] dun dun dun [21:00:52] (CR) GTirloni: [C: +2] "I checked out usage statistics and it seems the proposed limits wouldn't cause any trouble. However, we have hosted tools like Discourse t" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/379239 (https://phabricator.wikimedia.org/T175964) (owner: Herron) [21:00:57] !log awight@deploy1001 Started deploy [ores/deploy@9253beb]: T212530: new ORES models; revscoring 2.3.0 [21:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:59] T212530: Rebuild models for new revscoring (2.3.0) - https://phabricator.wikimedia.org/T212530 [21:05:54] Operations, monitoring, Graphite, Performance-Team (Radar): Graphite generates a lot of 502 in Grafana - https://phabricator.wikimedia.org/T211747 (Imarlier) [21:06:18] !log mforns@deploy1001 Started deploy [analytics/refinery@faac592]: deploying analytics/refinery to account with refinery-source v0.0.83 [21:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:39] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:12:49] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb={GET,PATCH} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:13:10] !log mforns@deploy1001 Finished deploy [analytics/refinery@faac592]: deploying analytics/refinery to account with refinery-source v0.0.83 (duration: 06m 52s) [21:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:24] !log 
awight@deploy1001 Finished deploy [ores/deploy@9253beb]: T212530: new ORES models; revscoring 2.3.0 (duration: 15m 28s) [21:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:27] T212530: Rebuild models for new revscoring (2.3.0) - https://phabricator.wikimedia.org/T212530 [21:16:34] (PS1) ArielGlenn: remove a couple scripts now in puppet or in master [dumps] (obsolete) - https://gerrit.wikimedia.org/r/482707 [21:17:05] (CR) ArielGlenn: [C: +2] remove a couple scripts now in puppet or in master [dumps] (obsolete) - https://gerrit.wikimedia.org/r/482707 (owner: ArielGlenn) [21:17:35] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:17:43] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:19:45] !log push NAT changes to pfw3-eqiad - T211028 [21:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:24] (PS1) Cwhite: add statsd_exporter config to mathoid [deployment-charts] - https://gerrit.wikimedia.org/r/482718 (https://phabricator.wikimedia.org/T205870) [21:37:49] (CR) GTirloni: "> Patch Set 17:" [puppet] - https://gerrit.wikimedia.org/r/482051 (https://phabricator.wikimedia.org/T209527) (owner: GTirloni) [21:39:57] Operations, MediaWiki-extensions-WikibaseClient, MediaWiki-extensions-WikibaseRepository, Wikidata, and 8 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (Addshore) Open→Resolved Resolved for now, although at... 
[21:40:24] (PS1) ArielGlenn: remove stuff now in the obsolete branch [dumps] (ariel) - https://gerrit.wikimedia.org/r/482719 [21:40:41] (CR) jerkins-bot: [V: -1] remove stuff now in the obsolete branch [dumps] (ariel) - https://gerrit.wikimedia.org/r/482719 (owner: ArielGlenn) [21:41:20] Operations, Fundraising-Backlog, fundraising-tech-ops: Port fundraising stats off Ganglia - https://phabricator.wikimedia.org/T152562 (Jgreen) [21:47:07] (CR) 20after4: phabricator: Migrate mail config to cluster.mailers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/482400 (https://phabricator.wikimedia.org/T212989) (owner: Paladox) [21:47:41] (PS2) ArielGlenn: remove stuff now in the obsolete branch [dumps] (ariel) - https://gerrit.wikimedia.org/r/482719 [21:48:36] (CR) 20after4: [C: +1] Phab: Use our custom Priority field value in tooltip on Reports page [puppet] - https://gerrit.wikimedia.org/r/455271 (https://phabricator.wikimedia.org/T91428) (owner: Aklapper) [21:50:46] (CR) Paladox: phabricator: Migrate mail config to cluster.mailers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/482400 (https://phabricator.wikimedia.org/T212989) (owner: Paladox) [21:52:00] (PS3) ArielGlenn: remove stuff now in the obsolete branch [dumps] (ariel) - https://gerrit.wikimedia.org/r/482719 [21:52:29] (CR) ArielGlenn: [C: +2] remove stuff now in the obsolete branch [dumps] (ariel) - https://gerrit.wikimedia.org/r/482719 (owner: ArielGlenn) [21:54:13] (CR) 20after4: "I don't know, the dumps put a moderate amount of load on the phab box cpu and on the database slave. 
Running the dump on two servers at th" [puppet] - https://gerrit.wikimedia.org/r/437558 (https://phabricator.wikimedia.org/T196019) (owner: Dzahn) [21:54:51] Operations, WMF-Legal, Wikimedia-Mailing-lists, Privacy: Potential privacy violations in emails on mailing lists (links posted in emails to external websites which track users) - https://phabricator.wikimedia.org/T213044 (Pine) Declined→Open Hi Andre, but I believe that while it's okay an... [21:55:48] (Abandoned) ArielGlenn: pylint and pep8 for scripts related to media tarball creation [dumps] (ariel) - https://gerrit.wikimedia.org/r/280108 (owner: ArielGlenn) [21:57:54] (Abandoned) ArielGlenn: sample uwsgi app that would produce json status output for dumps [dumps/statusapi] - https://gerrit.wikimedia.org/r/335007 (https://phabricator.wikimedia.org/T147177) (owner: ArielGlenn) [22:00:04] bawolff and Reedy: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Weekly Security deployment window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190107T2200). [22:00:04] tgr: A patch you scheduled for Weekly Security deployment window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [22:01:12] Oh, damn, too late for me to fix TestCommons. [22:01:24] (PS1) Jforrester: TestCommons: Re-enable uploading of files, accidentally prevented [mediawiki-config] - https://gerrit.wikimedia.org/r/482721 [22:02:18] tgr: Are Security team people around to merge your patch? [22:02:36] no, I'll do it [22:02:41] OK. [22:02:44] needs a bunch of testing [22:02:49] just finishing up something else [22:02:59] Mind if I sling out my one right now? [22:03:05] I'll be ~ 2 mins. 
[22:04:18] sure [22:04:39] (CR) Jforrester: [C: +2] TestCommons: Re-enable uploading of files, accidentally prevented [mediawiki-config] - https://gerrit.wikimedia.org/r/482721 (owner: Jforrester) [22:05:44] (Merged) jenkins-bot: TestCommons: Re-enable uploading of files, accidentally prevented [mediawiki-config] - https://gerrit.wikimedia.org/r/482721 (owner: Jforrester) [22:06:20] (PS3) Mforns: Update analytics eventlogging_to_druid_job.pp to mirror changes in scala job [puppet] - https://gerrit.wikimedia.org/r/479847 (https://phabricator.wikimedia.org/T210099) [22:07:03] (CR) jenkins-bot: TestCommons: Re-enable uploading of files, accidentally prevented [mediawiki-config] - https://gerrit.wikimedia.org/r/482721 (owner: Jforrester) [22:08:02] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: TestCommons: Re-enable uploading of files, accidentally prevented (duration: 00m 44s) [22:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:00] tgr: All yours. [22:09:08] thx [22:09:19] (PS1) Ppchelko: Increase MW -> EventBus service HTTP request timeout. [mediawiki-config] - https://gerrit.wikimedia.org/r/482722 (https://phabricator.wikimedia.org/T204183) [22:13:36] Operations, Fundraising-Backlog, Mail, fundraising-tech-ops: Stronger DKIM key for fundraising emails? - https://phabricator.wikimedia.org/T210445 (Jgreen) [22:15:48] (PS4) Gergő Tisza: Make password policy code saner [mediawiki-config] - https://gerrit.wikimedia.org/r/481115 [22:20:37] (CR) Gergő Tisza: [C: +2] Make password policy code saner [mediawiki-config] - https://gerrit.wikimedia.org/r/481115 (owner: Gergő Tisza) [22:20:54] (CR) Gergő Tisza: [C: -2] Make password policy code saner [mediawiki-config] - https://gerrit.wikimedia.org/r/481115 (owner: Gergő Tisza) [22:21:48] (CR) Gergő Tisza: [C: -2] "Actually, let's test this via cherry-pick instead." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/481115 (owner: Gergő Tisza) [22:27:11] (CR) Ottomata: [C: +1] Increase MW -> EventBus service HTTP request timeout. [mediawiki-config] - https://gerrit.wikimedia.org/r/482722 (https://phabricator.wikimedia.org/T204183) (owner: Ppchelko) [22:27:21] (PS1) Mforns: Bump up refinery_version in refine.pp to v0.0.83 [puppet] - https://gerrit.wikimedia.org/r/482727 [22:54:31] Operations, DBA, Data-Services: db1082 power loss resulted on mysql crash - https://phabricator.wikimedia.org/T213108 (Marostegui) Maybe it is worth to start replication on db1082 (not on sanitarium), let it catch up, once it is synced compare.py it against the host you will reimage it from to make s... [22:57:09] Operations, ops-eqsin, netops: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (ayounsi) p:Triage→Normal [22:59:27] Operations, netops: Increase network capacity (2018-19 Q3 Goal) - https://phabricator.wikimedia.org/T213122 (ayounsi) p:Triage→Normal [23:00:29] Operations, ops-eqsin, netops: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (ayounsi) [23:00:31] Operations, netops: Increase network capacity (2018-19 Q3 Goal) - https://phabricator.wikimedia.org/T213122 (ayounsi) [23:00:33] Operations, ops-eqiad, netops, Patch-For-Review: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (ayounsi) [23:02:20] Operations, Gerrit, Traffic, Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (mmodell) If gerrit only shows 16x16 images, I would argue that it's not even worth doing. At that size, avatars will be difficult to distinguish and that will minimize any benefit... 
[23:12:11] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 611.81 seconds [23:21:59] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 274.93 seconds [23:26:37] (PS5) Gergő Tisza: Make password policy code saner [mediawiki-config] - https://gerrit.wikimedia.org/r/481115 [23:38:33] Operations, WMF-Legal, Wikimedia-Mailing-lists, Privacy: Potential privacy violations in emails on mailing lists (links posted in emails to external websites which track users) - https://phabricator.wikimedia.org/T213044 (Bawolff) Open→Declined Just set the mailing list to not allow html... [23:43:59] (PS3) Awight: Add fu-berlin.de networks to our poolcounter whitelist [puppet] - https://gerrit.wikimedia.org/r/482683 (https://phabricator.wikimedia.org/T210103) [23:45:45] Operations, WMF-Legal, Wikimedia-Mailing-lists, Privacy: Potential privacy violations in emails on mailing lists (links posted in emails to external websites which track users) - https://phabricator.wikimedia.org/T213044 (Bawolff) As an addendum: > Also, if creating links that use third party tr... [23:50:58] Operations, Gerrit, Traffic, Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (Paladox) Gerrit does not show 16x16, depends on where you are in the interface, the image is larger on the settings page, but only a bit smaller on the change screen. See https://g... [23:52:38] (PS6) Gergő Tisza: Make password policy code saner [mediawiki-config] - https://gerrit.wikimedia.org/r/481115 [23:58:05] PROBLEM - High lag on wdqs1007 is CRITICAL: 3661 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen