[00:00:06] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening SWAT (Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181120T0000).
[00:00:06] niedzielski: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:44] (CR) EBernhardson: Set default elasticsearch cluster name (1 comment) [puppet] - https://gerrit.wikimedia.org/r/474807 (owner: EBernhardson)
[00:01:07] o/
[00:01:42] Hello
[00:01:44] Only one patch today!
[00:01:51] That'll make me feel better about adding my own
[00:02:04] (CR) Catrope: [C: 2] Prod: increase Schema.org page split test to 100% sampling [mediawiki-config] - https://gerrit.wikimedia.org/r/473227 (https://phabricator.wikimedia.org/T208755) (owner: Niedzielski)
[00:02:25] hello!
[00:02:35] (CR) Andrew Bogott: Set default elasticsearch cluster name (1 comment) [puppet] - https://gerrit.wikimedia.org/r/474807 (owner: EBernhardson)
[00:03:10] (Merged) jenkins-bot: Prod: increase Schema.org page split test to 100% sampling [mediawiki-config] - https://gerrit.wikimedia.org/r/473227 (https://phabricator.wikimedia.org/T208755) (owner: Niedzielski)
[00:04:34] (PS2) Andrew Bogott: shinken: temporarily remove monitoring for deployment-prep [puppet] - https://gerrit.wikimedia.org/r/474758 (https://phabricator.wikimedia.org/T208101)
[00:04:36] (CR) Ayounsi: "Not sure why this has been assigned to me?" [puppet] - https://gerrit.wikimedia.org/r/474743 (https://phabricator.wikimedia.org/T209856) (owner: Alex Monk)
[00:04:39] niedzielski: Your patch is on mwdebug1002, please test?
[00:04:56] RoanKattouw: will do, thank you
[00:05:27] (CR) Alex Monk: "I was assuming Valentin would do it. Maybe he thinks only you should be touching librenms stuff?" [puppet] - https://gerrit.wikimedia.org/r/474743 (https://phabricator.wikimedia.org/T209856) (owner: Alex Monk)
[00:06:13] (CR) Andrew Bogott: [C: 2] shinken: temporarily remove monitoring for deployment-prep [puppet] - https://gerrit.wikimedia.org/r/474758 (https://phabricator.wikimedia.org/T208101) (owner: Andrew Bogott)
[00:07:39] (PS1) Alex Monk: deployment-prep: Update BounceHandler deployment-mx02 IP for migration [mediawiki-config] - https://gerrit.wikimedia.org/r/474823 (https://phabricator.wikimedia.org/T208101)
[00:07:43] i see the changes and am testing now
[00:09:55] RoanKattouw: looks good. please sync
[00:10:42] (CR) Andrew Bogott: [C: 2] deployment-prep: Update BounceHandler deployment-mx02 IP for migration [mediawiki-config] - https://gerrit.wikimedia.org/r/474823 (https://phabricator.wikimedia.org/T208101) (owner: Alex Monk)
[00:11:06] andrewbogott, wait a sec for the other deployment in progress :)
[00:12:36] (PS4) Andrew Bogott: Set default elasticsearch cluster name [puppet] - https://gerrit.wikimedia.org/r/474807 (owner: EBernhardson)
[00:13:44] (CR) Andrew Bogott: [C: 2] Set default elasticsearch cluster name [puppet] - https://gerrit.wikimedia.org/r/474807 (owner: EBernhardson)
[00:15:24] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Increase Schema.org page split test to 100% sampling (T208755) (duration: 00m 48s)
[00:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:28] T208755: Launch A/B test for sameAs property - https://phabricator.wikimedia.org/T208755
[00:16:42] niedzielski: Done
[00:17:08] RoanKattouw: cool, I'm checking it again now
[00:21:38] (PS3) Cwhite: initial commit [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/471298
(https://phabricator.wikimedia.org/T208066)
[00:21:42] (PS1) Thcipriani: Add jenkins-agent user to releases-jenkins [puppet] - https://gerrit.wikimedia.org/r/474824
[00:21:44] (PS1) Thcipriani: Install docker on releases-jenkins [puppet] - https://gerrit.wikimedia.org/r/474825 (https://phabricator.wikimedia.org/T208529)
[00:22:00] Operations, decommission, User-jijiki: Reclaim rdb2001, rdb2002 - https://phabricator.wikimedia.org/T209425 (jijiki)
[00:22:04] Operations, Patch-For-Review, User-Joe, User-jijiki: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 (jijiki)
[00:23:27] RoanKattouw: looks good to me. thank you!
[00:23:55] !log registering librenms IRC bot
[00:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:27] !log catrope@deploy1001 Synchronized php-1.33.0-wmf.4/resources/src/mediawiki.rcfilters/ui/mw.rcfilters.ui.FilterTagMultiselectWidget.js: RCFilters bug fix (T209657) (duration: 00m 47s)
[00:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:30] T209657: [wmf.4-regression] RC/Watchlist - red outline for 'Filter changes' area - https://phabricator.wikimedia.org/T209657
[00:28:06] RoanKattouw: Any chance to deploy some updated logos for a small Wiktionary project during this window?
[00:28:13] Sure, throw em at me
[00:29:25] Operations, MediaWiki-Uploading, Multimedia, Traffic, Wikimedia-Video: Uploading 1.2GB ogv results in 503 - https://phabricator.wikimedia.org/T128358 (Ijon) Hello. I seem to have run up against this problem. This seems like a serious barrier to contributing core content (media files into Co...
[00:31:39] PROBLEM - SSH cp1073.mgmt on cp1073.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:32:59] (PS1) Odder: Correct logos for the Sindhi Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/474826
[00:33:53] RoanKattouw: That's the one
[00:39:55] (PS3) CRusnov: Make the puppetdb backend process primitive types for queries. [software/cumin] - https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037)
[00:40:35] Operations, MediaWiki-Uploading, Multimedia, Traffic, Wikimedia-Video: Uploading 1.2GB ogv results in 503 - https://phabricator.wikimedia.org/T128358 (Tgr) If this really is a Pywikibot problem, it probably helps with prioritization if the relevant project is tagged. At a glance though pywi...
[00:43:40] (CR) CRusnov: Make the puppetdb backend process primitive types for queries. (5 comments) [software/cumin] - https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) (owner: CRusnov)
[00:44:23] (CR) jerkins-bot: [V: -1] Make the puppetdb backend process primitive types for queries.
[software/cumin] - https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) (owner: CRusnov)
[00:45:14] yay
[00:49:03] librenms-wmf: hello
[00:49:33] * [librenms-wmf] is logged in as librenms-wmf
[00:49:34] nice
[00:51:42] RoanKattouw: https://gerrit.wikimedia.org/r/#/c/474826/ <-- that's the one I was hoping to deploy tonight
[00:51:43] !log Gerrit: added Jeena Huneidi to wmf-deployers (T209722)
[00:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:47] T209722: Onboarding Jeena Huneidi - https://phabricator.wikimedia.org/T209722
[00:52:31] (CR) Catrope: [C: 2] Correct logos for the Sindhi Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/474826 (owner: Odder)
[00:53:32] (Merged) jenkins-bot: Correct logos for the Sindhi Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/474826 (owner: Odder)
[00:55:32] !log catrope@deploy1001 Synchronized static/images/project-logos/: Correct logos for Sindhi Wiktionary (duration: 00m 47s)
[00:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:56:31] (PS1) Bstorm: sonofgridengine: Fix tests and errors in the script [puppet] - https://gerrit.wikimedia.org/r/474827 (https://phabricator.wikimedia.org/T200557)
[00:56:32] odder: Deployed, and URLs purged
[00:57:48] RoanKattouw: Looks fabulous, thanks very much, they'll be very happy :)
[00:57:49] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 601.40 seconds
[00:58:29] (PS4) CRusnov: Make the puppetdb backend process primitive types for queries.
[software/cumin] - https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037)
[00:59:33] (CR) Bstorm: [C: 2] sonofgridengine: Fix tests and errors in the script [puppet] - https://gerrit.wikimedia.org/r/474827 (https://phabricator.wikimedia.org/T200557) (owner: Bstorm)
[01:00:39] Krenair: still have to puppetize the password, etc
[01:00:50] but working on it, more after dinner
[01:00:57] (CR) jerkins-bot: [V: -1] Make the puppetdb backend process primitive types for queries. [software/cumin] - https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) (owner: CRusnov)
[01:02:15] (CR) Dzahn: "this patch probably needs more reviewers and be SWAT deployed afterwards" [mediawiki-config] - https://gerrit.wikimedia.org/r/443585 (https://phabricator.wikimedia.org/T184961) (owner: MarcoAurelio)
[01:05:26] (Abandoned) Dzahn: stdlib: import useful data types (filemode,filesource,fqdn,host,port) [puppet] - https://gerrit.wikimedia.org/r/472363 (owner: Dzahn)
[01:09:21] Operations, MediaWiki-Uploading, Multimedia, Traffic, Wikimedia-Video: Uploading 1.2GB ogv results in 503 - https://phabricator.wikimedia.org/T128358 (zhuyifei1999) Pywikibot does support chunked uploading, this task is about why such chunked uploading must be using async mode, and if it mus...
[01:14:47] (CR) Dzahn: [C: 1] analytics_cluster::webserver: apache -> httpd module (2 comments) [puppet] - https://gerrit.wikimedia.org/r/416742 (owner: Dzahn)
[01:20:07] Operations, MediaWiki-Uploading, Multimedia, Traffic, Wikimedia-Video: Uploading 1.2GB ogv results in 503 - https://phabricator.wikimedia.org/T128358 (zhuyifei1999) On a side note, there are few tools/scripts that support async chunked uploading (UW, Rillke's script, and v2c are the only one...
[01:30:24] (PS8) Dzahn: analytics_cluster::webserver: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/416742
[01:37:59] (CR) Dzahn: "this works on thorium, but you have moved some sites to other hosts meanwhile,. right? i only see the following sites on it: analytics.w" [puppet] - https://gerrit.wikimedia.org/r/416742 (owner: Dzahn)
[01:38:47] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 265.06 seconds
[01:44:40] (PS9) Dzahn: analytics_cluster::webserver: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/416742
[01:54:56] (PS10) Dzahn: analytics_cluster::webserver: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/416742
[01:56:55] (CR) Dzahn: "change should be simpler now and affect fewer sites and only thorium.." [puppet] - https://gerrit.wikimedia.org/r/416742 (owner: Dzahn)
[02:03:46] (PS1) Bstorm: sonofgridengine: Read in horrible configuration output [puppet] - https://gerrit.wikimedia.org/r/474831 (https://phabricator.wikimedia.org/T200557)
[02:06:04] (PS1) Dzahn: hadoop::ui: migrate from apache to httpd module [puppet] - https://gerrit.wikimedia.org/r/474832
[02:06:43] (CR) Bstorm: [C: 2] sonofgridengine: Read in horrible configuration output [puppet] - https://gerrit.wikimedia.org/r/474831 (https://phabricator.wikimedia.org/T200557) (owner: Bstorm)
[02:13:46] (PS1) Ayounsi: LibreNMS IRC bot identify to Nickserv [puppet] - https://gerrit.wikimedia.org/r/474833 (https://phabricator.wikimedia.org/T209841)
[02:15:48] (PS1) Dzahn: turnilo: migrate from apache to httpd module [puppet] - https://gerrit.wikimedia.org/r/474834
[02:16:38] (CR) jerkins-bot: [V: -1] turnilo: migrate from apache to httpd module [puppet] - https://gerrit.wikimedia.org/r/474834 (owner: Dzahn)
[02:19:22] (PS1) Ayounsi: Add fake password for LibreNMS IRC bot [labs/private] - https://gerrit.wikimedia.org/r/474835
(https://phabricator.wikimedia.org/T209841)
[02:19:38] (PS2) Dzahn: turnilo: migrate from apache to httpd module [puppet] - https://gerrit.wikimedia.org/r/474834
[02:21:07] (CR) Ayounsi: [V: 2 C: 2] Add fake password for LibreNMS IRC bot [labs/private] - https://gerrit.wikimedia.org/r/474835 (https://phabricator.wikimedia.org/T209841) (owner: Ayounsi)
[02:21:35] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/13599/analytics-tool1002.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/474834 (owner: Dzahn)
[02:22:51] (CR) Ayounsi: [C: 2] "https://puppet-compiler.wmflabs.org/compiler1002/13600/netmon1002.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/474833 (https://phabricator.wikimedia.org/T209841) (owner: Ayounsi)
[02:23:10] (PS2) Ayounsi: LibreNMS IRC bot identify to Nickserv [puppet] - https://gerrit.wikimedia.org/r/474833 (https://phabricator.wikimedia.org/T209841)
[02:40:32] (CR) Ayounsi: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/474743 (https://phabricator.wikimedia.org/T209856) (owner: Alex Monk)
[02:48:40] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/13601/analytics-tool1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/474832 (owner: Dzahn)
[03:07:37] PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
[03:10:05] RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms
[03:10:37] (CR) BBlack: [C: 1] Bird anycast DNS, add BFD multicast support [puppet] - https://gerrit.wikimedia.org/r/474819 (owner: Ayounsi)
[03:17:28] PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
[03:17:55] Operations, Traffic: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050 (Dzahn) I think i did httpd setup for pretty much all of those except mx. Happy to help deploying more of them.
[03:19:44] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:19:52] RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.16 ms
[03:20:19] ^ T209395
[03:20:20] T209395: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 - https://phabricator.wikimedia.org/T209395
[03:24:26] PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
[03:27:02] RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms
[03:29:51] (PS11) Mathew.onipe: profile::maps::osm_master: change osmupdater and osmimporter auth method to peer [puppet] - https://gerrit.wikimedia.org/r/466574 (https://phabricator.wikimedia.org/T206639)
[03:31:35] (CR) Mathew.onipe: "Off to testing in deployment-maps05" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/466574 (https://phabricator.wikimedia.org/T206639) (owner: Mathew.onipe)
[03:41:00] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[03:54:11] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1552.23 seconds
[04:13:23] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 259.22 seconds
[06:17:45] (PS1) Marostegui: db-eqiad.php: Depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/474841 (https://phabricator.wikimedia.org/T86339)
[06:19:01] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/474841 (https://phabricator.wikimedia.org/T86339) (owner: Marostegui)
[06:20:03] (Merged) jenkins-bot: db-eqiad.php: Depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/474841 (https://phabricator.wikimedia.org/T86339) (owner: Marostegui)
[06:20:59] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - https://gerrit.wikimedia.org/r/474842
[06:21:30] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1082 - T86339 (duration: 00m 52s)
[06:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:33] T86339: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339
[06:21:34] !log Deploy schema change on db1082 - T86339
[06:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:19] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - https://gerrit.wikimedia.org/r/474842 (owner: Marostegui)
[06:25:27] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - https://gerrit.wikimedia.org/r/474842 (owner: Marostegui)
[06:28:13] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1082 - T86339 (duration: 00m 47s)
[06:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:16] T86339: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339
[06:28:18] !log Deploy schema change on db1070 (s5 master) - T86339
[06:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:52] Operations, MediaWiki-Uploading, Multimedia, Traffic, Wikimedia-Video: Uploading 1.2GB ogv results in 503 - https://phabricator.wikimedia.org/T128358 (Ijon) (@Tgr - my concern is for end-users not using Pywikibot, but the videoconvert tool. Thanks to @zhuyifei1999 and @Harej I learned that...
[06:51:44] Operations, MediaWiki-Uploading, Multimedia, Traffic, Wikimedia-Video: Uploading 1.2GB ogv results in 503 - https://phabricator.wikimedia.org/T128358 (Ijon) (unfortunately, videoconvert does not seem to be tracking issues here on Phabricator...)
[07:12:58] (CR) Vgutierrez: "@Ayounsi I wanted to keep you in the loop for this one :)" [puppet] - https://gerrit.wikimedia.org/r/474743 (https://phabricator.wikimedia.org/T209856) (owner: Alex Monk)
[07:15:14] (CR) Vgutierrez: librenms: Use certcentral cert (1 comment) [puppet] - https://gerrit.wikimedia.org/r/474743 (https://phabricator.wikimedia.org/T209856) (owner: Alex Monk)
[07:35:21] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/13602/" [puppet/cdh] - https://gerrit.wikimedia.org/r/474113 (owner: Elukey)
[07:45:49] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:47:11] Operations, ops-codfw: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (fgiunchedi) p:Triage>Normal
[07:49:36] Operations, ops-codfw: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (fgiunchedi) ipmi-sel ` 11 | Nov-16-2018 | 16:40:53 | MSR Info Log | OEM Reserved | OEM Event Offset = 00h 12 | Nov-16-2018 | 16:45:48 | CPU Machine Chk | Processor |...
[08:00:41] PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100%
[08:03:43] RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms
[08:04:35] sigh
[08:10:09] (CR) Alexandros Kosiaris: [C: 1] Bird anycast DNS, add BFD multicast support [puppet] - https://gerrit.wikimedia.org/r/474819 (owner: Ayounsi)
[08:11:41] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[08:16:24] (PS2) Muehlenhoff: Absent unused Diamond collector for ldap/corp [puppet] - https://gerrit.wikimedia.org/r/474698 (https://phabricator.wikimedia.org/T183454)
[08:19:12] (CR) Muehlenhoff: [C: 2] Absent unused Diamond collector for ldap/corp [puppet] - https://gerrit.wikimedia.org/r/474698 (https://phabricator.wikimedia.org/T183454) (owner: Muehlenhoff)
[08:20:24] (PS1) Jcrespo: mariadb: Depool es2011 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/474845
[08:21:01] PROBLEM - very high load average likely xfs on ms-be2044 is CRITICAL: CRITICAL - load average: 224.60, 114.83, 48.83
[08:22:30] Operations, Traffic: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050 (Krenair) Okay, first one is {T209856} and its next patch is https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/474743/ followed by https://gerrit.wikimedia.org/r/#/c...
[08:24:44] (PS2) Muehlenhoff: Remove Diamond from openldap/corp servers [puppet] - https://gerrit.wikimedia.org/r/474699 (https://phabricator.wikimedia.org/T183454)
[08:24:52] ms-be2044 is me
[08:27:29] (CR) Muehlenhoff: [C: 2] Remove Diamond from openldap/corp servers [puppet] - https://gerrit.wikimedia.org/r/474699 (https://phabricator.wikimedia.org/T183454) (owner: Muehlenhoff)
[08:31:46] (PS2) Muehlenhoff: Disable Diamond on WDQS hosts [puppet] - https://gerrit.wikimedia.org/r/474695 (https://phabricator.wikimedia.org/T183454)
[08:32:51] (CR) Muehlenhoff: [C: 2] Disable Diamond on WDQS hosts [puppet] - https://gerrit.wikimedia.org/r/474695 (https://phabricator.wikimedia.org/T183454) (owner: Muehlenhoff)
[08:34:31] (PS12) Elukey: Introduce new security directives for Yarn/HDFS/MapReduce/Hive/Oozie [puppet/cdh] - https://gerrit.wikimedia.org/r/474113
[08:38:25] (PS13) Elukey: Introduce new security directives for Yarn/HDFS/MapReduce/Hive/Oozie [puppet/cdh] - https://gerrit.wikimedia.org/r/474113
[08:46:07] (CR) Jcrespo: [C: 2] mariadb: Depool es2011 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/474845 (owner: Jcrespo)
[08:47:12] (Merged) jenkins-bot: mariadb: Depool es2011 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/474845 (owner: Jcrespo)
[08:48:20] Operations, monitoring, Patch-For-Review, User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (MoritzMuehlenhoff)
[08:48:56] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Depool es2011 (duration: 00m 47s)
[08:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:29] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:53:41] !log upgrade and reboot es2011
[08:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:45] (PS1) Marostegui: mariadb: Get ready to decom pc2004,pc2005,pc2006 [puppet] - https://gerrit.wikimedia.org/r/474847 (https://phabricator.wikimedia.org/T209858)
[08:56:57] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:57:35] PROBLEM - puppet last run on wdqs1010 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[diamond],Package[python-diamond]
[08:58:03] ^ that's me
[08:59:13] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational
[09:01:36] (CR) Marostegui: "https://puppet-compiler.wmflabs.org/compiler1002/13605/" [puppet] - https://gerrit.wikimedia.org/r/474847 (https://phabricator.wikimedia.org/T209858) (owner: Marostegui)
[09:02:10] (CR) Marostegui: [C: 2] mariadb: Get ready to decom pc2004,pc2005,pc2006 [puppet] - https://gerrit.wikimedia.org/r/474847 (https://phabricator.wikimedia.org/T209858) (owner: Marostegui)
[09:02:41] RECOVERY - puppet last run on wdqs1010 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[09:03:43] Operations, ops-codfw, DBA, decommission, Patch-For-Review: Decommission parsercache hosts: pc2004 pc2005 pc2006 - https://phabricator.wikimedia.org/T209858 (Marostegui)
[09:04:16] !log Remove pc2004, pc2005 and pc2006 from tendril and zarcillo - T209858
[09:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:20] T209858: Decommission parsercache hosts: pc2004 pc2005 pc2006 - https://phabricator.wikimedia.org/T209858
[09:05:03] !log powercycle elastic2021
[09:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:27] PROBLEM - Host elastic2021 is DOWN: PING CRITICAL -
Packet loss = 100%
[09:07:57] RECOVERY - MD RAID on elastic2021 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[09:08:01] RECOVERY - Host elastic2021 is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms
[09:08:48] mhhh expected reboot? ^
[09:09:31] godog: see gehel log above :)
[09:09:47] RECOVERY - puppet last run on elastic2021 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[09:09:48] godog: half expected
[09:10:06] ah nevermind, missed the !log, thanks
[09:11:51] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[09:12:03] RECOVERY - Check systemd state on elastic2021 is OK: OK - running: The system is fully operational
[09:12:59] !log Stop MySQL on pc2004, pc2005 and pc2006 for decommission - T209858
[09:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:03] T209858: Decommission parsercache hosts: pc2004 pc2005 pc2006 - https://phabricator.wikimedia.org/T209858
[09:13:22] (PS1) Jcrespo: Revert "mariadb: Depool es2011 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/474848
[09:14:02] (CR) Jcrespo: [C: 2] Revert "mariadb: Depool es2011 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/474848 (owner: Jcrespo)
[09:14:46] Operations, ops-codfw, DBA, decommission, Patch-For-Review: Decommission parsercache hosts: pc2004 pc2005 pc2006 - https://phabricator.wikimedia.org/T209858 (Marostegui)
[09:15:04] (Merged) jenkins-bot: Revert "mariadb: Depool es2011 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/474848 (owner: Jcrespo)
[09:15:39] Operations, ops-codfw, DBA, decommission, Patch-For-Review: Decommission parsercache hosts: pc2004 pc2005 pc2006 - https://phabricator.wikimedia.org/T209858 (Marostegui) a:Marostegui>RobH These hosts are now ready for DCOps to take over. MySQL has been stopped on them too.
[09:19:53] (PS1) Jcrespo: mariadb: Depool es2014 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/474850
[09:21:14] (CR) Jcrespo: [C: 2] mariadb: Depool es2014 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/474850 (owner: Jcrespo)
[09:21:34] PROBLEM - very high load average likely xfs on ms-be2045 is CRITICAL: CRITICAL - load average: 283.09, 136.85, 53.56
[09:22:14] (Merged) jenkins-bot: mariadb: Depool es2014 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/474850 (owner: Jcrespo)
[09:22:46] yes yes, load known
[09:23:05] !log stress-test new ms-be hardware - T209395
[09:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:09] T209395: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 - https://phabricator.wikimedia.org/T209395
[09:23:11] Operations, Traffic, Patch-For-Review: ATS: log inspection at runtime - https://phabricator.wikimedia.org/T204225 (ema) Yesterday all ATS hosts ran out of disk space. That's due to trafficserver logging several messages like the following: File:/var/log/trafficserver/notpurge.pipe was closed, have...
[09:23:40] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Repool es2011, depool es2014 (duration: 00m 46s)
[09:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:08] PROBLEM - very high load average likely xfs on ms-be2048 is CRITICAL: CRITICAL - load average: 283.22, 136.90, 53.62
[09:24:21] I've silenced those now
[09:25:25] (CR) Elukey: "Andrew: after trying to work on profile::hadoop::common to adapt it on this patch, I realized that the code that I wrote was really cumber" [puppet/cdh] - https://gerrit.wikimedia.org/r/474113 (owner: Elukey)
[09:25:57] !log upgrade and reboot es2014
[09:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:12] !log Deploy schema change on s2 codfw master (db2035) with replication - T86339
[09:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:15] T86339: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339
[09:28:00] (CR) Arturo Borrero Gonzalez: [C: 1] "LGTM thanks!"
[puppet] - https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) (owner: GTirloni)
[09:32:02] (PS1) Muehlenhoff: Remove server-board [puppet] - https://gerrit.wikimedia.org/r/474855 (https://phabricator.wikimedia.org/T183454)
[09:32:27] (CR) jerkins-bot: [V: -1] Remove server-board [puppet] - https://gerrit.wikimedia.org/r/474855 (https://phabricator.wikimedia.org/T183454) (owner: Muehlenhoff)
[09:32:53] (PS2) Muehlenhoff: Remove server-board [puppet] - https://gerrit.wikimedia.org/r/474855 (https://phabricator.wikimedia.org/T183454)
[09:34:22] (PS1) Elukey: [WIP] Allow extra security parameters for hadoop/oozie/hive profiles [puppet] - https://gerrit.wikimedia.org/r/474856
[09:34:32] !log Deploy schema change on s2 hosts: dbstore1002, db1090:3312 and db1095:3312 - T86339
[09:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:36] T86339: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339
[09:35:03] (CR) Arturo Borrero Gonzalez: "I wonder which effect this patch may, since all of our 'version' is now 'mitaka', right?" [puppet] - https://gerrit.wikimedia.org/r/474795 (https://phabricator.wikimedia.org/T200557) (owner: Bstorm)
[09:35:07] (CR) Elukey: "Andrew: the correspondent puppet patch would be https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/474856/" [puppet/cdh] - https://gerrit.wikimedia.org/r/474113 (owner: Elukey)
[09:35:54] (PS1) Jcrespo: Revert "mariadb: Depool es2014 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/474857
[09:36:08] Operations, ops-codfw: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (fgiunchedi) @Papaul have you seen this before? the host is not in service, you can take it down for troubleshooting at any time. unfortunately mcelog doesn't seem very helpful either: ` root@ms-be2047:~# j...
[09:45:14] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:46:47] (PS1) Muehlenhoff: Remove ApacheStatusSimpleCollector [puppet] - https://gerrit.wikimedia.org/r/474864 (https://phabricator.wikimedia.org/T183454)
[09:58:28] (PS1) Muehlenhoff: Remove PyBalStateCollector [puppet] - https://gerrit.wikimedia.org/r/474865 (https://phabricator.wikimedia.org/T183454)
[10:00:03] (CR) Thiemo Kreuz (WMDE): [C: 1] Doc: add repoConceptBaseUri comment [mediawiki-config] - https://gerrit.wikimedia.org/r/473292 (https://phabricator.wikimedia.org/T209352) (owner: Niedzielski)
[10:00:14] (CR) Volans: [C: -1] "Few things that needs to be change plus some possible improvements inline." (14 comments) [puppet] - https://gerrit.wikimedia.org/r/472713 (https://phabricator.wikimedia.org/T203208) (owner: Dzahn)
[10:00:57] (PS1) Filippo Giunchedi: prometheus: update tools k8s config [puppet] - https://gerrit.wikimedia.org/r/474866 (https://phabricator.wikimedia.org/T209893)
[10:02:49] (CR) Jcrespo: [C: 2] Revert "mariadb: Depool es2014 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/474857 (owner: Jcrespo)
[10:03:55] (Merged) jenkins-bot: Revert "mariadb: Depool es2014 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/474857 (owner: Jcrespo)
[10:04:08] (CR) Filippo Giunchedi: [C: 2] "This should be enough at least to start the instance back up, likely more tweaks on the Prometheus and/or k8s sides are needed."
[puppet] - https://gerrit.wikimedia.org/r/474866 (https://phabricator.wikimedia.org/T209893) (owner: Filippo Giunchedi)
[10:06:28] (PS1) Jcrespo: mariadb: Depool es2018 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/474869
[10:11:26] (CR) Jcrespo: [C: 2] mariadb: Depool es2018 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/474869 (owner: Jcrespo)
[10:11:45] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[10:12:23] (Abandoned) Muehlenhoff: Configure Kerberos support for Druid (WIP) [puppet] - https://gerrit.wikimedia.org/r/474188 (owner: Muehlenhoff)
[10:12:33] (Merged) jenkins-bot: mariadb: Depool es2018 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/474869 (owner: Jcrespo)
[10:13:36] Operations, Wikimedia-Apache-configuration: Redirect from zh-yue.wiktionary.org is not working properly - https://phabricator.wikimedia.org/T209693 (Hello903hello) > In T209693#4757863, @ArielGlenn wrote: > > So why does yue.wikipedia go to zh-yue.wikipedia? T10217 and T30441 Initially created at zh-yue...
[10:13:57] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Repool es2014, depool es2018 (duration: 00m 46s)
[10:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:29] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1094.eqiad.wmnet'...
[10:17:09] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (elukey) >>!
In T207192#4757504, @elukey wrote: > @Cmjohnson the Debian OS install is in progress, but I think that an-worker109[45] h...
[10:17:26] !log upgrade and reboot es2018
[10:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:11] (PS2) Filippo Giunchedi: prometheus: update tools k8s config [puppet] - https://gerrit.wikimedia.org/r/474866 (https://phabricator.wikimedia.org/T209893)
[10:22:30] (PS2) Banyek: mariadb: productionize dbproxy1015 and dbproxy1016 [puppet] - https://gerrit.wikimedia.org/r/473546 (https://phabricator.wikimedia.org/T202367)
[10:24:37] RECOVERY - very high load average likely xfs on ms-be2044 is OK: OK - load average: 0.06, 4.76, 77.52
[10:27:27] (PS12) Mathew.onipe: profile::maps::osm_master: change osmupdater and osmimporter auth method to peer [puppet] - https://gerrit.wikimedia.org/r/466574 (https://phabricator.wikimedia.org/T206639)
[10:30:48] (PS19) GTirloni: toolforge: Refactor clush [puppet] - https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701)
[10:31:45] (CR) GTirloni: [C: 2] toolforge: Refactor clush [puppet] - https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) (owner: GTirloni)
[10:32:56] (CR) GTirloni: "Thanks for the review, Arturo & Brooke!" [puppet] - https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) (owner: GTirloni)
[10:34:51] (PS1) Banyek: mariadb: depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/474874 (https://phabricator.wikimedia.org/T85757)
[10:35:01] (PS1) Jcrespo: Revert "mariadb: Depool es2018 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/474875
[10:36:01] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:36:23] (CR) jerkins-bot: [V: -1] mariadb: depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/474874 (https://phabricator.wikimedia.org/T85757) (owner: Banyek)
[10:37:23] Operations, Discovery-Search, Elasticsearch, Maps: Review Elastic/maps Grafana dashboards - https://phabricator.wikimedia.org/T209812 (Mathew.onipe)
[10:37:37] (PS2) Banyek: mariadb: depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/474874 (https://phabricator.wikimedia.org/T85757)
[10:38:55] RECOVERY - very high load average likely xfs on ms-be2045 is OK: OK - load average: 0.01, 5.01, 78.05
[10:41:26] (CR) Jcrespo: [C: 2] Revert "mariadb: Depool es2018 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/474875 (owner: Jcrespo)
[10:42:01] RECOVERY - very high load average likely xfs on ms-be2048 is OK: OK - load average: 0.24, 4.54, 75.20
[10:42:07] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[10:42:28] (Merged) jenkins-bot: Revert "mariadb: Depool es2018 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/474875 (owner: Jcrespo)
[10:43:31] (CR) Gehel: [C: 1] "LGTM, will merge later today" [puppet] - https://gerrit.wikimedia.org/r/466574 (https://phabricator.wikimedia.org/T206639) (owner: Mathew.onipe)
[10:44:20] (CR) Marostegui: [C: -1] mariadb: depool db1093 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/474874 (https://phabricator.wikimedia.org/T85757) (owner: Banyek)
[10:45:44] (PS3) Banyek: mariadb: depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/474874 (https://phabricator.wikimedia.org/T85757)
[10:47:28] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Repool es2018 (duration: 00m 46s)
[10:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:45] (CR) Marostegui: [C: 1] mariadb: depool db1093 [mediawiki-config] -
https://gerrit.wikimedia.org/r/474874 (https://phabricator.wikimedia.org/T85757) (owner: Banyek)
[10:47:57] (CR) Gehel: "There is still something wrong here, the problem is deeper than what it seems." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/474807 (owner: EBernhardson)
[10:48:14] !log depooling db1093
[10:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:19] !log depooling db1093 (T85757)
[10:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:22] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757
[10:48:43] (CR) Banyek: [C: 2] mariadb: depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/474874 (https://phabricator.wikimedia.org/T85757) (owner: Banyek)
[10:49:31] (CR) Banyek: [V: 2 C: 2] mariadb: depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/474874 (https://phabricator.wikimedia.org/T85757) (owner: Banyek)
[10:49:56] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1094.eqiad.wmnet', 'an-worker1095.eqiad.wmnet'] ` and were **ALL**...
[10:50:07] Operations, Discovery-Search, Elasticsearch, Maps: Review Elastic/maps Grafana dashboards - https://phabricator.wikimedia.org/T209812 (Mathew.onipe)
[10:51:14] (PS1) Gehel: Revert "Set default elasticsearch cluster name" [puppet] - https://gerrit.wikimedia.org/r/474878
[10:51:45] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T85757: repool db1093 (duration: 00m 47s)
[10:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:06] (CR) jerkins-bot: [V: -1] Revert "Set default elasticsearch cluster name" [puppet] - https://gerrit.wikimedia.org/r/474878 (owner: Gehel)
[10:52:59] PROBLEM - Long running screen/tmux on cp2021 is CRITICAL: connect to address 10.192.48.25 port 5666: Connection refused
[10:54:03] Operations, Discovery-Search, Elasticsearch, Maps: Review Elastic/maps Grafana dashboards - https://phabricator.wikimedia.org/T209812 (Mathew.onipe)
[10:54:09] (PS2) Gehel: Revert "Set default elasticsearch cluster name" [puppet] - https://gerrit.wikimedia.org/r/474878
[10:54:11] (PS1) Gehel: elasticsearch: cluster_hosts shoudl be FQDN, not just host names [puppet] - https://gerrit.wikimedia.org/r/474879
[10:56:25] banyek: are you pooling or repooling?
[10:56:36] your last 2 logs contradict
[10:56:52] depooling
[10:57:06] first the phab.
id was not there
[10:57:07] [11:51] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T85757: repool db1093 (duration: 00m 47s)
[10:57:08] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757
[10:57:20] the keyboard was too sensitive, and I accidentally hit enter with my pinky
[10:57:28] no problem
[10:57:33] what I do when that happens
[10:57:42] is logging the clarification afterwards
[10:59:22] oh, the scap one
[10:59:23] sure
[10:59:42] !log db1093 was depooled wrong message sent
[10:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:10] !log stop and upgrade db2086
[11:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:17] Operations, Discovery-Search, Elasticsearch, Maps: Review Elastic/maps Grafana dashboards - https://phabricator.wikimedia.org/T209812 (Mathew.onipe)
[11:00:23] I am upgrading some codfw outdated hosts
[11:00:27] FYI
[11:01:02] (CR) jenkins-bot: Enable WelcomeSurvey on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/474761 (https://phabricator.wikimedia.org/T209725) (owner: Catrope)
[11:01:04] (CR) jenkins-bot: Enable and configure Welcome survey on kowiki and cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/474331 (https://phabricator.wikimedia.org/T209725) (owner: Sbisson)
[11:01:06] (CR) jenkins-bot: Prod: increase Schema.org page split test to 100% sampling [mediawiki-config] - https://gerrit.wikimedia.org/r/473227 (https://phabricator.wikimedia.org/T208755) (owner: Niedzielski)
[11:01:08] (CR) jenkins-bot: deployment-prep: Update BounceHandler deployment-mx02 IP for migration [mediawiki-config] - https://gerrit.wikimedia.org/r/474823 (https://phabricator.wikimedia.org/T208101) (owner: Alex Monk)
[11:01:10] (CR) jenkins-bot: Correct logos for the Sindhi Wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/474826 (owner: Odder)
[11:01:12] (CR) jenkins-bot: db-eqiad.php: Depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/474841 (https://phabricator.wikimedia.org/T86339) (owner: Marostegui)
[11:01:14] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1082" [mediawiki-config] - https://gerrit.wikimedia.org/r/474842 (owner: Marostegui)
[11:01:16] (CR) jenkins-bot: mariadb: Depool es2011 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/474845 (owner: Jcrespo)
[11:01:18] (CR) jenkins-bot: Revert "mariadb: Depool es2011 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/474848 (owner: Jcrespo)
[11:01:20] (CR) jenkins-bot: mariadb: Depool es2014 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/474850 (owner: Jcrespo)
[11:01:22] (CR) jenkins-bot: Revert "mariadb: Depool es2014 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/474857 (owner: Jcrespo)
[11:01:25] (CR) jenkins-bot: mariadb: Depool es2018 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/474869 (owner: Jcrespo)
[11:01:26] (CR) jenkins-bot: Revert "mariadb: Depool es2018 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/474875 (owner: Jcrespo)
[11:01:29] (CR) jenkins-bot: mariadb: depool db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/474874 (https://phabricator.wikimedia.org/T85757) (owner: Banyek)
[11:05:57] !log executing schema change on db1093 (T85757)
[11:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:00] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757
[11:08:07] (CR) DCausse: "it's unclear in the various commit messages what was the original problem we tried to solve."
[puppet] - https://gerrit.wikimedia.org/r/474878 (owner: Gehel)
[11:11:53] !log repooling db1093 (T85757)
[11:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:56] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757
[11:12:31] (PS1) Banyek: Revert "mariadb: depool db1093" [mediawiki-config] - https://gerrit.wikimedia.org/r/474882
[11:14:17] (CR) Banyek: [C: 2] Revert "mariadb: depool db1093" [mediawiki-config] - https://gerrit.wikimedia.org/r/474882 (owner: Banyek)
[11:15:00] (CR) jenkins-bot: Revert "mariadb: depool db1093" [mediawiki-config] - https://gerrit.wikimedia.org/r/474882 (owner: Banyek)
[11:16:10] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T85757: (now really) repool db1093 (duration: 00m 47s)
[11:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:29] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:27:24] !log stop and upgrade db2087
[11:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:05] (PS1) Banyek: mariadb: depooling db1113 [mediawiki-config] - https://gerrit.wikimedia.org/r/474883 (https://phabricator.wikimedia.org/T85757)
[11:28:36] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (elukey) Open>Resolved
[11:32:05] Operations, MediaWiki-Cache, Performance-Team (Radar), User-Elukey: mcrouter does not remove a memcached shard from consistent hashing when timeouts happen - https://phabricator.wikimedia.org/T208934 (elukey) In the facebook's gh issue the following was mentioned: >So the recommended solution i...
[11:40:17] (PS1) Mobrovac: RunSingleJob: Check that JobExecutor has been loaded [mediawiki-config] - https://gerrit.wikimedia.org/r/474885 (https://phabricator.wikimedia.org/T208922)
[11:40:52] (PS1) Muehlenhoff: Remove absented Diamond collector for NTP [puppet] - https://gerrit.wikimedia.org/r/474887 (https://phabricator.wikimedia.org/T183454)
[11:41:29] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[11:44:21] (CR) jerkins-bot: [V: -1] RunSingleJob: Check that JobExecutor has been loaded [mediawiki-config] - https://gerrit.wikimedia.org/r/474885 (https://phabricator.wikimedia.org/T208922) (owner: Mobrovac)
[11:45:26] (PS2) Mobrovac: RunSingleJob: Check that JobExecutor has been loaded [mediawiki-config] - https://gerrit.wikimedia.org/r/474885 (https://phabricator.wikimedia.org/T208922)
[11:48:37] (PS1) Ladsgroup: Add federation configs for beta [mediawiki-config] - https://gerrit.wikimedia.org/r/474889 (https://phabricator.wikimedia.org/T204748)
[11:55:19] (PS1) Alex Monk: deployment-prep: Update cache-upload private IP [mediawiki-config] - https://gerrit.wikimedia.org/r/474890 (https://phabricator.wikimedia.org/T208101)
[11:55:53] !log rolling reboot of proton hosts for kernel security update
[11:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:27] (PS1) Alex Monk: openstack: remove out of date deployment-cache-upload04 IPs [puppet] - https://gerrit.wikimedia.org/r/474891 (https://phabricator.wikimedia.org/T208101)
[11:58:19] (CR) 20after4: [C: 1] openstack: remove out of date deployment-cache-upload04 IPs [puppet] - https://gerrit.wikimedia.org/r/474891 (https://phabricator.wikimedia.org/T208101) (owner: Alex Monk)
[12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments.
Rise for European Mid-day SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181120T1200).
[12:00:05] Zoranzoki21: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[12:00:13] I can SWAT today
[12:00:32] Yes sir :)
[12:00:55] I'm gonna have some stuff to throw in if that's okay zeljkof
[12:01:01] deployment-prep migration things
[12:01:10] Krenair: sure, want to deploy yourself?
[12:01:25] I haven't had deployment access for almost two years now
[12:01:35] ah
[12:01:42] in that case, I can deploy
[12:03:19] Krenair: please add your patches to the calendar
[12:03:19] cool
[12:03:22] yeah will do
[12:03:53] Zoranzoki21: so, this one first? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/472791
[12:04:32] (PS1) Alex Monk: deployment-prep: Update deployment-db* IPs [mediawiki-config] - https://gerrit.wikimedia.org/r/474892 (https://phabricator.wikimedia.org/T208101)
[12:05:20] Wait
[12:05:47] Yes
[12:06:51] ok
[12:07:05] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/472791 (https://phabricator.wikimedia.org/T150618) (owner: Zoranzoki21)
[12:07:17] Zoranzoki21: then https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/472792 next?
[12:07:45] zeljkof, done
[12:07:57] Yes zeljkof
[12:08:10] (PS1) GTirloni: toolforge: Move ::clush::target to base.pp [puppet] - https://gerrit.wikimedia.org/r/474894 (https://phabricator.wikimedia.org/T209701)
[12:08:29] Zoranzoki21: conflict for https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/472792
[12:08:41] gerrit can not resolve automatically :/
[12:08:51] I should rebase?
[12:08:58] please do
[12:10:08] I am back
[12:10:27] (Merged) jenkins-bot: Upload HD logos for multiple projects [mediawiki-config] - https://gerrit.wikimedia.org/r/472791 (https://phabricator.wikimedia.org/T150618) (owner: Zoranzoki21)
[12:12:28] Operations, DBA: BBU Fail on dbstore2002 - https://phabricator.wikimedia.org/T208320 (Banyek) >>! In T208320#4738830, @Marostegui wrote: > I have eased replication consistency flags and it is now catching up. > What do you mean with "it is not compressed"? that you are running the alter tables to compres...
[12:13:19] Doing rebase
[12:13:22] !log zfilipin@deploy1001 Synchronized static/images/project-logos/: SWAT: [[gerrit:472791|Upload HD logos for multiple projects (T150618)]] (duration: 00m 48s)
[12:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:13:26] T150618: Provide HD logos for all projects - https://phabricator.wikimedia.org/T150618
[12:13:36] (CR) Arturo Borrero Gonzalez: "I think the lookup() cleanup deserves a patch on his own :-)" [puppet] - https://gerrit.wikimedia.org/r/474894 (https://phabricator.wikimedia.org/T209701) (owner: GTirloni)
[12:13:52] (PS5) Zoranzoki21: Use HD logos in InitialiseSettings.php for multiple projects [mediawiki-config] - https://gerrit.wikimedia.org/r/472792 (https://phabricator.wikimedia.org/T150618)
[12:14:02] Zoranzoki21: 472791 deployed
[12:14:33] (CR) jerkins-bot: [V: -1] Use HD logos in InitialiseSettings.php for multiple projects [mediawiki-config] - https://gerrit.wikimedia.org/r/472792 (https://phabricator.wikimedia.org/T150618) (owner: Zoranzoki21)
[12:15:29] (PS1) Zoranzoki21: Use HD logos in InitialiseSettings.php for multiple projects [mediawiki-config] - https://gerrit.wikimedia.org/r/474897 (https://phabricator.wikimedia.org/T150618)
[12:15:34] (CR) jenkins-bot: Upload HD logos for multiple projects [mediawiki-config] - https://gerrit.wikimedia.org/r/472791
(https://phabricator.wikimedia.org/T150618) (owner: Zoranzoki21)
[12:16:11] (PS2) GTirloni: toolforge: Move ::clush::target to base.pp [puppet] - https://gerrit.wikimedia.org/r/474894 (https://phabricator.wikimedia.org/T209701)
[12:16:33] (PS3) GTirloni: toolforge: Move ::clush::target to base.pp [puppet] - https://gerrit.wikimedia.org/r/474894 (https://phabricator.wikimedia.org/T209701)
[12:17:07] (CR) GTirloni: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/474894 (https://phabricator.wikimedia.org/T209701) (owner: GTirloni)
[12:17:23] (PS2) Zfilipin: Add tboverride permission to extendedmover group on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/474458 (https://phabricator.wikimedia.org/T209753) (owner: Zoranzoki21)
[12:17:36] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/474458 (https://phabricator.wikimedia.org/T209753) (owner: Zoranzoki21)
[12:18:06] (PS2) Zoranzoki21: Use HD logos in InitialiseSettings.php for multiple projects [mediawiki-config] - https://gerrit.wikimedia.org/r/474897 (https://phabricator.wikimedia.org/T150618)
[12:18:15] (CR) GTirloni: [C: 2] toolforge: Move ::clush::target to base.pp [puppet] - https://gerrit.wikimedia.org/r/474894 (https://phabricator.wikimedia.org/T209701) (owner: GTirloni)
[12:18:31] (PS2) Ladsgroup: Add federation configs for beta [mediawiki-config] - https://gerrit.wikimedia.org/r/474889 (https://phabricator.wikimedia.org/T204748)
[12:18:50] Once you all are done, let me know and I have a noop deployment (beta only)
[12:18:51] (Abandoned) Zoranzoki21: Use HD logos in InitialiseSettings.php for multiple projects [mediawiki-config] - https://gerrit.wikimedia.org/r/472792 (https://phabricator.wikimedia.org/T150618) (owner: Zoranzoki21)
[12:19:03] !log powercycling db2087, stuck on reboot
[12:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:08] Amir1:
sure
[12:19:14] Thanks
[12:20:06] (PS1) 20after4: Update deployment-db3 and -db4 to new IPS in 172.16.5 (eqiad1-r) [mediawiki-config] - https://gerrit.wikimedia.org/r/474898 (https://phabricator.wikimedia.org/T208101)
[12:20:46] (CR) 20after4: [C: 1] Update deployment-db3 and -db4 to new IPS in 172.16.5 (eqiad1-r) [mediawiki-config] - https://gerrit.wikimedia.org/r/474898 (https://phabricator.wikimedia.org/T208101) (owner: 20after4)
[12:21:16] (Merged) jenkins-bot: Add tboverride permission to extendedmover group on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/474458 (https://phabricator.wikimedia.org/T209753) (owner: Zoranzoki21)
[12:21:54] zeljkof: mwdebug1002?
[12:22:14] 474458 is at mwdebug1002
[12:22:39] testing..
[12:23:11] (CR) 20after4: [C: 2] "oops I submitted the same change, will just merge yours ;)" [mediawiki-config] - https://gerrit.wikimedia.org/r/474892 (https://phabricator.wikimedia.org/T208101) (owner: Alex Monk)
[12:23:19] For me looks good, but check logs if there are not errors
[12:23:59] (Abandoned) 20after4: Update deployment-db3 and -db4 to new IPS in 172.16.5 (eqiad1-r) [mediawiki-config] - https://gerrit.wikimedia.org/r/474898 (https://phabricator.wikimedia.org/T208101) (owner: 20after4)
[12:24:14] Zoranzoki21: nothing strange in logs, deploying
[12:24:25] PROBLEM - very high load average likely xfs on ms-be2049 is CRITICAL: CRITICAL - load average: 297.17, 296.88, 295.76
[12:24:31] PROBLEM - very high load average likely xfs on ms-be2046 is CRITICAL: CRITICAL - load average: 295.94, 296.26, 295.48
[12:24:39] PROBLEM - very high load average likely xfs on ms-be2044 is CRITICAL: CRITICAL - load average: 296.43, 296.60, 295.72
[12:24:49] PROBLEM - very high load average likely xfs on ms-be2048 is CRITICAL: CRITICAL - load average: 296.76, 296.86, 295.85
[12:24:51] PROBLEM - very high load average likely xfs on ms-be2045 is CRITICAL: CRITICAL - load average: 296.22, 296.55, 295.80
[12:24:53] PROBLEM - very high load average likely xfs on ms-be2050 is CRITICAL: CRITICAL - load average: 297.28, 296.74, 295.66
[12:25:07] zeljkof: Ok. ms-be* is not related to me?
[12:25:17] I don't think so
[12:25:25] (PS3) Zoranzoki21: Use HD logos in InitialiseSettings.php for multiple projects [mediawiki-config] - https://gerrit.wikimedia.org/r/474897 (https://phabricator.wikimedia.org/T150618)
[12:25:38] zeljkof: Ok, next patch is https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/474897/
[12:25:54] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:474458|Add tboverride permission to extendedmover group on enwiki (T209753)]] (duration: 00m 47s)
[12:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:58] T209753: Add tboverride permission to extendedmover group on enwiki - https://phabricator.wikimedia.org/T209753
[12:26:05] Zoranzoki21: 474458 deployed
[12:26:43] Ok, thanks zeljkof. Next is https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/474897/
[12:27:14] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/474897 (https://phabricator.wikimedia.org/T150618) (owner: Zoranzoki21)
[12:27:18] Zoranzoki21: ok
[12:28:15] (Merged) jenkins-bot: Use HD logos in InitialiseSettings.php for multiple projects [mediawiki-config] - https://gerrit.wikimedia.org/r/474897 (https://phabricator.wikimedia.org/T150618) (owner: Zoranzoki21)
[12:28:27] (CR) jenkins-bot: Add tboverride permission to extendedmover group on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/474458 (https://phabricator.wikimedia.org/T209753) (owner: Zoranzoki21)
[12:28:32] (CR) jenkins-bot: Use HD logos in InitialiseSettings.php for multiple projects [mediawiki-config] - https://gerrit.wikimedia.org/r/474897 (https://phabricator.wikimedia.org/T150618) (owner: Zoranzoki21)
[12:28:46] (PS1) Alex Monk: deployment-prep:
Update parsoid09 IP [mediawiki-config] - https://gerrit.wikimedia.org/r/474899 (https://phabricator.wikimedia.org/T208101)
[12:29:05] Zoranzoki21: 474897 is at mwdebug1002
[12:29:08] twentyafterfour, zeljkof: mind if I add one more?
[12:29:20] Krenair: sure, we are only limited by time
[12:29:23] zeljkof: Looks good, no need to be tested.
[12:29:57] Zoranzoki21: ok, deploying
[12:30:05] thanks, done
[12:30:40] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:474897|Use HD logos in InitialiseSettings.php for multiple projects (T150618)]] (duration: 00m 48s)
[12:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:30:43] T150618: Provide HD logos for all projects - https://phabricator.wikimedia.org/T150618
[12:31:04] Zoranzoki21: deployed
[12:31:42] Ok, then deploy https://gerrit.wikimedia.org/r/c/472745/ and it's all from me
[12:31:47] Zoranzoki21: merge conflict for 472745, please rebase
[12:33:07] zeljkof: working on it
[12:33:57] Krenair: can you test your changes at mwdebug1002? or should I deploy them without testing?
[12:35:07] (PS5) Zoranzoki21: Disable FlaggedRevs, enable RC patrol and add rights on srwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/472745 (https://phabricator.wikimedia.org/T209251)
[12:35:30] zeljkof: I have to make new PS again
[12:35:46] (Abandoned) Zoranzoki21: Disable FlaggedRevs, enable RC patrol and add rights on srwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/472745 (https://phabricator.wikimedia.org/T209251) (owner: Zoranzoki21)
[12:35:57] zeljkof, they're all deployment-prep only
[12:36:03] there should not be any difference in production
[12:36:14] (PS1) Zoranzoki21: Disable FlaggedRevs, enable RC patrol and add rights on srwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/474900 (https://phabricator.wikimedia.org/T209251)
[12:36:40] Krenair: ok, so I just merge and deploy then?
[12:37:02] yes
[12:38:38] zeljkof: I will move this change in next SWAT
[12:38:41] *my change
[12:38:44] Zoranzoki21: ok
[12:38:53] zeljkof: I going in school now, cya
[12:38:54] Krenair: I'll let you know when I deploy all changes
[12:39:03] thanks zeljkof
[12:39:26] (Abandoned) Zoranzoki21: Disable FlaggedRevs, enable RC patrol and add rights on srwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/474900 (https://phabricator.wikimedia.org/T209251) (owner: Zoranzoki21)
[12:40:03] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/474890 (https://phabricator.wikimedia.org/T208101) (owner: Alex Monk)
[12:41:00] Operations, Wikimedia-Apache-configuration: Redirect from zh-yue.wiktionary.org is not working properly - https://phabricator.wikimedia.org/T209693 (ArielGlenn) Ok, I have read all the dang back tickets and thanks everyone for their comments. I am skipping wikisource and betawikiversity because it's more...
[12:41:11] (PS1) ArielGlenn: add redirects of various zh-yue projects to yue [puppet] - https://gerrit.wikimedia.org/r/474901 (https://phabricator.wikimedia.org/T209693)
[12:41:48] Operations, Wikimedia-Apache-configuration, Patch-For-Review: Redirect from zh-yue.wiktionary.org is not working properly - https://phabricator.wikimedia.org/T209693 (ArielGlenn) Needless to say I'd like a bunch of eyes on this to make sure it's right. Thanks in advance.
[12:42:43] (Merged) jenkins-bot: deployment-prep: Update cache-upload private IP [mediawiki-config] - https://gerrit.wikimedia.org/r/474890 (https://phabricator.wikimedia.org/T208101) (owner: Alex Monk)
[12:42:47] (Merged) jenkins-bot: deployment-prep: Update deployment-db* IPs [mediawiki-config] - https://gerrit.wikimedia.org/r/474892 (https://phabricator.wikimedia.org/T208101) (owner: Alex Monk)
[12:45:38] !log zfilipin@deploy1001 Synchronized wmf-config/reverse-proxy-staging.php: SWAT: [[gerrit:474890|deployment-prep: Update cache-upload private IP (T208101)]] (duration: 00m 45s)
[12:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:41] T208101: Migrate deployment-prep to eqiad1 - https://phabricator.wikimedia.org/T208101
[12:47:04] twentyafterfour: I see you've merged 474892, it's eu swat time, I'll deploy it
[12:47:48] zeljkof: ok sorry I didn't notice swat was ongoing when I merged it.
[12:48:19] twentyafterfour: no problem
[12:49:47] !log zfilipin@deploy1001 scap failed: average error rate on 4/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details)
[12:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:12] twentyafterfour, Krenair uh oh, scap failed for 474892
[12:51:08] zeljkof: that's odd since it should only affect beta
[12:51:17] what did it say?
[12:51:34] I'm not sure if it's related, but fatal monitor had a big spike :/
[12:51:40] ErrorException from line 269 of /srv/mediawiki/php-1.33.0-wmf.4/extensions/PagedTiffHandler/PagedTiffHandler_body.php: PHP Notice: Undefined index: 227
[12:51:40] huh...
[12:51:49] that shouldn't be related
[12:52:09] and looks like it's back to normal... I'll wait another minute
[12:52:25] twentyafterfour, I have to go to a lecture shortly, mind taking care of this?
[12:52:54] twentyafterfour: should I try re-deploying?
fatal monitor is back to normal, looks like it was just bad timing
[12:53:36] Krenair: no prob
[12:53:38] thanks
[12:53:51] (CR) jenkins-bot: deployment-prep: Update cache-upload private IP [mediawiki-config] - https://gerrit.wikimedia.org/r/474890 (https://phabricator.wikimedia.org/T208101) (owner: Alex Monk)
[12:53:53] !log setting innodb_flush_log_at_trx_commit to 2 on dbstore2002 (T208320)
[12:53:55] (CR) jenkins-bot: deployment-prep: Update deployment-db* IPs [mediawiki-config] - https://gerrit.wikimedia.org/r/474892 (https://phabricator.wikimedia.org/T208101) (owner: Alex Monk)
[12:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:56] T208320: BBU Fail on dbstore2002 - https://phabricator.wikimedia.org/T208320
[12:54:06] zeljkof: yeah I think it might have just been unfortunate timing
[12:54:19] ok, I'll try the deploy once more
[12:54:20] it's almost certainly not caused by that patch
[12:55:08] !log setting innodb_flush_log_at_trx_commit to 2 on dbstore2002 (s3 instance only!) (T208320)
[12:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:22] !log zfilipin@deploy1001 Synchronized wmf-config/db-labs.php: SWAT: [[gerrit:474892|deployment-prep: Update deployment-db* IPs (T208101)]] (duration: 00m 47s)
[12:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:25] T208101: Migrate deployment-prep to eqiad1 - https://phabricator.wikimedia.org/T208101
[12:55:30] twentyafterfour: no problems this time
[12:55:37] thanks zeljkof!
[12:56:24] (03CR) 10Zfilipin: "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474899 (https://phabricator.wikimedia.org/T208101) (owner: 10Alex Monk) [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181120T1300) [13:00:16] 10Operations, 10Product-Analytics: Upload shiny-server .deb to our Stretch apt repository - https://phabricator.wikimedia.org/T168967 (10aborrero) Ping, this is blocking the parent task about Trusty deprecation. [13:00:49] (03PS1) 10Elukey: Add the Hadoop worker nodes' racking awareness config [puppet] - 10https://gerrit.wikimedia.org/r/474904 (https://phabricator.wikimedia.org/T209929) [13:01:08] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474899 (https://phabricator.wikimedia.org/T208101) (owner: 10Alex Monk) [13:01:19] 10Operations, 10DBA: BBU Fail on dbstore2002 - https://phabricator.wikimedia.org/T208320 (10Banyek) as the replication lag was 69663 seconds we agreed to set `innodb_flush_log_at_trx_commit=2;` on the host. Now the replication is catching up. [13:01:30] We are in the maintenance window of labsdb1011, I start to depool the host [13:01:49] banyek: just a minute, I have one last patch merging for eu swat [13:01:51] (03PS2) 10Elukey: Add the Hadoop worker nodes' racking awareness config [puppet] - 10https://gerrit.wikimedia.org/r/474904 (https://phabricator.wikimedia.org/T209929) [13:02:01] I am holding back my horses [13:02:09] thanks! 
[13:02:10] (03Merged) 10jenkins-bot: deployment-prep: Update parsoid09 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474899 (https://phabricator.wikimedia.org/T208101) (owner: 10Alex Monk) [13:03:24] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: [[gerrit:474899|deployment-prep: Update parsoid09 IP (T208101)]] (duration: 00m 47s) [13:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:28] T208101: Migrate deployment-prep to eqiad1 - https://phabricator.wikimedia.org/T208101 [13:03:36] !log EU SWAT finished [13:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:50] Amir1: sorry, swat took the entire window :( [13:04:00] no worries! [13:04:00] banyek: I'm done, thanks for waiting [13:04:07] thank you! [13:04:44] (03PS1) 10GTirloni: toolforge: Use profile::openstack::main::clientlib in clush::master [puppet] - 10https://gerrit.wikimedia.org/r/474906 (https://phabricator.wikimedia.org/T209701) [13:06:00] !log depooling labsdb1011 (T209517) [13:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:04] T209517: Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 [13:06:06] (03CR) 10GTirloni: [C: 032] toolforge: Use profile::openstack::main::clientlib in clush::master [puppet] - 10https://gerrit.wikimedia.org/r/474906 (https://phabricator.wikimedia.org/T209701) (owner: 10GTirloni) [13:06:47] (03CR) 10Banyek: [C: 032] wiki replicas: depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/474751 (https://phabricator.wikimedia.org/T209517) (owner: 10Bstorm) [13:06:49] (03CR) 10jenkins-bot: deployment-prep: Update parsoid09 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474899 (https://phabricator.wikimedia.org/T208101) (owner: 10Alex Monk) [13:07:23] (03PS2) 10Banyek: wiki replicas: depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/474751 
(https://phabricator.wikimedia.org/T209517) (owner: 10Bstorm) [13:07:27] (03CR) 10Banyek: [V: 032 C: 032] wiki replicas: depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/474751 (https://phabricator.wikimedia.org/T209517) (owner: 10Bstorm) [13:08:10] (03PS1) 10Elukey: Set hive.auto.convert.join to false in hive-site.xml [puppet/cdh] - 10https://gerrit.wikimedia.org/r/474907 (https://phabricator.wikimedia.org/T209536) [13:09:33] (03CR) 10Joal: [C: 031] "Thank you for that elukey :)" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/474907 (https://phabricator.wikimedia.org/T209536) (owner: 10Elukey) [13:10:01] (03CR) 10Elukey: [V: 032 C: 032] Set hive.auto.convert.join to false in hive-site.xml [puppet/cdh] - 10https://gerrit.wikimedia.org/r/474907 (https://phabricator.wikimedia.org/T209536) (owner: 10Elukey) [13:11:33] (03PS1) 10Elukey: Update cdh module to its latest sha [puppet] - 10https://gerrit.wikimedia.org/r/474908 [13:12:12] (03CR) 10Elukey: [C: 032] Update cdh module to its latest sha [puppet] - 10https://gerrit.wikimedia.org/r/474908 (owner: 10Elukey) [13:13:24] twentyafterfour, from the recoveries it sounds like that fixed some stuff [13:14:02] Krenair: yes that plus I just fixed scap on deployment-mediawiki-07 (removed the bogus mwdeploy user) [13:14:10] cool [13:14:23] indeed [13:15:05] for some reason deployment-cache-upload04 doesn't appear to be 100% happy [13:15:13] it's redirecting stuff to themselves [13:15:16] at least at the top level [13:17:37] Labsdb1011 got depooled, but we still have 2 long running queries there: [13:17:44] zeljkof: what banyek was doing didn't involve SWAT, as it is done via puppet [13:17:56] both running more than 9000 seconds, I kill them [13:18:17] marostegui: ah, good to know, I just wanted to make sure :) it was just a minute of overlap [13:18:31] Yep, it is also for banyek to know :-) [13:18:39] (The limit of long running queries is 14400 seconds normally) [13:19:24] marostegui: I knew they won't interfere, 
but even they happen separately I thought 'let's not overcomplicate this day, as it is only a few minutes' [13:19:37] sure :) [13:19:47] !log stop and upgrade db2082 [13:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:00] (03PS1) 10Ema: ATS: avoid verify_config for now [puppet] - 10https://gerrit.wikimedia.org/r/474909 (https://phabricator.wikimedia.org/T204225) [13:20:39] (03PS3) 10Andrew Bogott: Revert "Set default elasticsearch cluster name" [puppet] - 10https://gerrit.wikimedia.org/r/474878 (owner: 10Gehel) [13:20:43] (03PS2) 10Ema: ATS: avoid verify_config for now [puppet] - 10https://gerrit.wikimedia.org/r/474909 (https://phabricator.wikimedia.org/T204225) [13:22:00] (03CR) 10Andrew Bogott: [C: 032] Revert "Set default elasticsearch cluster name" [puppet] - 10https://gerrit.wikimedia.org/r/474878 (owner: 10Gehel) [13:22:19] (03PS2) 10Andrew Bogott: elasticsearch: cluster_hosts shoudl be FQDN, not just host names [puppet] - 10https://gerrit.wikimedia.org/r/474879 (owner: 10Gehel) [13:22:56] (03CR) 10Andrew Bogott: [C: 032] elasticsearch: cluster_hosts shoudl be FQDN, not just host names [puppet] - 10https://gerrit.wikimedia.org/r/474879 (owner: 10Gehel) [13:23:07] PROBLEM - MariaDB Slave IO: s8 on db2094 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2082.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2082.codfw.wmnet (111 Connection refused) [13:23:26] (03CR) 10Ema: [C: 032] ATS: avoid verify_config for now [puppet] - 10https://gerrit.wikimedia.org/r/474909 (https://phabricator.wikimedia.org/T204225) (owner: 10Ema) [13:23:33] (03PS3) 10Ema: ATS: avoid verify_config for now [puppet] - 10https://gerrit.wikimedia.org/r/474909 (https://phabricator.wikimedia.org/T204225) [13:25:11] ^that is me restarting its master [13:29:28] weird [13:29:40] on labsdb1011 there was only one package to upgrade: 
libmariadbclient18 [13:30:27] andrewbogott: thanks for the merges! Let's hope it actually fixes the issue! [13:30:49] it did! [13:30:54] kool! [13:31:03] took me forever to find it! [13:32:34] restarting labsdb1011 [13:35:37] (03PS1) 10Hashar: hhvm: fix typo in RUN_AS_GROUP [puppet] - 10https://gerrit.wikimedia.org/r/474910 (https://phabricator.wikimedia.org/T209946) [13:35:58] banyek: that is unlikely, as both the kernel and mariadb package had to be upgraded [13:36:35] maybe it was done beforehand? [13:37:40] but if yes, who ran the mysql_upgrade script? [13:38:56] labsdb1011 had mariadb 10.1.35, the latest version is 10.1.37 [13:39:07] RECOVERY - MariaDB Slave IO: s8 on db2094 is OK: OK slave_io_state Slave_IO_Running: Yes [13:39:14] jynus: I see 10.1.37 installed [13:39:24] I guess the full-upgrade was done by cloud team? [13:39:38] and it is pending the reboot + mysql restart + mysql_upgrade [13:39:39] ? [13:39:46] probably [13:39:59] I'll run mysql_upgrade anyway [13:40:19] (03PS2) 10Volans: Add administrative module [software/spicerack] - 10https://gerrit.wikimedia.org/r/473796 (https://phabricator.wikimedia.org/T205884) [13:40:21] (03PS9) 10Volans: Add Icinga module [software/spicerack] - 10https://gerrit.wikimedia.org/r/473506 (https://phabricator.wikimedia.org/T205884) [13:40:23] (03PS3) 10Volans: Add Puppet module [software/spicerack] - 10https://gerrit.wikimedia.org/r/473735 (https://phabricator.wikimedia.org/T205884) [13:40:35] https://phabricator.wikimedia.org/P7830 [13:41:11] (03CR) 10Volans: "all done, please accept the fact that I didn't split the rename of the username property to a separate commit." 
(033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473796 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [13:41:21] banyek jynus ^ [13:42:00] (03CR) 10Volans: "done" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473735 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [13:43:40] arturo: running full-upgrade on a database can be dangerous: it can be done, but has to be done properly to avoid risks [13:43:54] marostegui: yes, that was the only one there [13:44:05] e.g. different versions of the running server and its utilities can result in issues [13:44:45] banyek: did you run mysql_upgrade, was it run? [13:45:01] it is currently running [13:45:08] ack. I wasn't sure who was meant to run the upgrade [13:45:11] ok, so it was not run on upgrade [13:45:20] jynus: check my paste above, the mariadb package wasn't installed [13:45:34] I mean, it was, but when mysql was stopped, it was installed by banyek [13:45:42] (03CR) 10Hydriz: [C: 031] add redirects of various zh-yue projects to yue [puppet] - 10https://gerrit.wikimedia.org/r/474901 (https://phabricator.wikimedia.org/T209693) (owner: 10ArielGlenn) [13:46:15] I only installed the libmariadbclient18 [13:46:30] marostegui: based on your paste it was installed by arturo [13:46:33] as it was the only package that showed up as upgradable [13:46:48] yes, you guys are right, I read my own paste wrong :) [13:46:50] happily mysql_upgrade was not run, which could have created issues [13:47:40] arturo: the line is: we take care of the infrastructure, you take care of the "service" [13:47:56] jynus: but it is running currently [13:48:07] jynus: cool [13:48:09] banyek: but that is fine [13:48:18] arturo: we are ok with you upgrading it, but some precautions have to be taken [13:48:25] (just for the record. ;) [14:04:36] PROBLEM - Check systemd state on ms-be2021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[14:06:32] (03PS14) 10Elukey: Introduce new security directives for Yarn/HDFS/MapReduce/Hive/Oozie [puppet/cdh] - 10https://gerrit.wikimedia.org/r/474113 [14:09:52] (03PS3) 10Ayounsi: Bird anycast DNS, add BFD multihop support [puppet] - 10https://gerrit.wikimedia.org/r/474819 [14:10:41] (03PS1) 10Hashar: hhvm: add basic specs [puppet] - 10https://gerrit.wikimedia.org/r/474915 [14:13:51] 10Operations, 10Beta-Cluster-Infrastructure, 10HHVM, 10Patch-For-Review: hhvm systemd service on deployment-prep reports: hhvm.service: Ignoring invalid environment assignment 'RUN_AS_GROUP=www-data - https://phabricator.wikimedia.org/T209946 (10hashar) + #operations since that affects production as well (... [14:15:33] (03PS3) 10Alex Monk: librenms: Use certcentral cert [puppet] - 10https://gerrit.wikimedia.org/r/474743 (https://phabricator.wikimedia.org/T209856) [14:15:44] (03CR) 10Alex Monk: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/474743 (https://phabricator.wikimedia.org/T209856) (owner: 10Alex Monk) [14:16:23] (03CR) 10jerkins-bot: [V: 04-1] librenms: Use certcentral cert [puppet] - 10https://gerrit.wikimedia.org/r/474743 (https://phabricator.wikimedia.org/T209856) (owner: 10Alex Monk) [14:16:51] (03CR) 10jerkins-bot: [V: 04-1] librenms: Use certcentral cert [puppet] - 10https://gerrit.wikimedia.org/r/474743 (https://phabricator.wikimedia.org/T209856) (owner: 10Alex Monk) [14:17:04] RECOVERY - very high load average likely xfs on ms-be2045 is OK: OK - load average: 0.00, 4.65, 77.71 [14:17:10] RECOVERY - very high load average likely xfs on ms-be2044 is OK: OK - load average: 0.22, 4.98, 78.74 [14:17:56] RECOVERY - very high load average likely xfs on ms-be2049 is OK: OK - load average: 0.06, 4.91, 78.71 [14:18:08] RECOVERY - very high load average likely xfs on ms-be2048 is OK: OK - load average: 0.16, 4.40, 76.09 [14:18:10] RECOVERY - very high load average likely xfs on ms-be2046 is OK: OK - load average: 0.15, 4.23, 75.05 [14:18:38] (03PS4) 10Alex 
Monk: librenms: Use certcentral cert [puppet] - 10https://gerrit.wikimedia.org/r/474743 (https://phabricator.wikimedia.org/T209856) [14:18:38] RECOVERY - very high load average likely xfs on ms-be2050 is OK: OK - load average: 0.18, 4.50, 76.35 [14:21:57] (03PS1) 10Hashar: hhvm: test default file generation [puppet] - 10https://gerrit.wikimedia.org/r/474917 (https://phabricator.wikimedia.org/T209946) [14:22:18] 10Operations, 10Beta-Cluster-Infrastructure, 10HHVM, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): hhvm systemd service on deployment-prep reports: hhvm.service: Ignoring invalid environment assignment 'RUN_AS_GROUP=www-data - https://phabricator.wikimedia.org/T209946 (10hashar) a:03hashar [14:23:15] mysql upgrade process finished on labsdb1011, now I'll get it repooled soon [14:23:23] (03CR) 10Hashar: "Spotted on deployment-prep. The two child changes add rspec and a spec testing generation of /etc/default/hhvm" [puppet] - 10https://gerrit.wikimedia.org/r/474910 (https://phabricator.wikimedia.org/T209946) (owner: 10Hashar) [14:23:26] (03CR) 10Elukey: [C: 032] Introduce new security directives for Yarn/HDFS/MapReduce/Hive/Oozie [puppet/cdh] - 10https://gerrit.wikimedia.org/r/474113 (owner: 10Elukey) [14:24:07] (03CR) 10Hashar: "Compile all three classes. I had to add several tweaks to the spec_helper. Works for me locally on stretch and CI is apparently happy." 
[puppet] - 10https://gerrit.wikimedia.org/r/474915 (owner: 10Hashar) [14:24:36] (03CR) 10Hashar: "Covers the typo in RUN_AS_GROUP which I have fixed with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/474910/" [puppet] - 10https://gerrit.wikimedia.org/r/474917 (https://phabricator.wikimedia.org/T209946) (owner: 10Hashar) [14:25:57] (03PS2) 10Elukey: Allow extra security parameters for hadoop/oozie/hive profiles [puppet] - 10https://gerrit.wikimedia.org/r/474856 [14:32:10] !log stop and upgrade db2033 [14:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:29] (03CR) 10Volans: "Minor nitpick inline. Regarding the failing grammar I've found the issue, let's talk offline about it." (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) (owner: 10CRusnov) [14:37:12] (03PS13) 10Gehel: profile::maps::osm_master: change osmupdater and osmimporter auth method to peer [puppet] - 10https://gerrit.wikimedia.org/r/466574 (https://phabricator.wikimedia.org/T206639) (owner: 10Mathew.onipe) [14:37:33] (03CR) 10Filippo Giunchedi: [C: 032] Remove server-board [puppet] - 10https://gerrit.wikimedia.org/r/474855 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [14:37:52] (03CR) 10Filippo Giunchedi: [C: 032] Remove PyBalStateCollector [puppet] - 10https://gerrit.wikimedia.org/r/474865 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [14:41:04] (03CR) 10Vgutierrez: [C: 031] "pcc seems to be happy: https://puppet-compiler.wmflabs.org/compiler1002/13608/" [puppet] - 10https://gerrit.wikimedia.org/r/474743 (https://phabricator.wikimedia.org/T209856) (owner: 10Alex Monk) [14:41:33] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 25835 MB (5% inode=99%) [14:42:11] (03CR) 10Gehel: [C: 032] profile::maps::osm_master: change osmupdater and osmimporter auth method to peer [puppet] - 10https://gerrit.wikimedia.org/r/466574 
(https://phabricator.wikimedia.org/T206639) (owner: 10Mathew.onipe) [14:43:11] !log puppet temp disable on es2001 for data transfer work [14:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:26] (03PS7) 10Gehel: elasticsearch: create multiple elasticsearch instances on cirrus codfw [puppet] - 10https://gerrit.wikimedia.org/r/473258 (https://phabricator.wikimedia.org/T207918) [14:43:36] (03PS3) 10Elukey: Allow extra security parameters for hadoop/oozie/hive profiles [puppet] - 10https://gerrit.wikimedia.org/r/474856 [14:45:31] (03CR) 10Gehel: [C: 032] elasticsearch: create multiple elasticsearch instances on cirrus codfw [puppet] - 10https://gerrit.wikimedia.org/r/473258 (https://phabricator.wikimedia.org/T207918) (owner: 10Gehel) [14:53:06] RECOVERY - Long running screen/tmux on cp2021 is OK: OK: No SCREEN or tmux processes detected. [14:53:14] (03PS2) 10Giuseppe Lavagetto: mediawiki: allow serving content from php7 everywhere [puppet] - 10https://gerrit.wikimedia.org/r/471232 (https://phabricator.wikimedia.org/T206338) [14:53:23] <_joe_> elukey: I'm merging this ^^ [14:53:50] (03CR) 10Ayounsi: [C: 032] Bird anycast DNS, add BFD multihop support [puppet] - 10https://gerrit.wikimedia.org/r/474819 (owner: 10Ayounsi) [14:54:08] (03PS4) 10Ayounsi: Bird anycast DNS, add BFD multihop support [puppet] - 10https://gerrit.wikimedia.org/r/474819 [14:54:22] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: allow serving content from php7 everywhere [puppet] - 10https://gerrit.wikimedia.org/r/471232 (https://phabricator.wikimedia.org/T206338) (owner: 10Giuseppe Lavagetto) [14:54:39] <_joe_> sorry :P [14:55:15] !log libthumbor_1.3.2-0+wmf1+stretch1 uploaded to stretch-wikimedia T209886 [14:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:17] T209886: Assess Thumbor upgrade options - https://phabricator.wikimedia.org/T209886 [14:56:02] (03PS5) 10Ayounsi: Bird anycast DNS, add BFD multihop support 
[puppet] - 10https://gerrit.wikimedia.org/r/474819 [14:56:43] (03PS3) 10Muehlenhoff: Remove server-board [puppet] - 10https://gerrit.wikimedia.org/r/474855 (https://phabricator.wikimedia.org/T183454) [14:58:28] 10Operations, 10Thumbor, 10Performance-Team (Radar), 10User-jijiki: Assess Thumbor upgrade options - https://phabricator.wikimedia.org/T209886 (10jijiki) [14:58:51] (03CR) 10Ayounsi: [V: 032 C: 032] Bird anycast DNS, add BFD multihop support [puppet] - 10https://gerrit.wikimedia.org/r/474819 (owner: 10Ayounsi) [14:59:00] (03PS6) 10Ayounsi: Bird anycast DNS, add BFD multihop support [puppet] - 10https://gerrit.wikimedia.org/r/474819 [14:59:34] (03CR) 10Ayounsi: [V: 032 C: 032] Bird anycast DNS, add BFD multihop support [puppet] - 10https://gerrit.wikimedia.org/r/474819 (owner: 10Ayounsi) [14:59:46] XioNoX: don't do V+2 [15:00:47] paravoid: I rebased it several times in the last few minutes and the time jenkins validate it, someone else already submited a new change [15:01:21] (03PS1) 10Muehlenhoff: Also drop the dashboard definition for server-board [puppet] - 10https://gerrit.wikimedia.org/r/474920 [15:01:23] and each time jenkins was doing V+2 [15:01:47] and this time as well [15:02:15] (03CR) 10Muehlenhoff: [C: 032] Also drop the dashboard definition for server-board [puppet] - 10https://gerrit.wikimedia.org/r/474920 (owner: 10Muehlenhoff) [15:02:51] !log Add BFD multihop support to Bird anycast DNS [15:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:57] (03CR) 10Gehel: [C: 031] "LGTM" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473796 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [15:03:17] (03PS4) 10Elukey: Allow extra security parameters for hadoop/oozie/hive profiles [puppet] - 10https://gerrit.wikimedia.org/r/474856 [15:06:41] (03CR) 10Gehel: [C: 031] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/473735 (https://phabricator.wikimedia.org/T205884) 
(owner: 10Volans) [15:08:55] 10Operations, 10ops-codfw: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10Papaul) a:03Papaul [15:11:00] RECOVERY - Disk space on elastic1025 is OK: DISK OK [15:11:48] (03Abandoned) 10Mathew.onipe: maps: change nodes.bin owner to osmupdater [puppet] - 10https://gerrit.wikimedia.org/r/474680 (https://phabricator.wikimedia.org/T209569) (owner: 10Mathew.onipe) [15:12:08] !log enable bfd traceoptions on cr1-codfw [15:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:26] (03PS1) 10Muehlenhoff: Disable Diamond on Graphite hosts [puppet] - 10https://gerrit.wikimedia.org/r/474922 (https://phabricator.wikimedia.org/T183454) [15:15:55] Cron /usr/local/bin/elasticsearch_hot_threads_logger.py a lot of cronspam, can someone look? I am in a meeting [15:16:51] (03PS1) 10GTirloni: labpuppetmaster: Resolve .wmflabs addresses [puppet] - 10https://gerrit.wikimedia.org/r/474923 (https://phabricator.wikimedia.org/T177959) [15:16:56] apergos: that's expected, nothing to worry about [15:17:06] ok but it is 50 messages every 5 minutes [15:17:17] gehel: ^^ [15:17:18] so if it could be made a little quieter that would be nice [15:17:21] yeah, I complained about the same :) [15:17:26] thanks! 
[15:17:42] apergos, vgutierrez yep, my fault, I'm on it [15:17:48] great [15:17:59] and apologies for the noise [15:19:03] (03PS5) 10Elukey: Allow extra security parameters for hadoop/oozie/hive profiles [puppet] - 10https://gerrit.wikimedia.org/r/474856 [15:19:14] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/13613/" [puppet] - 10https://gerrit.wikimedia.org/r/474856 (owner: 10Elukey) [15:20:08] PROBLEM - Elasticsearch HTTPS for production-search-psi-codfw on elastic2015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:20:10] PROBLEM - Elasticsearch HTTPS for production-search-psi-codfw on elastic2027 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:20:10] PROBLEM - Elasticsearch HTTPS for production-search-psi-codfw on elastic2035 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:20:50] !log installing libopenmpt security updates [15:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:58] (03CR) 10Andrew Bogott: [C: 031] labpuppetmaster: Resolve .wmflabs addresses [puppet] - 10https://gerrit.wikimedia.org/r/474923 (https://phabricator.wikimedia.org/T177959) (owner: 10GTirloni) [15:21:20] (03PS2) 10Andrew Bogott: openstack: remove out of date deployment-cache-upload04 IPs [puppet] - 10https://gerrit.wikimedia.org/r/474891 (https://phabricator.wikimedia.org/T208101) (owner: 10Alex Monk) [15:22:00] PROBLEM - Elasticsearch HTTPS for production-search-psi-codfw on elastic2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:22:03] (03CR) 10Andrew Bogott: [C: 032] openstack: remove out of date deployment-cache-upload04 IPs [puppet] - 10https://gerrit.wikimedia.org/r/474891 (https://phabricator.wikimedia.org/T208101) (owner: 10Alex Monk) [15:22:25] (03PS1) 10Banyek: wmf-pt-kill: disabling ssl connection [debs/wmf-pt-kill] - 
10https://gerrit.wikimedia.org/r/474924 [15:22:39] (03CR) 10Volans: [C: 032] Add administrative module (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/473796 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [15:23:50] PROBLEM - Elasticsearch HTTPS for production-search-psi-codfw on elastic2010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:23:50] PROBLEM - Elasticsearch HTTPS for production-search-psi-codfw on elastic2030 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:23:50] PROBLEM - Elasticsearch HTTPS for production-search-psi-codfw on elastic2013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:23:50] PROBLEM - Elasticsearch HTTPS for production-search-psi-codfw on elastic2019 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:23:50] PROBLEM - Elasticsearch HTTPS for production-search-psi-codfw on elastic2022 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:24:01] (03Merged) 10jenkins-bot: Add administrative module [software/spicerack] - 10https://gerrit.wikimedia.org/r/473796 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [15:24:16] gehel: expected? 
^^^^ [15:24:45] onimisionipe ^^ [15:24:59] volans: it's "expected" [15:25:08] 10Operations, 10ops-codfw, 10DC-Ops: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T208838 (10jcrespo) 05Open>03Resolved [15:25:16] dcausse: ack, thanks [15:25:25] volans: expected, maybe not, but those are new instances, silencing now [15:25:26] gehel is on it (it's a new unused cluster) [15:25:39] yeah I thought it was the multi-instance one [15:26:18] PROBLEM - Bird Internet Routing Daemon on dns2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird [15:26:23] (03CR) 10Jcrespo: [C: 031] wmf-pt-kill: disabling ssl connection [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/474924 (owner: 10Banyek) [15:26:24] PROBLEM - Check systemd state on dns2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:26:36] thanks [15:27:24] (03PS1) 10Bstorm: sonofgridengine: one last bug -- works now [puppet] - 10https://gerrit.wikimedia.org/r/474925 (https://phabricator.wikimedia.org/T200557) [15:27:27] ema, vgutierrez: bird: /etc/bird/bird.conf, line 6: Unable to open log file `/tmp/bird-debug.log': Permission denied [15:27:34] on dns2001 [15:27:42] uh [15:27:45] (03PS1) 10Gehel: elasticsearch: only configured filtered instances on elastic nodes [puppet] - 10https://gerrit.wikimedia.org/r/474926 [15:27:48] (03CR) 10Banyek: [C: 032] wmf-pt-kill: disabling ssl connection [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/474924 (owner: 10Banyek) [15:27:51] (03CR) 10Banyek: [V: 032 C: 032] wmf-pt-kill: disabling ssl connection [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/474924 (owner: 10Banyek) [15:27:55] actually XioNoX [15:27:57] ^^^ [15:28:05] I see you merged something related [15:28:31] volans: yeah, I'm working on it [15:28:31] thx [15:29:32] RECOVERY - Bird Internet Routing Daemon on dns2001 is OK: PROCS OK: 1 process with command name bird [15:29:40] RECOVERY - Check systemd state on dns2001 is OK: OK - 
running: The system is fully operational [15:29:43] (03PS2) 10Gehel: elasticsearch: only configured filtered instances on elastic nodes [puppet] - 10https://gerrit.wikimedia.org/r/474926 [15:30:09] (03CR) 10Vgutierrez: [C: 032] librenms: Use certcentral cert [puppet] - 10https://gerrit.wikimedia.org/r/474743 (https://phabricator.wikimedia.org/T209856) (owner: 10Alex Monk) [15:30:18] (03PS5) 10Vgutierrez: librenms: Use certcentral cert [puppet] - 10https://gerrit.wikimedia.org/r/474743 (https://phabricator.wikimedia.org/T209856) (owner: 10Alex Monk) [15:30:36] (03CR) 10DCausse: elasticsearch: only configured filtered instances on elastic nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/474926 (owner: 10Gehel) [15:31:41] (03CR) 10Bstorm: [C: 032] sonofgridengine: one last bug -- works now [puppet] - 10https://gerrit.wikimedia.org/r/474925 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [15:32:17] (03PS3) 10Gehel: elasticsearch: only configured filtered instances on elastic nodes [puppet] - 10https://gerrit.wikimedia.org/r/474926 [15:32:35] (03CR) 10Gehel: elasticsearch: only configured filtered instances on elastic nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/474926 (owner: 10Gehel) [15:32:46] (03PS6) 10Vgutierrez: librenms: Use certcentral cert [puppet] - 10https://gerrit.wikimedia.org/r/474743 (https://phabricator.wikimedia.org/T209856) (owner: 10Alex Monk) [15:32:51] (03PS4) 10Gehel: elasticsearch: only configured filtered instances on elastic nodes [puppet] - 10https://gerrit.wikimedia.org/r/474926 [15:34:32] (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/474795 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [15:35:00] (03CR) 10Gehel: [C: 032] elasticsearch: only configured filtered instances on elastic nodes [puppet] - 10https://gerrit.wikimedia.org/r/474926 (owner: 10Gehel) [15:35:09] (03PS5) 10Gehel: elasticsearch: only configured filtered instances on elastic nodes 
[puppet] - 10https://gerrit.wikimedia.org/r/474926 [15:35:14] (03CR) 10Gehel: [V: 032 C: 032] elasticsearch: only configured filtered instances on elastic nodes [puppet] - 10https://gerrit.wikimedia.org/r/474926 (owner: 10Gehel) [15:35:47] (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/474795 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [15:36:12] (03CR) 10Andrew Bogott: [C: 031] "This seems safe to me." [puppet] - 10https://gerrit.wikimedia.org/r/474795 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [15:36:23] !log add test term allow BFD multihop on cr1-codfw loopback4 filter [15:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:19] !log switching to certcentral managed TLS certificate for librenms.wikimedia.org - T209856 [15:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:22] T209856: Deploy a certcentral managed TLS certificate for librenms - https://phabricator.wikimedia.org/T209856 [15:39:09] (03CR) 10Ppchelko: [C: 031] RunSingleJob: Check that JobExecutor has been loaded [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474885 (https://phabricator.wikimedia.org/T208922) (owner: 10Mobrovac) [15:39:47] alex@alex-laptop:~$ openssl s_client -connect librenms.wikimedia.org:443 2>&1 | openssl x509 -noout -text | grep "Not Before:" [15:39:47] Not Before: Nov 19 15:50:45 2018 GMT [15:40:12] vgutierrez, ^ [15:40:14] that's the new one right? 
[15:40:21] indeed [15:41:14] !log uploaded wmf-pt-kill_2.2.20-1+wmf5 packages to stretch-wikimedia (T209517) [15:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:18] T209517: Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 [15:44:16] ok, all green [15:44:27] the maintenance completed, I repool labsdb1011 [15:45:22] (03PS1) 10Banyek: Revert "wiki replicas: depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/474929 [15:48:37] (03PS2) 10Alex Monk: librenms: Remove old letsencrypt puppetisation cert [puppet] - 10https://gerrit.wikimedia.org/r/474747 (https://phabricator.wikimedia.org/T209856) [15:49:48] (03PS1) 10Muehlenhoff: Remove Diamond from restbase servers [puppet] - 10https://gerrit.wikimedia.org/r/474930 (https://phabricator.wikimedia.org/T183454) [15:50:11] (03CR) 10Banyek: [C: 032] Revert "wiki replicas: depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/474929 (owner: 10Banyek) [15:50:29] (03PS2) 10Banyek: Revert "wiki replicas: depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/474929 [15:50:32] (03CR) 10Banyek: [V: 032 C: 032] Revert "wiki replicas: depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/474929 (owner: 10Banyek) [15:51:42] !log repooling labsdb1011 (T209517) [15:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:46] T209517: Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 [15:53:29] (03PS1) 10Jcrespo: mariadb: Repool db2071, depool db2072 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474931 [15:53:42] (03PS2) 10Alexandros Kosiaris: Introduce zoterov2 LVS IPs [dns] - 10https://gerrit.wikimedia.org/r/473727 (https://phabricator.wikimedia.org/T201611) [15:54:55] (03CR) 10Volans: [C: 032] Add Icinga module [software/spicerack] - 10https://gerrit.wikimedia.org/r/473506 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [15:54:57] (03PS3) 
10Alexandros Kosiaris: Introduce zoterov2 LVS IPs [dns] - 10https://gerrit.wikimedia.org/r/473727 (https://phabricator.wikimedia.org/T201611) [15:55:47] (03PS1) 10Jcrespo: mariadb: Repool db2072 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474932 [15:55:49] (03CR) 10Vgutierrez: "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1002/13617/" [puppet] - 10https://gerrit.wikimedia.org/r/474747 (https://phabricator.wikimedia.org/T209856) (owner: 10Alex Monk) [15:56:07] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db2071, depool db2072 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474931 (owner: 10Jcrespo) [15:56:12] (03CR) 10Cwhite: [C: 031] Disable Diamond on Graphite hosts [puppet] - 10https://gerrit.wikimedia.org/r/474922 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [15:56:20] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce zoterov2 LVS IPs [dns] - 10https://gerrit.wikimedia.org/r/473727 (https://phabricator.wikimedia.org/T201611) (owner: 10Alexandros Kosiaris) [15:56:26] (03Merged) 10jenkins-bot: Add Icinga module [software/spicerack] - 10https://gerrit.wikimedia.org/r/473506 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [15:56:58] (03CR) 10Cwhite: [C: 031] Remove Diamond from restbase servers [puppet] - 10https://gerrit.wikimedia.org/r/474930 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [15:57:07] * banyek afk a bit [15:57:12] (03CR) 10Volans: [C: 032] Add Puppet module [software/spicerack] - 10https://gerrit.wikimedia.org/r/473735 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [15:57:20] (03CR) 10Cwhite: [C: 032] Remove absented Diamond collector for NTP [puppet] - 10https://gerrit.wikimedia.org/r/474887 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [15:57:41] (03Merged) 10jenkins-bot: mariadb: Repool db2071, depool db2072 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474931 (owner: 
10Jcrespo) [15:57:48] (03CR) 10Cwhite: [C: 031] Remove ApacheStatusSimpleCollector [puppet] - 10https://gerrit.wikimedia.org/r/474864 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [15:59:04] (03CR) 10Vgutierrez: [C: 032] librenms: Remove old letsencrypt puppetisation cert [puppet] - 10https://gerrit.wikimedia.org/r/474747 (https://phabricator.wikimedia.org/T209856) (owner: 10Alex Monk) [15:59:07] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2071, depool db2072 (duration: 00m 47s) [15:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:12] (03PS3) 10Vgutierrez: librenms: Remove old letsencrypt puppetisation cert [puppet] - 10https://gerrit.wikimedia.org/r/474747 (https://phabricator.wikimedia.org/T209856) (owner: 10Alex Monk) [15:59:29] (03Merged) 10jenkins-bot: Add Puppet module [software/spicerack] - 10https://gerrit.wikimedia.org/r/473735 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [16:02:22] (03CR) 10jenkins-bot: mariadb: Repool db2071, depool db2072 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474931 (owner: 10Jcrespo) [16:02:37] 10Operations: SRE quarterly goal: Ability to serve a fraction of the production traffic from PHP7 - https://phabricator.wikimedia.org/T206336 (10Joe) [16:02:42] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review: Allow directing users to PHP7 based on a cookie - https://phabricator.wikimedia.org/T206338 (10Joe) 05Open>03Resolved [16:04:19] (03PS1) 10Elukey: Remove extra '.xml' from ssl-(client|server).xml.erb files [puppet/cdh] - 10https://gerrit.wikimedia.org/r/474935 [16:04:48] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 28492 MB (5% inode=99%) [16:04:52] (03CR) 10Elukey: [V: 032 C: 032] Remove extra '.xml' from ssl-(client|server).xml.erb files [puppet/cdh] - 10https://gerrit.wikimedia.org/r/474935 
(owner: 10Elukey) [16:05:05] (03CR) 10Alexandros Kosiaris: "This is applied on 2 hosts in production, releases1001 and releases2001 on which intuitively I see no reason for docker." [puppet] - 10https://gerrit.wikimedia.org/r/474825 (https://phabricator.wikimedia.org/T208529) (owner: 10Thcipriani) [16:05:36] 10Operations, 10Traffic: Migrate most standard public TLS certificates to CertCentral issuance - https://phabricator.wikimedia.org/T207050 (10Vgutierrez) [16:05:42] 10Operations, 10Certcentral, 10Traffic: Deploy a certcentral managed TLS certificate for librenms - https://phabricator.wikimedia.org/T209856 (10Vgutierrez) 05Open>03Resolved [16:06:13] vgutierrez, so what next? [16:06:25] tendril? netbox? [16:07:17] (03PS1) 10Elukey: Update cdh submodule to latest sha [puppet] - 10https://gerrit.wikimedia.org/r/474937 [16:07:51] (03CR) 10Muehlenhoff: [C: 031] nginx: remove diamond::collector::nginx reference [puppet/nginx] - 10https://gerrit.wikimedia.org/r/474309 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [16:10:44] Krenair: so, I'm about to commit the netbox config for certcentral [16:10:49] !log stop and upgrade db2072 [16:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:51] (03CR) 10Thcipriani: "> This is applied on 2 hosts in production, releases1001 and" [puppet] - 10https://gerrit.wikimedia.org/r/474825 (https://phabricator.wikimedia.org/T208529) (owner: 10Thcipriani) [16:10:58] (03CR) 10Elukey: [C: 032] Update cdh submodule to latest sha [puppet] - 10https://gerrit.wikimedia.org/r/474937 (owner: 10Elukey) [16:12:08] (03CR) 10Cwhite: [C: 032] nginx: remove diamond::collector::nginx reference [puppet/nginx] - 10https://gerrit.wikimedia.org/r/474309 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [16:12:30] (03PS1) 10Vgutierrez: certcentral: Provide a TLS certificate for netbox [puppet] - 10https://gerrit.wikimedia.org/r/474939 (https://phabricator.wikimedia.org/T207050) [16:12:33] 
10Operations, 10ops-codfw: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10Papaul) @fgiunchedi Here are the recommendations from Dell. Dell support didn't find any HW error on the server while looking a the TSR log. He recommended that we clear the log and do some firmware updates... [16:12:41] (03Merged) 10jenkins-bot: nginx: remove diamond::collector::nginx reference [puppet/nginx] - 10https://gerrit.wikimedia.org/r/474309 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [16:13:45] 10Operations, 10Release Pipeline, 10Release-Engineering-Team: Design pipeline image versioning scheme - https://phabricator.wikimedia.org/T209088 (10akosiaris) >>! In T209088#4759458, @thcipriani wrote: >>>! In T209088#4745829, @akosiaris wrote: >> I think we should support multiple tags per image (docker an... [16:14:24] PROBLEM - MariaDB Slave IO: s1 on db2094 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2072.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2072.codfw.wmnet (111 Connection refused) [16:14:46] (03CR) 10Jforrester: Add federation configs for beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474889 (https://phabricator.wikimedia.org/T204748) (owner: 10Ladsgroup) [16:14:48] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/474760 (https://phabricator.wikimedia.org/T205852) (owner: 10Herron) [16:14:57] vgutierrez: anything I should do for the netbox certcentral stuff? [16:15:05] that is me. 
see log^ [16:15:41] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Maps: Review Elastic/maps Grafana dashboards - https://phabricator.wikimedia.org/T209812 (10Mathew.onipe) [16:16:07] volans: not at this stage [16:16:48] 10Operations, 10ops-codfw: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10Papaul) Case information CPU1 - PE R740XD |Warranty ProSupport | SR 982809390 | [16:17:32] 10Operations, 10Maps, 10Discovery-Search (Current work), 10Patch-For-Review: Disable proxy for beta cluster in maps - https://phabricator.wikimedia.org/T209570 (10Gehel) 05Open>03Resolved Done, deployed, and tested [16:18:48] (03PS1) 10Cwhite: nginx: use latest commit [puppet] - 10https://gerrit.wikimedia.org/r/474940 (https://phabricator.wikimedia.org/T183454) [16:19:22] (03PS2) 10Cwhite: nginx: use latest commit [puppet] - 10https://gerrit.wikimedia.org/r/474940 (https://phabricator.wikimedia.org/T183454) [16:22:27] (03PS1) 10Vgutierrez: netbox: Deploy the TLS certificate managed by certcentral [puppet] - 10https://gerrit.wikimedia.org/r/474941 (https://phabricator.wikimedia.org/T207050) [16:22:33] (03PS1) 10GTirloni: toolforge: unbreak 'exec' group in clush [puppet] - 10https://gerrit.wikimedia.org/r/474942 (https://phabricator.wikimedia.org/T209701) [16:22:39] <_joe_> shdubsh: use a more descriptive commit message :) [16:23:12] <_joe_> also: yes, git submodules suck [16:23:12] PROBLEM - MariaDB Slave Lag: s1 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 721.47 seconds [16:23:34] ^will go away soon [16:23:36] (03PS2) 10Cwhite: Remove absented Diamond collector for NTP [puppet] - 10https://gerrit.wikimedia.org/r/474887 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [16:23:46] 10Operations, 10Maps, 10Patch-For-Review: Switch to unix socket connections for osmupdater / osmimporter for postgresql on maps - https://phabricator.wikimedia.org/T206639 (10Mathew.onipe) [16:23:49] 10Operations, 10Maps, 10Patch-For-Review: 
Make nodes.bin cache file writable by osmupdater after it is created by osmimporter - https://phabricator.wikimedia.org/T209569 (10Mathew.onipe) 05Open>03Resolved [16:24:14] 10Operations, 10Maps, 10Discovery-Search (Current work), 10Patch-For-Review: Update SQL location script for osm-initial-import - https://phabricator.wikimedia.org/T209566 (10Mathew.onipe) 05Open>03Resolved [16:24:40] (03Abandoned) 10Alexandros Kosiaris: DNM: Dummy testing change [puppet] - 10https://gerrit.wikimedia.org/r/402861 (owner: 10Alexandros Kosiaris) [16:24:49] (03PS1) 10Elukey: Fix variable in container-executor.cfg.erb [puppet/cdh] - 10https://gerrit.wikimedia.org/r/474943 [16:26:13] (03CR) 10Elukey: [V: 032 C: 032] Fix variable in container-executor.cfg.erb [puppet/cdh] - 10https://gerrit.wikimedia.org/r/474943 (owner: 10Elukey) [16:27:10] RECOVERY - Disk space on elastic1025 is OK: DISK OK [16:27:32] (03PS1) 10Elukey: Update the cdh module to its latest sha [puppet] - 10https://gerrit.wikimedia.org/r/474945 [16:27:57] (03CR) 10Elukey: [V: 032 C: 032] Update the cdh module to its latest sha [puppet] - 10https://gerrit.wikimedia.org/r/474945 (owner: 10Elukey) [16:28:56] (03PS2) 10Zoranzoki21: Enable autopatroller, patroller and rollbacker rights on srwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472744 [16:31:00] (03PS3) 10Zoranzoki21: Enable RCPatrol and add some rights on srwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472744 (https://phabricator.wikimedia.org/T209250) [16:31:38] (03Abandoned) 10Zoranzoki21: Enable RCPatrol on srwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472917 (https://phabricator.wikimedia.org/T209250) (owner: 10Zoranzoki21) [16:39:32] Labsdb1006 maintenance will be 'with announcement'; do I need any special announcement or just the SAL one, and cloud handles the others?
[16:45:39] (03PS2) 10Herron: logstash: add type "apache-error" and use logstash core patterns [puppet] - 10https://gerrit.wikimedia.org/r/474813 (https://phabricator.wikimedia.org/T205852) [16:45:51] (03CR) 10Vgutierrez: [C: 032] certcentral: Provide a TLS certificate for netbox [puppet] - 10https://gerrit.wikimedia.org/r/474939 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [16:46:02] (03PS2) 10Vgutierrez: certcentral: Provide a TLS certificate for netbox [puppet] - 10https://gerrit.wikimedia.org/r/474939 (https://phabricator.wikimedia.org/T207050) [16:48:51] 10Operations, 10ops-codfw: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10Papaul) Log cleared BIOS, IDRAC, CPLD updated just after this a reboot i get ms-be2047 login: [ 411.875775] mce: [Hardware Error]: CPU 17: Machine Check Exception: 5 Bank 14: b000000000020405 [ 411.88... [16:49:09] arturo^ bstrom_ ^ ? [16:49:35] bstorm_ ^^ [16:49:59] banyek: in a meeting right now. The announcement is for final users (i.e, CloudVPS/toolforge users) and brooke already sent those [16:50:18] ok, thanks I'll keep logging here then [16:52:04] RECOVERY - MariaDB Slave IO: s1 on db2094 is OK: OK slave_io_state Slave_IO_Running: Yes [16:52:45] it didn't want to restart I had to force a powercycle [16:53:11] !log rollback all BFD tests on cr1-codfw [16:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:44] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db2072 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474932 (owner: 10Jcrespo) [16:55:52] 10Operations, 10SRE-Access-Requests: Requesting access to Jupyter notebook / analytics-privatedata-users for jgleeson - https://phabricator.wikimedia.org/T208432 (10RobH) a:03XenoRyet Please note I'm triaging this as part of clinic duty. In review, it seems that all criteria have been met, save one. We nee... 
[16:56:05] (03Merged) 10jenkins-bot: mariadb: Repool db2072 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474932 (owner: 10Jcrespo) [16:56:07] (03CR) 10Filippo Giunchedi: logstash: add type "apache-error" and use logstash core patterns (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/474813 (https://phabricator.wikimedia.org/T205852) (owner: 10Herron) [16:56:38] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Add Mukunda to releasers-mediawiki - https://phabricator.wikimedia.org/T209176 (10RobH) 05Open>03Resolved a:03RobH Please note this went live last week. I'm resolving this task. If it turns out this has issues, simply reopen the task. [16:56:50] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 28688 MB (5% inode=99%) [16:57:51] (03PS5) 10CRusnov: Make the puppetdb backend process primitive types for queries. [software/cumin] - 10https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) [16:58:42] RECOVERY - MariaDB Slave Lag: s1 on db2094 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [16:59:02] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2072 (duration: 00m 46s) [16:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:04] godog and _joe_: It is that lovely time of the day again! You are hereby commanded to deploy Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181120T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:00:34] 10Operations, 10Maps, 10Traffic, 10Reading-Infrastructure-Team-Backlog (Kanban): Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732 (10Mholloway) @Gehel, do you mean specifically the additional EventBus load generated by activating resource_change events sent from Til... 
[17:00:40] (03CR) 10jerkins-bot: [V: 04-1] Make the puppetdb backend process primitive types for queries. [software/cumin] - 10https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) (owner: 10CRusnov) [17:00:59] (03CR) 10CRusnov: "Latest patch set tests 100% correct thanks to offline discussions. Also fixes imports." (032 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) (owner: 10CRusnov) [17:04:48] 10Operations, 10Research-Programs, 10SRE-Access-Requests, 10Epic: Server Access for 3 formal collaborators - https://phabricator.wikimedia.org/T209298 (10RobH) In the future, please file each task for each user individually. This could very quickly become difficult to keep sorted, and now all 3 folks acce... [17:05:19] 10Operations, 10Research-Programs, 10SRE-Access-Requests, 10Epic: Server Access for 3 formal collaborators - https://phabricator.wikimedia.org/T209298 (10RobH) [17:05:56] 10Operations, 10Research-Programs, 10SRE-Access-Requests, 10Epic: access to analytics-privatedata-users for @toddleroux, @Afandian, & @RyanSteinberg - https://phabricator.wikimedia.org/T209298 (10RobH) [17:07:35] (03CR) 10jenkins-bot: mariadb: Repool db2072 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474932 (owner: 10Jcrespo) [17:10:30] RECOVERY - Disk space on elastic1025 is OK: DISK OK [17:10:32] (03CR) 10GTirloni: [C: 032] toolforge: unbreak 'exec' group in clush [puppet] - 10https://gerrit.wikimedia.org/r/474942 (https://phabricator.wikimedia.org/T209701) (owner: 10GTirloni) [17:10:42] (03PS2) 10GTirloni: toolforge: unbreak 'exec' group in clush [puppet] - 10https://gerrit.wikimedia.org/r/474942 (https://phabricator.wikimedia.org/T209701) [17:15:23] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment, operational logs, and analytics cluster for jlinehan - https://phabricator.wikimedia.org/T207951 (10RobH) 05stalled>03Resolved As 
this request was granted, and is pending resolution just confirming it works, I'll ju... [17:15:33] banyek: we are getting ready to do the reboots soon of labsdb1004 and labsdb1006. You probably don't have to do much at all, and 1006 is postgres anyway. We'd just like you to confirm that labsdb1004 comes back online correctly when we are done (since there won't be any mariadb upgrade--I think?). [17:15:57] 👍 [17:16:45] 10Operations, 10Core Platform Team Backlog (Watching / External), 10HHVM, 10Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10BPirkle) [17:17:38] 10Operations, 10Core Platform Team Backlog (Watching / External), 10HHVM, 10Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10BPirkle) [17:17:46] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 54.3 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6fullscreenorgId=1 [17:19:40] starting on labsdb1004 now. banyek, do you want to ensure the mariadb process is down and ready? Once that is down, I'm shutting off postgres, I'll run update and reboot [17:19:51] I can shut off both services really.. [17:19:53] yes, I log in now [17:19:54] !log reload nginx configuration on elasticsearch codfw [17:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:15] postgres is down [17:20:38] RECOVERY - Elasticsearch HTTPS for production-search-psi-codfw on elastic2030 is OK: SSL OK - Certificate search.svc.codfw.wmnet valid until 2023-08-22 10:03:17 +0000 (expires in 1735 days) [17:20:47] All special pages on beta cluster give server errors currently.
[17:21:04] RECOVERY - Elasticsearch HTTPS for production-search-psi-codfw on elastic2027 is OK: SSL OK - Certificate search.svc.codfw.wmnet valid until 2023-08-22 10:03:17 +0000 (expires in 1735 days) [17:21:06] RECOVERY - Elasticsearch HTTPS for production-search-psi-codfw on elastic2035 is OK: SSL OK - Certificate search.svc.codfw.wmnet valid until 2023-08-22 10:03:17 +0000 (expires in 1735 days) [17:21:58] banyek: is mariadb ready for reboot? [17:22:14] not yey [17:22:16] not yet [17:22:20] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 74.81 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6fullscreenorgId=1 [17:22:24] Ok. let me know. We want to be quick :) [17:22:31] wikilabels is down for now [17:22:38] sure thing [17:22:56] ok, mysql down [17:23:12] you can do the reboot [17:24:26] please make sure packages are actually upgraded before the reboot :-) [17:24:41] upgrading now :) [17:25:37] !log rebooting labsdb1004 for upgrades T209517 [17:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:40] T209517: Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 [17:27:35] (03PS7) 10Herron: rsyslog::input::file add rsyslog imfile wrapper for file ingestion [puppet] - 10https://gerrit.wikimedia.org/r/474760 (https://phabricator.wikimedia.org/T205852) [17:28:01] (03CR) 10Herron: rsyslog::input::file add rsyslog imfile wrapper for file ingestion (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/474760 (https://phabricator.wikimedia.org/T205852) (owner: 10Herron) [17:28:56] banyek: it is up, can you confirm all is well? [17:29:25] postgres looks fine [17:29:25] checking [17:29:57] This things remote terminal is screwed, btw. [17:30:02] mysqld is not starting [17:30:04] I couldn't watch the reboot. [17:30:06] Ah. 
[17:30:08] fun [17:30:08] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to netbox for bd808 - https://phabricator.wikimedia.org/T208267 (10RobH) [17:30:28] (03PS8) 10Herron: rsyslog::input::file add rsyslog imfile wrapper for file ingestion [puppet] - 10https://gerrit.wikimedia.org/r/474760 (https://phabricator.wikimedia.org/T205852) [17:30:40] banyek: do you want me to take a look? [17:31:03] I'd appreciate it, because this host seems different than the others [17:31:16] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to netbox for bd808 - https://phabricator.wikimedia.org/T208267 (10RobH) a:05MoritzMuehlenhoff>03Volans So, in discussion with @Volans, it seems this could be read only for all WMF. Assigning to him for followup. My concern... [17:31:49] banyek: you mean mgmt? I can't login either :-/ [17:31:52] Loaded: loaded (/etc/init.d/mariadb) [17:32:04] so on jessie we didn't properly integrated systemd [17:32:09] sorry I meant bstorm_ [17:32:11] init.d was still being used [17:32:14] mostly due to 10.0 [17:32:38] and it's starting now [17:32:40] (03CR) 10Herron: [C: 032] rsyslog::input::file add rsyslog imfile wrapper for file ingestion [puppet] - 10https://gerrit.wikimedia.org/r/474760 (https://phabricator.wikimedia.org/T205852) (owner: 10Herron) [17:32:44] arturo: Yeah. It came up, but that never inspires confidence. 
[17:32:47] I started it [17:32:52] with init.d [17:33:05] and disabled the systemd wrapper [17:33:13] I started it too [17:33:53] the processes are up and running [17:33:54] shutdown and start seems clean [17:34:00] and replication catches up [17:34:15] when I shut it down I used init.d [17:34:25] I wasn't able to start it first because [17:34:30] *drumroll* [17:34:36] I forgot to suod [17:34:39] sudo [17:34:42] bstorm_: arturo we should soon migrate to stretch/10.1 on the new hosts [17:35:08] Yup [17:35:12] bstorm_: labsdb1004 is clear here, you can do labsdb1006 [17:35:26] We have been held up on hardware for moving them to VMs :-/ [17:35:30] bstorm_: arturo what is the 1 line summary of the status of that? [17:35:40] is it finally fixed or not yet? [17:35:43] When moved to VMs, they will be stretch. [17:35:51] Maybe? HP says our monitor is wrong [17:35:52] jynus: status of what? [17:36:06] arturo: bstorm_ is telling me [17:36:09] arturo: labvirts1019 and 1020 [17:36:10] ok [17:36:18] thanks banyek! [17:36:25] * jynus crosses fingers [17:36:29] :-P [17:36:40] some people complained not being able to use some 10.1 features [17:36:45] 10Operations, 10netops: Bird multihop BFD - https://phabricator.wikimedia.org/T209989 (10ayounsi) p:05Triage>03Normal [17:37:10] if it takes too long we should consider 10.3 instead, of which I have experimental packages [17:37:19] Fair :) [17:37:26] replication is catching up [17:37:31] labsdb1004 caught up [17:37:33] everything seems ok on our side [17:37:48] great [17:37:59] labsdb1006 is showing puppet is broken [17:38:23] onimisionipe: you around? [17:38:37] I'll wait to downtime it until we can sort that [17:39:42] arturo: yep [17:39:46] onimisionipe: https://www.irccloud.com/pastebin/ORPZjQVM/ [17:39:56] could this be related to your recent patch? [17:40:05] (03PS6) 10CRusnov: Make the puppetdb backend process primitive types for queries.
[software/cumin] - 10https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) [17:40:20] arturo.. Yes! [17:40:28] I will fix it [17:40:57] onimisionipe: thanks! do you think the fix can be now? so we can keep with our reboot schedules [17:41:09] 10Operations, 10netops: asw2-a-eqiad FPC2 reboot - https://phabricator.wikimedia.org/T209588 (10ayounsi) 05Open>03Resolved Not much we can do here, if it happen again though, we should RMA the device. [17:41:33] Great :) [17:46:20] PROBLEM - puppet last run on ms-be1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:47:26] arturo: I don't have access to that server. [17:47:45] gehel ^^ [17:47:59] onimisionipe: you want to live-patch or something? [17:49:58] bstorm_: I would say reboot labsdb1006 anyway? We had this window scheduled and announced [17:50:07] I would like to avoid scheduling this again [17:50:15] As long as package upgrades won't overwrite anything [17:50:17] I'll check [17:50:18] arturo: I will ping Guillaume. He knows what to do with this [17:50:26] but you can go ahead and reboot [17:50:30] thanks onimisionipe [17:50:34] yw! [17:50:40] arturo: unintended impact from a maps change, rolling back [17:51:07] thanks gehel! :-) [17:51:27] (03PS1) 10Gehel: Revert "profile::maps::osm_master: change osmupdater and osmimporter auth method to peer" [puppet] - 10https://gerrit.wikimedia.org/r/474955 [17:51:30] Thanks!! [17:51:39] crap, how did we miss that :( [17:51:53] no icinga checks or something? [17:51:56] * gehel forgets that there is another osm server somewhere [17:52:43] (03PS3) 10Herron: logstash: add type "apache-error" and use logstash core patterns [puppet] - 10https://gerrit.wikimedia.org/r/474813 (https://phabricator.wikimedia.org/T205852) [17:52:44] gehel: let me know when I should try the puppet run again [17:53:00] there is an icinga check, but no alert? 
I don't really know [17:53:08] (03CR) 10Gehel: [C: 032] Revert "profile::maps::osm_master: change osmupdater and osmimporter auth method to peer" [puppet] - 10https://gerrit.wikimedia.org/r/474955 (owner: 10Gehel) [17:53:32] (03PS2) 10Gehel: Revert "profile::maps::osm_master: change osmupdater and osmimporter auth method to peer" [puppet] - 10https://gerrit.wikimedia.org/r/474955 [17:53:35] (03CR) 10Gehel: [V: 032 C: 032] Revert "profile::maps::osm_master: change osmupdater and osmimporter auth method to peer" [puppet] - 10https://gerrit.wikimedia.org/r/474955 (owner: 10Gehel) [17:54:06] (03PS1) 10Gehel: Revert "maps: added use_proxy flag to set proxy" [puppet] - 10https://gerrit.wikimedia.org/r/474956 [17:55:18] (03CR) 10Gehel: [C: 032] Revert "maps: added use_proxy flag to set proxy" [puppet] - 10https://gerrit.wikimedia.org/r/474956 (owner: 10Gehel) [17:55:32] (03PS2) 10Gehel: Revert "maps: added use_proxy flag to set proxy" [puppet] - 10https://gerrit.wikimedia.org/r/474956 [17:55:34] (03CR) 10Gehel: [V: 032 C: 032] Revert "maps: added use_proxy flag to set proxy" [puppet] - 10https://gerrit.wikimedia.org/r/474956 (owner: 10Gehel) [17:56:05] (03PS3) 10Ladsgroup: Add federation configs for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474889 (https://phabricator.wikimedia.org/T204748) [17:56:06] bstorm_, arturo: we should be good now (cc onimisionipe) [17:56:15] Ok great [17:56:16] cool [17:56:29] onimisionipe: we'll need to go back to those changes [17:56:35] Running puppet, downtimed, then I'll stop the service upgrade and such. [17:56:41] bstorm_, arturo sorry for the mess! [17:56:46] (03PS7) 10CRusnov: Make the puppetdb backend process primitive types for queries [software/cumin] - 10https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) [17:57:10] gehel: the problem with the patch is perhaps some callers missing the new added parameters. Not a big deal :-) thanks!! 
[17:57:48] yeah, it's probably not hard to fix, but let's take our time to do it right this time [17:58:01] (03CR) 10Ladsgroup: [C: 032] "Noop for prod. Labs only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474889 (https://phabricator.wikimedia.org/T204748) (owner: 10Ladsgroup) [18:00:04] cscott, arlolra, subbu, halfak, and Amir1: That opportune time is upon us again. Time for a Services – Graphoid / Parsoid / Citoid / ORES deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181120T1800). [18:01:15] (03PS4) 10Ladsgroup: Add federation configs for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474889 (https://phabricator.wikimedia.org/T204748) [18:01:28] (03CR) 10Ladsgroup: [C: 032] Add federation configs for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474889 (https://phabricator.wikimedia.org/T204748) (owner: 10Ladsgroup) [18:01:45] good luck Amir1 [18:01:46] :P [18:02:31] (03Merged) 10jenkins-bot: Add federation configs for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474889 (https://phabricator.wikimedia.org/T204748) (owner: 10Ladsgroup) [18:02:47] addshore: Thanks :D [18:03:37] !log rebooting labsdb1006 for upgrades T209517 [18:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:41] T209517: Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 [18:04:35] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@7553087]: Deploy 2018 app fundraising announcement config (T204821) [18:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:39] T204821: 2018 English campaign fundraising on apps - https://phabricator.wikimedia.org/T204821 [18:06:21] (03PS1) 10Muehlenhoff: Set spare role for cloustore1008/1009 [puppet] - 10https://gerrit.wikimedia.org/r/474959 (https://phabricator.wikimedia.org/T209527) [18:06:36] RECOVERY - puppet last run on labsdb1006 is OK: OK: Puppet is 
currently enabled, last run 1 minute ago with 0 failures [18:08:12] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@7553087]: Deploy 2018 app fundraising announcement config (T204821) (duration: 03m 37s) [18:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:46] onimisionipe: I see a lot of this in the database log: `2018-11-20 18:04:03 GMT ERROR: permission denied for relation planet_osm_polygon` The reboot didn't cause it. It was already throwing that error. I'm not sure if anything is needed on there or if that is expected. [18:11:26] I don't really know postgres, but it seems like a GRANT problem to me [18:11:50] (03PS9) 10Dzahn: icinga/planet: add generic check_lastmod plugin and check planet updates [puppet] - 10https://gerrit.wikimedia.org/r/472713 (https://phabricator.wikimedia.org/T203208) [18:12:02] RECOVERY - puppet last run on ms-be1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:12:06] (03CR) 10Dzahn: icinga/planet: add generic check_lastmod plugin and check planet updates (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/472713 (https://phabricator.wikimedia.org/T203208) (owner: 10Dzahn) [18:12:40] (03CR) 10jerkins-bot: [V: 04-1] icinga/planet: add generic check_lastmod plugin and check planet updates [puppet] - 10https://gerrit.wikimedia.org/r/472713 (https://phabricator.wikimedia.org/T203208) (owner: 10Dzahn) [18:12:45] (03CR) 10jenkins-bot: Add federation configs for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474889 (https://phabricator.wikimedia.org/T204748) (owner: 10Ladsgroup) [18:15:14] bstorm_: This is still related to the last set of patches. CRs were reverted. It should be fine [18:15:27] please don't deploy anything, it might break things [18:15:34] (mediawiki config) [18:17:13] onimisionipe: ok. It seems to be continuing to throw errors for perms. I'm concerned its actually broken still. 
[18:17:19] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Create Federated Wikibase instance on Beta Commons (T204748) (duration: 00m 48s) [18:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:23] T204748: Create Federated Wikibase instance on Beta Commons - https://phabricator.wikimedia.org/T204748 [18:17:48] The postgres database is, that is [18:19:14] !log ladsgroup@deploy1001 Synchronized wmf-config/Wikibase.php: Create Federated Wikibase instance on Beta Commons, part II (T204748) (duration: 00m 47s) [18:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:28] 10Operations, 10cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (10aborrero) hey @MoritzMuehlenhoff aren't cloudvirts affected by this? [18:19:44] gehel ^^ database is still throwing permission errors [18:21:00] I'm done for the prod stuff [18:21:45] bstorm_: so, the change should not have touched the permissions themselves, it should have just added an entry to pghba.conf for peer authentication [18:21:54] Ok. [18:22:57] I'm guessing the CR was applied halfway or something. Maybe the OS user for that pg acct has not been created. [18:23:09] gehel: it may have blown away local changes on pg_hba.conf looking at it [18:23:39] Nah...I recognize these entries. [18:23:41] That couldnt' have happened [18:23:58] I'll reach out to other folks. Might be a totally different issue :) [18:24:48] yeah, it does not look like it has changed [18:25:00] bstorm_: what's the user that's broken? [18:26:35] gehel: Just looking in the log, so I'm not sure :) I'm not sure if any of the perms are right. I'd have to check with people running stuff on here. 
[18:27:05] every statement that shows in the log has a perm denied on it [18:27:07] yeah, and since puppet was failing compilation, it should not have introduced any change [18:27:18] That's what I'd expect [18:27:29] This did predate my work as well [18:27:41] Of course, that log is probably scrolled away. [18:27:47] 🤷🏻 [18:35:44] 10Operations, 10cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (10MoritzMuehlenhoff) See the task description, "For labvirt/cloudvirt I'll create a separate ticket as more steps are necessary." This needs a backport of SSBD support for the qemu version cl... [18:35:53] (03CR) 10Muehlenhoff: [C: 032] Set spare role for cloustore1008/1009 [puppet] - 10https://gerrit.wikimedia.org/r/474959 (https://phabricator.wikimedia.org/T209527) (owner: 10Muehlenhoff) [18:36:39] (03PS2) 10Muehlenhoff: Fix help text [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/474674 [18:36:51] (03CR) 10Muehlenhoff: [V: 032 C: 032] Fix help text [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/474674 (owner: 10Muehlenhoff) [18:37:40] (03PS2) 10Muehlenhoff: Remove PyBalStateCollector [puppet] - 10https://gerrit.wikimedia.org/r/474865 (https://phabricator.wikimedia.org/T183454) [18:41:16] (03PS1) 10Ladsgroup: Use integer for namespace id [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474963 (https://phabricator.wikimedia.org/T204748) [18:41:39] (03CR) 10Ladsgroup: [C: 032] Use integer for namespace id [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474963 (https://phabricator.wikimedia.org/T204748) (owner: 10Ladsgroup) [18:42:41] (03Merged) 10jenkins-bot: Use integer for namespace id [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474963 (https://phabricator.wikimedia.org/T204748) (owner: 10Ladsgroup) [18:43:56] ^ rebased [18:45:15] aaah yes Amir1 [18:45:23] (03PS2) 10Muehlenhoff: Remove ApacheStatusSimpleCollector [puppet] - 10https://gerrit.wikimedia.org/r/474864 
(https://phabricator.wikimedia.org/T183454) [18:45:26] we should define the constants in commonsettings perhaps [18:45:50] (03CR) 10Jforrester: "Sad that this is necessary. :-(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474963 (https://phabricator.wikimedia.org/T204748) (owner: 10Ladsgroup) [18:46:15] addshore: The fun part is that testing beta is hard because it's under migration to neutron [18:46:28] and everything is broken already [18:46:41] (03CR) 10Muehlenhoff: [C: 032] Remove ApacheStatusSimpleCollector [puppet] - 10https://gerrit.wikimedia.org/r/474864 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [18:49:36] (03PS4) 10Zoranzoki21: Enable RCPatrol and add some rights on srwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472744 (https://phabricator.wikimedia.org/T209250) [18:50:19] (03PS2) 10Muehlenhoff: Remove Diamond from DB roles [puppet] - 10https://gerrit.wikimedia.org/r/467264 (https://phabricator.wikimedia.org/T183454) [18:50:57] (03PS5) 10Zoranzoki21: Enable RCPatrol and add some rights on srwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472744 (https://phabricator.wikimedia.org/T209250) [18:51:59] (03CR) 10jenkins-bot: Use integer for namespace id [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474963 (https://phabricator.wikimedia.org/T204748) (owner: 10Ladsgroup) [18:55:58] (03PS2) 10Bstorm: openstack client: Install python3 stuff on stretch [puppet] - 10https://gerrit.wikimedia.org/r/474795 (https://phabricator.wikimedia.org/T200557) [18:56:39] !log start loading dumps into elastic codfw omega and psi from mwmaint2001 [18:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:56] (03CR) 10Bstorm: [C: 032] openstack client: Install python3 stuff on stretch [puppet] - 10https://gerrit.wikimedia.org/r/474795 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [19:07:47] (03PS1) 10Zoranzoki21: Disable FlaggedRevs, enable RC patrol and add 
rights on srwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474967 (https://phabricator.wikimedia.org/T209251) [19:13:57] (03PS4) 10Herron: logstash: add type "apache2-error" and use logstash core patterns [puppet] - 10https://gerrit.wikimedia.org/r/474813 (https://phabricator.wikimedia.org/T205852) [19:14:39] (03PS1) 10Mathew.onipe: osm::master: update parameters for osm::planet_sync [puppet] - 10https://gerrit.wikimedia.org/r/474968 [19:15:05] (03CR) 10jerkins-bot: [V: 04-1] osm::master: update parameters for osm::planet_sync [puppet] - 10https://gerrit.wikimedia.org/r/474968 (owner: 10Mathew.onipe) [19:22:03] (03CR) 10Volans: [C: 031] "Nice! Looks good to me." [software/cumin] - 10https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) (owner: 10CRusnov) [19:22:17] (03PS2) 10Mathew.onipe: osm::master: update parameters for osm::planet_sync [puppet] - 10https://gerrit.wikimedia.org/r/474968 [19:22:30] 10Operations, 10cloud-services-team (Kanban): netbox: wmcs reports - https://phabricator.wikimedia.org/T208576 (10GTirloni) So in other words, we want to look at the WMF fleet of servers and answer questions like: * Which servers are managed by WMCS? * Which role those servers have (hypervisor, DB, load balan... 
[19:23:21] 10Operations, 10cloud-services-team (Kanban): Netbox: Usage guidelines for WMCS - https://phabricator.wikimedia.org/T208576 (10GTirloni) [19:24:01] (03CR) 10Herron: logstash: add type "apache2-error" and use logstash core patterns (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/474813 (https://phabricator.wikimedia.org/T205852) (owner: 10Herron) [19:24:07] (03PS5) 10Herron: logstash: add type "apache2-error" and use logstash core patterns [puppet] - 10https://gerrit.wikimedia.org/r/474813 (https://phabricator.wikimedia.org/T205852) [19:28:04] (03CR) 10Herron: [C: 032] logstash: add type "apache2-error" and use logstash core patterns [puppet] - 10https://gerrit.wikimedia.org/r/474813 (https://phabricator.wikimedia.org/T205852) (owner: 10Herron) [19:40:52] 10Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10ayounsi) [19:45:04] (03PS1) 10Thcipriani: Scap prep should use latest MediaWiki version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474971 [19:46:53] (03PS1) 10Herron: peopleweb: ingest apache error logs with rsyslog for shipping [puppet] - 10https://gerrit.wikimedia.org/r/474973 (https://phabricator.wikimedia.org/T209860) [19:47:11] (03CR) 10Aaron Schulz: [C: 031] RunSingleJob: Check that JobExecutor has been loaded [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474885 (https://phabricator.wikimedia.org/T208922) (owner: 10Mobrovac) [19:48:46] 10Operations, 10cloud-services-team (Kanban): Netbox: Usage guidelines for WMCS - https://phabricator.wikimedia.org/T208576 (10RobH) So, this is my 2 cents and what I understand of netbox. @volans is the expert! I'm just on clinic duty and noticed this scroll across my feed, and figured I could clarify with... 
[19:50:21] (03PS1) 10Herron: rsyslog: output logs tagged 'input-file-apache2-error' to kafka [puppet] - 10https://gerrit.wikimedia.org/r/474974 (https://phabricator.wikimedia.org/T209860) [19:53:23] (03CR) 10Herron: [C: 032] rsyslog: output logs tagged 'input-file-apache2-error' to kafka [puppet] - 10https://gerrit.wikimedia.org/r/474974 (https://phabricator.wikimedia.org/T209860) (owner: 10Herron) [19:56:49] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:57:51] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:58:26] (03PS1) 10Bstorm: sonofgridengine: handle when an exec host is new and lacks a config [puppet] - 10https://gerrit.wikimedia.org/r/474975 (https://phabricator.wikimedia.org/T200557) [19:59:19] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: handle when an exec host is new and lacks a config [puppet] - 10https://gerrit.wikimedia.org/r/474975 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [19:59:30] (03CR) 10Herron: [C: 032] peopleweb: ingest apache error logs with rsyslog for shipping [puppet] - 10https://gerrit.wikimedia.org/r/474973 (https://phabricator.wikimedia.org/T209860) (owner: 10Herron) [19:59:36] (03PS2) 10Herron: peopleweb: ingest apache error logs with rsyslog for shipping [puppet] - 10https://gerrit.wikimedia.org/r/474973 (https://phabricator.wikimedia.org/T209860) [20:01:42] (03PS2) 10Bstorm: sonofgridengine: handle when an exec host is new and lacks a config [puppet] - 10https://gerrit.wikimedia.org/r/474975 (https://phabricator.wikimedia.org/T200557) [20:02:56] (03CR) 10Bstorm: [C: 032] sonofgridengine: handle when an exec host is new and lacks a config 
[puppet] - 10https://gerrit.wikimedia.org/r/474975 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [20:03:07] (03PS3) 10Bstorm: sonofgridengine: handle when an exec host is new and lacks a config [puppet] - 10https://gerrit.wikimedia.org/r/474975 (https://phabricator.wikimedia.org/T200557) [20:05:00] 10Operations, 10MediaWiki-Cache, 10Performance-Team (Radar), 10User-Elukey: mcrouter does not remove a memcached shard from consistent hashing when timeouts happen - https://phabricator.wikimedia.org/T208934 (10aaron) It definitely seems like something worth doing. Having the potential for high use cache k... [20:06:59] 10Operations, 10MediaWiki-Cache, 10Performance-Team (Radar), 10User-Elukey: mcrouter does not remove a memcached shard from consistent hashing when timeouts happen - https://phabricator.wikimedia.org/T208934 (10Imarlier) If we do go this route, we should have monitoring/alerting on access to the gutter boxes. [20:07:17] (03PS1) 10Zoranzoki21: Enable suppressredirect on srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474976 (https://phabricator.wikimedia.org/T210000) [20:13:11] (03CR) 10Eevans: [C: 031] Remove Diamond from restbase servers [puppet] - 10https://gerrit.wikimedia.org/r/474930 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [20:14:21] bstorm_: have you found the issue with labsdb1006? [20:14:55] I'm mostly out for today, but let me know if you want me to have a look tomorrow, or if it is urgent enough, I can dig into it later tonight [20:16:29] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Ship peopleweb apache2 logs to ELK - https://phabricator.wikimedia.org/T209860 (10herron) Apache error logs for peopleweb (rutherfordium) are now present in ELK. These can be found by searching in Kibana for `host:rutherfordium`, or `program.raw:input-... [20:18:09] gehel: I have punted so far. 
Hoping others know more and if there are problems, I get reports [20:29:49] (03PS1) 10Herron: rsyslog: ship logs with tag 'icinga' to kafka [puppet] - 10https://gerrit.wikimedia.org/r/474982 (https://phabricator.wikimedia.org/T7) [20:35:03] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1002/13622/" [puppet] - 10https://gerrit.wikimedia.org/r/474982 (https://phabricator.wikimedia.org/T7) (owner: 10Herron) [20:43:25] 10Operations, 10ChangeProp, 10SCB, 10User-jijiki: Memory consumption in Redis 3.2 vs Redis 2.8 - https://phabricator.wikimedia.org/T209890 (10Pchelolo) There actually is something wrong here. The [[ https://grafana.wikimedia.org/dashboard/db/eventbus?refresh=1m&panelId=37&fullscreen&orgId=1&from=now-6h&to=... [20:47:54] (03PS1) 10Herron: phabricator: ship apache error logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/474988 (https://phabricator.wikimedia.org/T141895) [20:50:59] 10Operations, 10ChangeProp, 10SCB, 10Services (watching), 10User-jijiki: Memory consumption in Redis 3.2 vs Redis 2.8 - https://phabricator.wikimedia.org/T209890 (10Pchelolo) [20:54:09] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Onboard at least 10 new non-sensitive log producers to the logging pipeline - https://phabricator.wikimedia.org/T205852 (10herron) [20:54:25] (03PS4) 10Cwhite: initial commit [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/471298 (https://phabricator.wikimedia.org/T208066) [20:54:36] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Onboard at least 10 new non-sensitive log producers to the logging pipeline - https://phabricator.wikimedia.org/T205852 (10herron) [20:55:20] 10Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10Volans) [20:55:40] 10Operations, 10DBA, 10Availability (MediaWiki-MultiDC), 10Performance-Team (Radar): Investigate solutions for MySQL connection pooling - https://phabricator.wikimedia.org/T196378 (10aaron) 
>>! In T196378#4550382, @jcrespo wrote: > You can help on your side (mediawiki) in parallel by preparing a way (conf... [20:59:03] 10Operations, 10Fundraising-Backlog, 10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10cwdent) 05Open>03Resolved @jkim_wikimedia cleaning house, please re-open if you need help. [20:59:08] (03PS1) 10Herron: jenkins: ship syslogs tagged 'jenkins' to ELK [puppet] - 10https://gerrit.wikimedia.org/r/474990 (https://phabricator.wikimedia.org/T143733) [21:00:25] (03CR) 10Cwhite: [C: 031] rsyslog: ship logs with tag 'icinga' to kafka [puppet] - 10https://gerrit.wikimedia.org/r/474982 (https://phabricator.wikimedia.org/T7) (owner: 10Herron) [21:02:14] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1002/13624/" [puppet] - 10https://gerrit.wikimedia.org/r/474990 (https://phabricator.wikimedia.org/T143733) (owner: 10Herron) [21:02:35] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1002/13623/" [puppet] - 10https://gerrit.wikimedia.org/r/474988 (https://phabricator.wikimedia.org/T141895) (owner: 10Herron) [21:07:37] (03PS2) 10Herron: rsyslog: ship logs with tag 'icinga' to ELK [puppet] - 10https://gerrit.wikimedia.org/r/474982 (https://phabricator.wikimedia.org/T7) [21:10:12] (03PS1) 10Ladsgroup: labs: Add mediainfo to federation config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474993 (https://phabricator.wikimedia.org/T204748) [21:22:11] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Onboard at least 10 new non-sensitive log producers to the logging pipeline - https://phabricator.wikimedia.org/T205852 (10herron) [21:22:22] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Ship peopleweb apache2 error logs to ELK - https://phabricator.wikimedia.org/T209860 (10herron) [21:24:15] (03PS1) 10Bstorm: sonofgridengine: take unnecessary remove out of script [puppet] - 10https://gerrit.wikimedia.org/r/474996 
(https://phabricator.wikimedia.org/T200557) [21:24:55] (03CR) 10Jforrester: labs: Add mediainfo to federation config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474993 (https://phabricator.wikimedia.org/T204748) (owner: 10Ladsgroup) [21:25:50] (03CR) 10Bstorm: [C: 032] sonofgridengine: take unnecessary remove out of script [puppet] - 10https://gerrit.wikimedia.org/r/474996 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [21:28:12] (03CR) 10Ladsgroup: [C: 032] "noop in prod" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474993 (https://phabricator.wikimedia.org/T204748) (owner: 10Ladsgroup) [21:29:31] (03Merged) 10jenkins-bot: labs: Add mediainfo to federation config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474993 (https://phabricator.wikimedia.org/T204748) (owner: 10Ladsgroup) [21:29:33] (03PS10) 10Dzahn: icinga/planet: add generic check_lastmod plugin and check planet updates [puppet] - 10https://gerrit.wikimedia.org/r/472713 (https://phabricator.wikimedia.org/T203208) [21:30:16] (03CR) 10Dzahn: icinga/planet: add generic check_lastmod plugin and check planet updates (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/472713 (https://phabricator.wikimedia.org/T203208) (owner: 10Dzahn) [21:30:24] ^ rebased on deploy1001 [21:30:46] (03CR) 10jerkins-bot: [V: 04-1] icinga/planet: add generic check_lastmod plugin and check planet updates [puppet] - 10https://gerrit.wikimedia.org/r/472713 (https://phabricator.wikimedia.org/T203208) (owner: 10Dzahn) [21:34:54] (03CR) 10jenkins-bot: labs: Add mediainfo to federation config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474993 (https://phabricator.wikimedia.org/T204748) (owner: 10Ladsgroup) [21:36:29] 10Operations, 10cloud-services-team (Kanban): Netbox: Usage guidelines for WMCS - https://phabricator.wikimedia.org/T208576 (10ayounsi) I think the main risk with Netbox and its ton of great features is to have outdated/incorrect data down 
the road. For example it's easy to mass import facts from our current i... [21:41:07] (03PS11) 10Dzahn: icinga/planet: add generic check_lastmod plugin and check planet updates [puppet] - 10https://gerrit.wikimedia.org/r/472713 (https://phabricator.wikimedia.org/T203208) [21:41:56] (03CR) 10jerkins-bot: [V: 04-1] icinga/planet: add generic check_lastmod plugin and check planet updates [puppet] - 10https://gerrit.wikimedia.org/r/472713 (https://phabricator.wikimedia.org/T203208) (owner: 10Dzahn) [21:46:15] (03PS12) 10Dzahn: icinga/planet: add generic check_lastmod plugin and check planet updates [puppet] - 10https://gerrit.wikimedia.org/r/472713 (https://phabricator.wikimedia.org/T203208) [21:55:10] 10Operations: upgrade krypton (webserver_misc_apps) to stretch - https://phabricator.wikimedia.org/T210008 (10Dzahn) [21:57:42] 10Operations, 10cloud-services-team (Kanban): Netbox: Usage guidelines for WMCS - https://phabricator.wikimedia.org/T208576 (10faidon) Thanks @GTirloni and @aborrero, useful conversation to have for sure :) A few thoughts on multiple fronts: - Netbox is not currently envisioned to be a configuration managemen... [22:01:16] 10Operations, 10Research-Programs, 10SRE-Access-Requests, 10Epic: access to analytics-privatedata-users for @toddleroux, @Afandian, & @RyanSteinberg - https://phabricator.wikimedia.org/T209298 (10RyanSteinberg) wikitech info for @RyanSteinberg Username: RyanSteinberg Instance shell account name: ryanmax [22:01:20] (03CR) 10Hashar: [C: 031] "Jenkins is run as a foreground process managed by systemd. There is a logging config file somewhere, but in short it spurts logs at WARNI" [puppet] - 10https://gerrit.wikimedia.org/r/474990 (https://phabricator.wikimedia.org/T143733) (owner: 10Herron) [22:06:01] 10Operations, 10Puppet, 10User-herron: Knock down puppet 4 deprecation warnings - https://phabricator.wikimedia.org/T193664 (10hashar) I think we got rid of all puppet deprecation. 
At least PuppetSyntax does not report any and we can make it stricter. That is the subject of T154915 and I have prepared https:... [22:07:23] (03CR) 10Pmiazga: [C: 032] BC Wikibase: override repoConceptBaseUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473835 (https://phabricator.wikimedia.org/T209352) (owner: 10Niedzielski) [22:08:02] (03CR) 10Pmiazga: [C: 032] Doc: add repoConceptBaseUri comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473292 (https://phabricator.wikimedia.org/T209352) (owner: 10Niedzielski) [22:09:06] (03CR) 10jerkins-bot: [V: 04-1] BC Wikibase: override repoConceptBaseUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473835 (https://phabricator.wikimedia.org/T209352) (owner: 10Niedzielski) [22:11:07] Dereckson: Bonne soirée. Est que vous avez une minute s.v.p? [22:11:37] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team, 10User-Smalyshev: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (10Smalyshev) Tried with incoming stream, the machine can't kee... 
[22:13:21] !log create volans account on routers - T208726 [22:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:24] T208726: Access to network devices for Riccardo (volans) - https://phabricator.wikimedia.org/T208726 [22:14:11] (03PS8) 10Pmiazga: Doc: add repoConceptBaseUri comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473292 (https://phabricator.wikimedia.org/T209352) (owner: 10Niedzielski) [22:15:08] hey, I'm going to to merge two config patches one InitialiseSettigns but no-op - just adding a documentation) and second one the beta config change [22:15:52] (03PS4) 10Pmiazga: beta: Wikibase: override repoConceptBaseUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473835 (https://phabricator.wikimedia.org/T209352) (owner: 10Niedzielski) [22:18:40] (03PS5) 10Pmiazga: beta: Wikibase: override repoConceptBaseUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473835 (https://phabricator.wikimedia.org/T209352) (owner: 10Niedzielski) [22:21:44] Hello. [22:21:51] (03CR) 10Cwhite: "Searching puppetdb, it appears that the memcached exporter is missing from these hosts:" [puppet] - 10https://gerrit.wikimedia.org/r/469250 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [22:21:54] Hauskatze: good evening. Yes, I have. 
[22:21:58] (03CR) 10Pmiazga: [C: 032] beta: Wikibase: override repoConceptBaseUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473835 (https://phabricator.wikimedia.org/T209352) (owner: 10Niedzielski) [22:22:29] Dereckson: :D great.- It's T210011 [22:22:30] T210011: Namespace conflicts on simple.wikiquote - https://phabricator.wikimedia.org/T210011 [22:22:43] (03CR) 10Cwhite: [C: 04-1] "Blocked by: Ie3ac2e0369ae657ec5808a0a9e642bf90d9ebbc6" [puppet] - 10https://gerrit.wikimedia.org/r/466907 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [22:23:19] (03Merged) 10jenkins-bot: beta: Wikibase: override repoConceptBaseUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473835 (https://phabricator.wikimedia.org/T209352) (owner: 10Niedzielski) [22:26:07] (03CR) 10jenkins-bot: Doc: add repoConceptBaseUri comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473292 (https://phabricator.wikimedia.org/T209352) (owner: 10Niedzielski) [22:26:09] (03CR) 10jenkins-bot: beta: Wikibase: override repoConceptBaseUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473835 (https://phabricator.wikimedia.org/T209352) (owner: 10Niedzielski) [22:31:36] (03PS1) 10MaxSem: Enable SVGs in page in group1, rest of group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475005 (https://phabricator.wikimedia.org/T208899) [22:33:34] Niharika: at 4^ [22:35:33] Hauskatze: can do [22:36:19] ^^ [22:36:21] Hauskatze: you need to unlock it something like that before or it's okay to fix it in closed mode? [22:36:36] Dereckson: it's okay to fix in closed mode [22:36:42] a closed wiki is a wiki only stewards can edit, isn't it? [22:36:52] (03PS1) 10Pmiazga: noop: Remove utf-8 characters from DOC comment for better readability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475006 [22:37:08] Dereckson: yes, but namespaceDupes shouldn't face any troubles [22:37:36] Pmiazga > set your console in UTF-8 ? 
[22:37:46] (03CR) 10Pmiazga: [C: 032] noop: Remove utf-8 characters from DOC comment for better readability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475006 (owner: 10Pmiazga) [22:38:01] if there's anything that needs to be done afterwards I can get it done locally; it won't alter its contents so it shouldn't be controversial [22:39:39] raynor: there are UTF-8 characters everywhere on this file, for example at the top to be sure it's opened in an UTF-8 console, or in namespaces names [22:40:22] Dereckson, that were put there by mistake, gerrit didn't show them. I wouldn't worry if this was an UTF character when it comes to the wiki name or a text in the wiki [22:40:50] Yes, but it's not responsible to use a non UTF-8 console to work on this repo. [22:40:59] but that was a typo, instead of - editor saved the en dash, that looks like that '–' [22:41:36] ah, I didn't know that, although I think thats a good code hygiene [22:42:32] to make sure that comments are readable on everything, because as I said, that utf-8 chars are just the editor (software) fantasy [22:43:05] Sure, this dash doesn't offer any advantage. But to hunt other UTF-8 characters wouldn't be productive (and your editor by the way is probably right to use a long dash in this context) [22:43:05] (03Merged) 10jenkins-bot: noop: Remove utf-8 characters from DOC comment for better readability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475006 (owner: 10Pmiazga) [22:43:47] There is also another thing peculiar on this repository: we want to maintain a parity between the repository code and the the code deployed on the server [22:43:53] (03PS1) 10Cwhite: mw_rc_irc: ensure diamond::collector absent [puppet] - 10https://gerrit.wikimedia.org/r/475009 (https://phabricator.wikimedia.org/T183454) [22:43:58] so you should only merge it if you're going to deploy it [22:44:06] Dereckson - yeah, I don't worry about different one, but I care about the one I merged earlier. 
[22:44:17] (also for the no op, beta cluster, etc. changes) [22:44:33] and yes, I'm going to deploy it right now, thats why I pushed it so fast [22:44:40] (03PS1) 10Cwhite: mw_rc_irc: remove diamond::collector resource [puppet] - 10https://gerrit.wikimedia.org/r/475010 (https://phabricator.wikimedia.org/T183454) [22:44:49] as I said earlier "hey, I'm going to to merge two config patches one InitialiseSettigns but no-op - just adding a documentation) and second one the beta config change" [22:44:58] * Dereckson nods [22:44:59] fine so [22:46:05] yeah, I know that all config changes have to be rebased and deployed [22:46:40] (03PS2) 10Cwhite: mw_rc_irc: remove diamond::collector resource and collector script [puppet] - 10https://gerrit.wikimedia.org/r/475010 (https://phabricator.wikimedia.org/T183454) [22:48:59] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (10colewhite) [22:50:25] (03CR) 10jenkins-bot: noop: Remove utf-8 characters from DOC comment for better readability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475006 (owner: 10Pmiazga) [22:51:02] Hauskatze: for simplewikiquote, we need another hypothesis, all pages and links look good [22:52:22] Hauskatze: it seems more an interwiki issue than a namespace issue [22:53:19] Dereckson: aha - but we cannot prevent "q:" from working there so maybe we can just leave the issue as it is [22:53:31] probably [22:53:35] breaking interwiki map uniformity is not adviceable [22:54:19] !log pmiazga@deploy1001 Synchronized wmf-config: SYNC: noop [[gerrit:473292|Doc: add repoConceptBaseUri comment (T209352)]][[gerrit:475006|noop: Remove utf-8 characters from DOC comment for better readability (T209352)]][[gerrit:473835|beta: Wikibase: override repoConceptBaseUri (T209352)]] (duration: 00m 49s) [22:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:25] T209352: [Spike 
2hrs] BetaCluster incorrectly points SameAS to production Wikidata - https://phabricator.wikimedia.org/T209352 [22:55:49] Dereckson: I noted on Meta that such a title starting with Q: will have unwanted effects [22:56:05] + creating stuff is not suposed to happen there [22:56:16] it seems I misunderstood the problem [22:56:28] probably because I should be sleeping instead of wiki-ing ;) [22:58:16] 10Operations, 10Analytics, 10SRE-Access-Requests: Allow access to Data Lake/Hive for Niharika - https://phabricator.wikimedia.org/T210022 (10Nuria) [22:58:41] looks like the beta is down, and I think it might be because of https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/473835/, I'm checking that [23:01:14] !log create vol.ans account on switches - T208726 [23:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:20] T208726: Access to network devices for Riccardo (volans) - https://phabricator.wikimedia.org/T208726 [23:01:33] Dereckson, how can I access deployment-deploy01.deployment-prep.eqiad.wmflabs? do I need some special access? 
[23:01:43] I'm trying to check the logstash-beta [23:01:57] logstash-beta.wmflabs.org [23:02:21] yeah, but for that I need login and pass, I stored that somewhere but I have no idea where ;/ [23:02:58] 10Operations, 10SRE-Access-Requests: Requesting access to production hosts for Jeena Huneidi - https://phabricator.wikimedia.org/T210027 (10jeena) [23:03:27] raynor: ah, yeah, the secrets are on deployment-deploy01 [23:03:41] I think they need to add you to the deployment-tin project [23:04:07] greg-g managed my access there ages ago, not sure how it happens nowadays [23:04:10] 10Operations, 10SRE-Access-Requests: Requesting access to production hosts for Jeena Huneidi - https://phabricator.wikimedia.org/T210027 (10jeena) [23:04:20] you can ask on #wikimedia-releng I guess [23:04:36] I've seen some access created there [23:04:59] good night people [23:05:04] bye [23:08:42] 10Operations, 10SRE-Access-Requests: Requesting access to production hosts for Jeena Huneidi - https://phabricator.wikimedia.org/T210027 (10greg) I approve Jeena's addition to these groups. [23:10:00] yeah, ask in -releng raynor. It's a "if we trust you we'll add you" thing. [23:14:54] 10Operations, 10netops: Access to network devices for Riccardo (volans) - https://phabricator.wikimedia.org/T208726 (10ayounsi) 05Open>03Resolved Pushed everywhere except Frack infra as I don't want to make any change there during the fundraising campaigns without approval. [23:30:59] (03PS1) 10Dzahn: iegreview: add stretch/php7 support [puppet] - 10https://gerrit.wikimedia.org/r/475015 (https://phabricator.wikimedia.org/T210008) [23:32:05] 10Operations, 10Maps, 10Traffic, 10Reading-Infrastructure-Team-Backlog (Kanban): Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732 (10Gehel) >>! In T186732#4762745, @Mholloway wrote: > @Gehel, do you mean specifically the additional EventBus load generated by activat... 
[23:34:19] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:34:24] (03CR) 10Dzahn: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/13625/krypton.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/475015 (https://phabricator.wikimedia.org/T210008) (owner: 10Dzahn) [23:39:39] (03PS1) 10Dzahn: racktables: add stretch/php7 support [puppet] - 10https://gerrit.wikimedia.org/r/475018 (https://phabricator.wikimedia.org/T210008) [23:40:03] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational [23:43:24] (03CR) 10Dzahn: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/13626/krypton.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/475018 (https://phabricator.wikimedia.org/T210008) (owner: 10Dzahn) [23:55:35] (03PS1) 10Dzahn: webserver_misc_apps: require libapache2-mod-php7.0 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/475021 (https://phabricator.wikimedia.org/T210008) [23:57:54] (03CR) 10Dzahn: [C: 032] webserver_misc_apps: require libapache2-mod-php7.0 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/475021 (https://phabricator.wikimedia.org/T210008) (owner: 10Dzahn) [23:59:28] 10Operations, 10Scap, 10Release-Engineering-Team (Watching / External): Scap deployers should have the ability to depool and restart HHVM - https://phabricator.wikimedia.org/T208813 (10greg)