[00:00:16] (03CR) 10BryanDavis: mwvagrant: Add sudoer rules for `vagrant up` (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/424481 (owner: 10BryanDavis) [00:35:32] 10Operations, 10Ops-Access-Requests: Access to stat100x and notebook1003.eqiad.wmnet for Jonas Kress - https://phabricator.wikimedia.org/T191308#4100865 (10Pine) @Jonas I am a little lost here, perhaps partly because I'm insufficiently familiar with WMF Analytics infrastructure, but it's unclear to me from thi... [02:57:09] (03PS2) 10BryanDavis: wiki replicas: drop views with missing tables [puppet] - 10https://gerrit.wikimedia.org/r/424166 (https://phabricator.wikimedia.org/T191387) [02:58:08] (03CR) 10BryanDavis: wiki replicas: drop views with missing tables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/424166 (https://phabricator.wikimedia.org/T191387) (owner: 10BryanDavis) [03:26:00] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 832.55 seconds [04:01:09] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 147.29 seconds [05:04:43] 10Operations, 10Collaboration-Team-Triage, 10DBA, 10StructuredDiscussions, 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610#4110960 (10jcrespo) [05:04:46] 10Operations, 10DBA, 10Patch-For-Review: Create a full backup of all external storage records that would be easy to restore/setup a temporary delayed slave - https://phabricator.wikimedia.org/T153440#4110958 (10jcrespo) 05Open>03Resolved es1 is in es2003/es2003 as a binary copy. es2 and es3 is in es2002... [05:07:03] 10Operations, 10Collaboration-Team-Triage, 10DBA, 10StructuredDiscussions, 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610#4110961 (10jcrespo) Let's proceed with this, also let's create new clusters to avoid over-sized table... 
[05:16:51] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424520 [05:16:54] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424520 [05:20:11] 10Operations, 10ops-codfw, 10DBA: db2069 RAID with predictive failure - https://phabricator.wikimedia.org/T191593#4110966 (10Marostegui) [05:20:22] 10Operations, 10ops-codfw, 10DBA: db2069 RAID with predictive failure - https://phabricator.wikimedia.org/T191593#4110978 (10Marostegui) p:05Triage>03Normal [05:20:28] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424520 (owner: 10Marostegui) [05:22:04] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424520 (owner: 10Marostegui) [05:22:06] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424520 (owner: 10Marostegui) [05:24:26] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1066 after alter table (duration: 00m 55s) [05:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:31:50] (03PS1) 10Marostegui: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424521 (https://phabricator.wikimedia.org/T187089) [05:32:59] (03CR) 10jerkins-bot: [V: 04-1] db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424521 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [05:33:43] (03PS2) 10Marostegui: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424521 (https://phabricator.wikimedia.org/T187089) [05:35:31] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424521 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [05:36:43] 
(03Merged) 10jenkins-bot: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424521 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [05:38:26] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424521 (https://phabricator.wikimedia.org/T187089) (owner: 10Marostegui) [05:44:02] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1114 for alter table (duration: 00m 53s) [05:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:44:10] !log Deploy schema change on db1114 - T187089 T185128 T153182 [05:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:44:17] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [05:44:17] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [05:44:17] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [05:56:28] (03PS1) 10Marostegui: db2046.yaml: Change binlog format [puppet] - 10https://gerrit.wikimedia.org/r/424522 (https://phabricator.wikimedia.org/T191275) [05:56:49] (03PS1) 10Marostegui: db-codfw.php: db2046 is s6 candidate master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424523 (https://phabricator.wikimedia.org/T191275) [05:58:57] (03CR) 10Marostegui: [C: 032] db2046.yaml: Change binlog format [puppet] - 10https://gerrit.wikimedia.org/r/424522 (https://phabricator.wikimedia.org/T191275) (owner: 10Marostegui) [05:59:52] !log Restart MySQL on db2046 to change its binlog format - T191275 [05:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:59] T191275: Prepare and indicate proper master db failover candidates for all codfw database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T191275 
[06:02:58] (03CR) 10Marostegui: [C: 032] db-codfw.php: db2046 is s6 candidate master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424523 (https://phabricator.wikimedia.org/T191275) (owner: 10Marostegui) [06:04:14] (03Merged) 10jenkins-bot: db-codfw.php: db2046 is s6 candidate master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424523 (https://phabricator.wikimedia.org/T191275) (owner: 10Marostegui) [06:05:27] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2046 as candidate master (duration: 00m 59s) [06:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:32] (03CR) 10jenkins-bot: db-codfw.php: db2046 is s6 candidate master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424523 (https://phabricator.wikimedia.org/T191275) (owner: 10Marostegui) [06:27:12] (03PS1) 10Marostegui: db-codfw.php: Depool db2047 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424524 (https://phabricator.wikimedia.org/T191275) [06:27:59] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh] [06:28:29] PROBLEM - puppet last run on labcontrol1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/sbin/enable-puppet] [06:29:38] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2047 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424524 (https://phabricator.wikimedia.org/T191275) (owner: 10Marostegui) [06:29:51] (03PS1) 10Marostegui: db2047.yaml: Change binlog format [puppet] - 10https://gerrit.wikimedia.org/r/424525 (https://phabricator.wikimedia.org/T191275) [06:30:41] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2047 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424524 (https://phabricator.wikimedia.org/T191275) (owner: 10Marostegui) [06:30:56] (03CR) 10jenkins-bot: db-codfw.php: Depool db2047 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424524 (https://phabricator.wikimedia.org/T191275) (owner: 10Marostegui) [06:32:33] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2047 to change binlog format, upgrade mariadb and kernel (duration: 00m 59s) [06:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:23] !log Stop MySQL on db2047 for binlog format change, upgrade kernel and mariadb [06:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:18] (03CR) 10Marostegui: [C: 032] db2047.yaml: Change binlog format [puppet] - 10https://gerrit.wikimedia.org/r/424525 (https://phabricator.wikimedia.org/T191275) (owner: 10Marostegui) [06:50:35] (03PS1) 10Marostegui: db-codfw.php: db2047 is now candidate master in s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424526 (https://phabricator.wikimedia.org/T191275) [06:57:59] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [06:58:20] RECOVERY - puppet last run on labcontrol1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:08:46] 10Operations, 10wikidiff2, 10Patch-For-Review, 10WMDE-QWERTY-Team-Board: Update wikidiff2 library on the WMF 
production cluster - https://phabricator.wikimedia.org/T190717#4111055 (10MoritzMuehlenhoff) Usually spread out over the course of a few days, especially when we need to prune the bytecode cache. [07:10:26] (03CR) 10Joal: "We need companion patches so that oozie SLA reflects this change. Current conf sets the SLA alarm to month+36days, which would be too low if " [puppet] - 10https://gerrit.wikimedia.org/r/424473 (owner: 10Nuria) [07:23:10] (03CR) 10Marostegui: [C: 032] db-codfw.php: db2047 is now candidate master in s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424526 (https://phabricator.wikimedia.org/T191275) (owner: 10Marostegui) [07:24:25] (03Merged) 10jenkins-bot: db-codfw.php: db2047 is now candidate master in s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424526 (https://phabricator.wikimedia.org/T191275) (owner: 10Marostegui) [07:26:46] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2047 after changing binlog format, upgrade mariadb and kernel (duration: 00m 59s) [07:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:19] (03CR) 10jenkins-bot: db-codfw.php: db2047 is now candidate master in s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424526 (https://phabricator.wikimedia.org/T191275) (owner: 10Marostegui) [07:34:56] 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-fgiunchedi: rack/setup/install ms-be204[0-3] - https://phabricator.wikimedia.org/T189633#4111106 (10fgiunchedi) 05Open>03Resolved This is complete! [07:37:10] 10Operations: rack/setup/install ms-be104[0-3].eqiad.wmnet - https://phabricator.wikimedia.org/T190081#4111109 (10fgiunchedi) A note on the partitioning since it changed from the last batch of dell ms-be: we're standardizing on `sda` and `sdb` being the SSD disks and the first two in boot order, with the spinnin... 
[07:40:13] !log removed mediawiki-deployment07 from deployment-prep (T191578) [07:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:19] T191578: deployment-mediawiki07: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - https://phabricator.wikimedia.org/T191578 [07:54:02] (03CR) 10Marostegui: [C: 032] "I have done a few more tests and they all looked good!" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/419725 (owner: 10Rduran) [07:54:57] (03CR) 10Marostegui: [V: 032 C: 032] Add port of osc_host.sh [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/419725 (owner: 10Rduran) [07:56:30] PROBLEM - DPKG on labtestweb2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:59:30] PROBLEM - Check systemd state on labtestweb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:00:16] !log Stop MySQL on db1114 for kernel and mariadb upgrade [08:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:23] labtestweb is me [08:06:25] 10Operations, 10Beta-Cluster-Infrastructure, 10Patch-For-Review, 10Prometheus-metrics-monitoring, 10User-fgiunchedi: Move deployment-prep redis instances to stretch - https://phabricator.wikimedia.org/T179371#4111142 (10fgiunchedi) a:05fgiunchedi>03None Indeed, I'm removing myself as assignee since I... 
[08:07:28] !log upload prometheus-burrow-exporter 0.0.5 to jessie/stretch-wikimedia - T188719 [08:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:34] T188719: Upgrade Kafka Burrow to 1.0 - https://phabricator.wikimedia.org/T188719 [08:07:50] !log upgrade prometheus-burrow-exporter on kafkamon1001/2001 - T188719 [08:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:19] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424536 [08:11:00] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424536 (owner: 10Marostegui) [08:11:00] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[python-ldap],Package[uwsgi] [08:12:12] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424536 (owner: 10Marostegui) [08:12:26] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424536 (owner: 10Marostegui) [08:13:36] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1114 (duration: 00m 59s) [08:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:11] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#3961955 (10elukey) Nothing varnish-related happened on Feb 6th as far as I can see from the ops SAL: https://tools.wmflabs.org/sal/production... 
[08:21:29] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424537 [08:28:14] !log installing apache security updates on releases.wikimedia.org [08:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:01] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424537 (owner: 10Marostegui) [08:30:24] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424537 (owner: 10Marostegui) [08:30:39] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424537 (owner: 10Marostegui) [08:31:43] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1114 (duration: 00m 59s) [08:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:23] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424538 [08:36:20] PROBLEM - puppet last run on releases2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[liblua5.1.0-dev] [08:41:58] !log installing apache security updates on mwlog* [08:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:05] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424538 (owner: 10Marostegui) [08:43:30] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424538 (owner: 10Marostegui) [08:43:45] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424538 (owner: 10Marostegui) [08:44:49] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1114 (duration: 00m 59s) [08:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:42] !log installed apache updates to gerrit2001/cobalt [08:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:49] moritzm: ^ [08:48:57] ack, thanks [08:49:59] !log installing apache security updates on prometheus hosts [08:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:52] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424540 [08:51:26] 10Operations, 10Ops-Access-Requests: Access to stat100x and notebook1003.eqiad.wmnet for Jonas Kress - https://phabricator.wikimedia.org/T191308#4111206 (10Aklapper) @Pine: That was already answered in the previous comment by Jonas, I'd say. 
[08:55:30] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424540 (owner: 10Marostegui) [08:56:45] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424540 (owner: 10Marostegui) [08:57:56] !log gerrit: restarting services to pick up openjdk updates [08:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:08] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1114 (duration: 00m 59s) [09:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:47] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424541 [09:01:20] RECOVERY - puppet last run on releases2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:02:11] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts] [09:02:41] PROBLEM - puppet last run on vega is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 3 minutes ago with 5 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikibase/wikiba.se-deploy] [09:03:27] ^ And those failures is why I want an internal-facing git.svc.*.wmnet or somesuch for puppet and other things to pull from when gerrit is unavailable temporarily [09:03:33] s/is why/are why/ [09:05:16] no_justification: could an external redundancy be used instead? [09:05:53] Like fall back to github? [09:06:03] multiple frontends, even if they have to be active-passive so a rolling restart can be made, one at a time [09:06:22] Ah. Ideally yes. 
But there's a couple of major blockers [09:06:42] I understand, I just was throwing an idea instead of an internal one [09:07:07] shared cache store (right now H2 on-disk, I'd like to swap for Redis or somesuch) [09:07:07] shared indexing (right now Lucene on-disk, there's work to support elasticsearch...it sucks right now tho) [09:07:22] Also: backing the actual git directories with swift or somesuch [09:07:27] ^ Last one is the big one, obvs [09:07:35] yeah, sharing files is non-obvious [09:07:42] if app doesn't support remote storing [09:08:06] I also heard database support is going to be dropped [09:08:19] It probably does, to some degree. I'm /sure/ google has something they use here [09:08:23] so even state/config will be local [09:08:24] (Gerrit has a lot of DI) [09:08:56] Yeah, the last ~3 years they've slowly been killing the database (user account info is now in All-Users.git, rather than `reviewdb.accounts`) [09:10:56] also there is not a good way to perform backups [09:11:29] I reverted some db config, but if data is not there, it is useless to do database backups (other than full disaster) [09:11:51] Preachin' to the choir. I had some rather...animated...discussions with upstream about this $YEARS ago [09:22:05] (03CR) 10Filippo Giunchedi: "> Patch Set 26:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/423931 (owner: 10Ottomata) [09:24:04] !log installing apache security updates on planet1001/planet.wikimedia.org [09:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:15] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424541 (owner: 10Marostegui) [09:25:20] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install wdqs200[4-6] - https://phabricator.wikimedia.org/T187800#4111330 (10Gehel) 05Open>03Resolved Service configuration is tracked in T187766, this can be closed. 
[09:25:36] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424541 (owner: 10Marostegui) [09:25:57] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: New WDQS clusters eqiad + codfw - https://phabricator.wikimedia.org/T182991#4111339 (10Gehel) [09:26:02] (03PS1) 10Rduran: Make sure the dblist file has the proper extension [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/424545 [09:26:25] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: New WDQS clusters eqiad + codfw - https://phabricator.wikimedia.org/T182991#3840568 (10Gehel) [09:26:46] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1114 (duration: 00m 59s) [09:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:19] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#4111348 (10ema) >>! In T187014#4110582, @Nuria wrote: > Varnish5 rollout might have something to do with this? https://gerrit.wikimedia.org/r... [09:28:17] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424541 (owner: 10Marostegui) [09:31:47] PROBLEM - clamd running on mendelevium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (clamav), command name clamd [09:31:57] PROBLEM - Check systemd state on mendelevium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[09:32:17] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:32:47] RECOVERY - puppet last run on vega is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:33:48] RECOVERY - clamd running on mendelevium is OK: PROCS OK: 1 process with UID = 111 (clamav), command name clamd [09:33:57] RECOVERY - Check systemd state on mendelevium is OK: OK - running: The system is fully operational [09:36:28] hmmh, clamd was oom-killed on mendelevium [09:37:35] huge load /CPU spike preceding that [09:38:25] (03CR) 10Jcrespo: [C: 031] Make sure the dblist file has the proper extension [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/424545 (owner: 10Rduran) [09:39:15] !log Deploy test alter table on db2038 to test osc_host.py in core [09:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:55] (03CR) 10Jcrespo: "Because we do not yet have a proper way to do a setup of mariadb/mysql, could you document the assumptions for testing? 
Maybe it was done " [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/421340 (owner: 10Rduran) [09:42:24] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for uwsgi-puppetboard [puppet] - 10https://gerrit.wikimedia.org/r/424546 (https://phabricator.wikimedia.org/T135991) [09:45:01] !log installing apache security updates on graphite hosts [09:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:56] (03CR) 10Rduran: "> Because we do not yet have a proper way to do a setup of" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/421340 (owner: 10Rduran) [09:55:00] (03CR) 10Jcrespo: "> There are no assumptions at all (apart from having the test" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/421340 (owner: 10Rduran) [09:58:25] 10Operations, 10wikidiff2, 10Patch-For-Review, 10WMDE-QWERTY-Team-Board: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4111420 (10Lea_WMDE) Technically, there should not be a problem with gradual updates. From the product perspective, users should ideall... [10:09:09] 10Operations, 10Dumps-Generation, 10Patch-For-Review: data retrieval/write issues via NFS on dumpsdata1001, impacting some dump jobs - https://phabricator.wikimedia.org/T191177#4111445 (10ArielGlenn) For people keeping track, I'm trying a couple scripts that create/read/write/remove large numbers of files in... 
[10:14:28] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4111464 (10Deskana) [10:16:01] (03CR) 10Marostegui: [V: 032 C: 032] "Works as intended" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/424545 (owner: 10Rduran) [10:17:33] (03PS1) 10Marostegui: db-eqiad.php: Restore db1114 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424550 [10:19:07] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1114 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424550 (owner: 10Marostegui) [10:20:22] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1114 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424550 (owner: 10Marostegui) [10:20:35] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1114 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424550 (owner: 10Marostegui) [10:21:35] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore original weight for db1114 (duration: 01m 00s) [10:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:21] (03CR) 10EddieGP: apache, wwwportals: De-duplicate vhost code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/397770 (owner: 10EddieGP) [10:38:57] (03PS1) 10Muehlenhoff: Remove Varnish config for image scaler cluster [puppet] - 10https://gerrit.wikimedia.org/r/424552 (https://phabricator.wikimedia.org/T188062) [10:38:59] (03PS1) 10Muehlenhoff: Remove LVS/pybal config for image scaler cluster [puppet] - 10https://gerrit.wikimedia.org/r/424553 (https://phabricator.wikimedia.org/T188062) [10:39:34] (03CR) 10jerkins-bot: [V: 04-1] Remove LVS/pybal config for image scaler cluster [puppet] - 10https://gerrit.wikimedia.org/r/424553 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff) [10:48:58] (03PS1) 10Elukey: [WIP] burrow: configuration upgrade to support 1.0 
[puppet] - 10https://gerrit.wikimedia.org/r/424557 [10:49:55] (03PS1) 10Rduran: Make WMFMariaDB.py and recover_section.py flake8 compliant [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/424558 [10:50:18] (03PS2) 10Muehlenhoff: Remove LVS/pybal config for image scaler cluster [puppet] - 10https://gerrit.wikimedia.org/r/424553 (https://phabricator.wikimedia.org/T188062) [10:51:38] (03PS2) 10Elukey: [WIP] burrow: configuration upgrade to support 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/424557 [10:55:15] (03PS1) 10Rduran: Add tests for the argument parsing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/424560 [10:57:30] (03CR) 10Jcrespo: "This is very cool! - but let me update recover_section.py from the HEAD of the other repo first!" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/424558 (owner: 10Rduran) [11:00:13] hey, toolforge puppetmaster is broken wrt. puppet, any idea? [11:00:17] Apr 6 06:25:42 tools-puppetmaster-01 apache2[22572]: AH00526: Syntax error on line 8 of /etc/apache2/sites-enabled/50-puppetmaster-wikimedia-org.conf: [11:00:18] Apr 6 06:25:42 tools-puppetmaster-01 apache2[22572]: Invalid command 'SSLOpenSSLConfCmd', perhaps misspelled or defined by a module not included in the server configuration [11:03:34] (03PS1) 10Jcrespo: dump_section.py: Rename dump_sections to singular, update to HEAD [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/424564 [11:04:57] 10Puppet, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: Long-lived cherry-picks on deployment-puppetmaster02.deployment-prep.eqiad.wmflabs - https://phabricator.wikimedia.org/T191294#4111574 (10EddieGP) That check is a joke. It's there because we don't want long-lived cherry-picks on the puppetm... 
[11:06:02] RECOVERY - Check systemd state on labtestweb2001 is OK: OK - running: The system is fully operational [11:06:12] RECOVERY - DPKG on labtestweb2001 is OK: All packages OK [11:06:41] (03PS3) 10Elukey: [WIP] burrow: configuration upgrade to support 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/424557 [11:07:11] arturo: hi, that's https://phabricator.wikimedia.org/T159254 [11:07:38] (03CR) 10Jcrespo: [V: 032 C: 032] dump_section.py: Rename dump_sections to singular, update to HEAD [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/424564 (owner: 10Jcrespo) [11:07:49] seems like Antoine has created a patch which he then abandoned with "Generalized by Arturo with https://gerrit.wikimedia.org/r/#/c/389480/" :-) [11:09:23] thanks moritzm [11:12:24] moritzm: seems like a timing issue: `Unpacking apache2 (2.4.10-10+deb8u12) over (2.4.10-10+deb8u11+wmf1)` but then we have `2.4.10-10+deb8u12` and `2.4.10-10+deb8u12+wmf1`, the latter with higher pinning [11:12:51] (03PS2) 10Jcrespo: Make WMFMariaDB.py and recover_section.py flake8 compliant [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/424558 (owner: 10Rduran) [11:13:05] could be, yes [11:13:46] upgrading the toolforge puppet masters to stretch would also fix this (the puppetmasters in production have been based on stretch for a few weeks) [11:15:07] (03CR) 10Jcrespo: [C: 031] "Looks good to me now, although maybe the recover_section.py change is unnecessary? 
Some style errors had been solved on the latest version" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/424558 (owner: 10Rduran) [11:16:35] yeah, we have that on our TODO [11:17:10] (03PS2) 10Arturo Borrero Gonzalez: mwvagrant: Add sudoer rules for `vagrant up` [puppet] - 10https://gerrit.wikimedia.org/r/424481 (owner: 10BryanDavis) [11:17:54] (03CR) 10Arturo Borrero Gonzalez: [C: 032] mwvagrant: Add sudoer rules for `vagrant up` [puppet] - 10https://gerrit.wikimedia.org/r/424481 (owner: 10BryanDavis) [11:20:48] (03PS4) 10Elukey: [WIP] burrow: configuration upgrade to support 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/424557 [11:21:37] 10Puppet, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: Long-lived cherry-picks on deployment-puppetmaster02.deployment-prep.eqiad.wmflabs - https://phabricator.wikimedia.org/T191294#4111635 (10MarcoAurelio) Maybe reset and repull everything or there are changes that were directly done on the pu... [11:24:28] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4111648 (10Nirmos) Not sure I understand your question. Are you asking how to fix the lint errors so that wikis can switch from Tidy to R... [11:26:40] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4111651 (10Elitre) >>! In T133410#4111612, @Zoranzoki21 wrote: > Hi, > happy holidays! I have question. How to we fix tags so migration c... 
[11:31:54] (03PS5) 10Elukey: [WIP] burrow: configuration upgrade to support 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/424557 [11:36:11] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [11:37:35] (03PS6) 10Elukey: [WIP] burrow: configuration upgrade to support 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/424557 [11:39:35] 10Puppet, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: Long-lived cherry-picks on deployment-puppetmaster02.deployment-prep.eqiad.wmflabs - https://phabricator.wikimedia.org/T191294#4111688 (10EddieGP) The check asks "Is the average number of cherry-picks on the puppet master in the last 48h gr... [11:41:10] !log upgrading deployment-prep to wikidiff2 1.6.0 (T190717) [11:41:11] T190717: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717 [11:41:36] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#4111691 (10mbaluta) Please note that number of page views prior to 6th February seems incorrect from our perspective too - number of Opera Mi... [11:42:15] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4111693 (10Tgr) [11:42:22] PROBLEM - Check systemd state on labtestweb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:43:13] (03PS2) 10Mark Bergsma: Create FSM test cases according to the RFC 4271 definition [debs/pybal] - 10https://gerrit.wikimedia.org/r/423995 [11:43:37] (03PS7) 10Elukey: [WIP] burrow: configuration upgrade to support 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/424557 [11:45:52] (03CR) 10Mark Bergsma: Introduce server.is_pooled and make server.pooled usage more consistent (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/421053 (owner: 10Mark Bergsma) [11:46:17] (03CR) 10Rduran: "> Looks good to me now, although maybe the recover_section.py change" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/424558 (owner: 10Rduran) [11:46:40] (03CR) 10Jcrespo: [C: 031] mediawiki: Start deleteAutoPatrolLogs from Wikidata logging table [puppet] - 10https://gerrit.wikimedia.org/r/424300 (https://phabricator.wikimedia.org/T189596) (owner: 10Ladsgroup) [11:48:26] If anyone has time to puppet-merge a one-line beta-only change just for the sake of not having it as cherry-pick but properly merged in beta, it'd be appreciated. :) https://gerrit.wikimedia.org/r/c/424361/ [11:50:10] !log start of ladsgroup@terbium:~$ mwscript deleteAutoPatrolLogs.php --wiki=fawiki --before 20180223210426 --sleep 2 (T184485) [11:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:17] T184485: Stop logging autopatrol actions - https://phabricator.wikimedia.org/T184485 [11:51:32] (03CR) 10EddieGP: "Puppet compiler should verify this does not actually change anything in the configuration file." 
[puppet] - 10https://gerrit.wikimedia.org/r/424371 (owner: 10EddieGP) [11:59:42] (03PS8) 10Elukey: [WIP] burrow: configuration upgrade to support 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/424557 [12:10:39] (03PS9) 10Elukey: [WIP] burrow: configuration upgrade to support 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/424557 [12:15:21] (03PS10) 10Elukey: [WIP] burrow: configuration upgrade to support 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/424557 [12:16:59] 10Operations, 10wikidiff2, 10Patch-For-Review, 10WMDE-QWERTY-Team-Board: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4111781 (10MoritzMuehlenhoff) >>! In T190717#4111420, @Lea_WMDE wrote: > > Could you put a link to beta here, once it is live there?... [12:18:55] (03PS11) 10Elukey: [WIP] burrow: configuration upgrade to support 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/424557 [12:24:38] (03PS12) 10Elukey: [WIP] burrow: configuration upgrade to support 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/424557 [12:32:51] (03PS13) 10Elukey: [WIP] burrow: configuration upgrade to support 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/424557 [12:37:32] (03PS14) 10Elukey: [WIP] burrow: configuration upgrade to support 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/424557 [12:38:19] !log installing apache security updates on the Kibana nodes of the logstash cluster [12:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:55] 10Operations, 10Stewards-and-global-tools, 10Wikimedia-IRC-RC-Server: Create RC feed for login.wikimedia - https://phabricator.wikimedia.org/T191625#4111859 (10MarcoAurelio) [12:50:41] (03PS15) 10Elukey: [WIP] burrow: configuration upgrade to support 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/424557 [12:51:23] (03PS2) 10Mark Bergsma: Handle non-IDLE states in idleHoldTimeEvent [debs/pybal] - 10https://gerrit.wikimedia.org/r/423997 [12:51:25] (03PS2) 10Mark Bergsma: Fix sendNotification 
invocation [debs/pybal] - 10https://gerrit.wikimedia.org/r/423998 [12:51:27] (03PS2) 10Mark Bergsma: Fix two typos in bgp.FSM.openReceived [debs/pybal] - 10https://gerrit.wikimedia.org/r/423999 [12:51:29] (03PS2) 10Mark Bergsma: Fix holdTimeEvent incrementing connectRetryCounter twice [debs/pybal] - 10https://gerrit.wikimedia.org/r/424000 [12:51:31] (03PS2) 10Mark Bergsma: Fix distinction between events 19 and 20 (delayOpen) [debs/pybal] - 10https://gerrit.wikimedia.org/r/424001 [12:51:33] (03PS2) 10Mark Bergsma: Handle state ESTABLISHED in versionError (event 24) [debs/pybal] - 10https://gerrit.wikimedia.org/r/424002 [12:51:35] (03PS2) 10Mark Bergsma: Handle state OPENSENT in keepAliveEvent (event 11) [debs/pybal] - 10https://gerrit.wikimedia.org/r/424003 [12:51:38] (03PS2) 10Mark Bergsma: Handle state OPENSENT in keepAliveReceived [debs/pybal] - 10https://gerrit.wikimedia.org/r/424004 [12:51:40] (03PS2) 10Mark Bergsma: Correctly handle event 9 (connectRetryTimeEvent) in ACTIVE [debs/pybal] - 10https://gerrit.wikimedia.org/r/424005 [12:51:42] (03PS2) 10Mark Bergsma: Fix typo in FSM.delayOpenTimeEvent [debs/pybal] - 10https://gerrit.wikimedia.org/r/424006 [12:51:44] (03PS2) 10Mark Bergsma: Move updating of FSM metric labels to the protocol's connectionMade [debs/pybal] - 10https://gerrit.wikimedia.org/r/424007 [12:51:46] (03PS2) 10Mark Bergsma: Ignore headerError and openMessageError in state IDLE [debs/pybal] - 10https://gerrit.wikimedia.org/r/424008 [12:51:48] (03PS2) 10Mark Bergsma: Cleanup module for consistency [debs/pybal] - 10https://gerrit.wikimedia.org/r/424009 [12:51:50] (03PS2) 10Mark Bergsma: Fix test case ESTABLISHED event 27 hold time nonzero [debs/pybal] - 10https://gerrit.wikimedia.org/r/424010 [12:51:52] (03PS2) 10Mark Bergsma: Add test cases for implemented event 25 and fix OPENSENT [debs/pybal] - 10https://gerrit.wikimedia.org/r/424011 [12:52:31] (03PS1) 10Mark Bergsma: Move BGP constants into their own module [debs/pybal] - 
10https://gerrit.wikimedia.org/r/424580 [12:52:33] (03PS1) 10Mark Bergsma: Split off bgp.FSM into its own module [debs/pybal] - 10https://gerrit.wikimedia.org/r/424581 [12:54:42] (03CR) 10jerkins-bot: [V: 04-1] Fix two typos in bgp.FSM.openReceived [debs/pybal] - 10https://gerrit.wikimedia.org/r/423999 (owner: 10Mark Bergsma) [13:00:41] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 35.71% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [13:01:37] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for atftpd [puppet] - 10https://gerrit.wikimedia.org/r/424584 (https://phabricator.wikimedia.org/T135991) [13:01:55] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for atftpd [puppet] - 10https://gerrit.wikimedia.org/r/424584 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:02:26] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for atftpd [puppet] - 10https://gerrit.wikimedia.org/r/424584 (https://phabricator.wikimedia.org/T135991) [13:02:56] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for atftpd [puppet] - 10https://gerrit.wikimedia.org/r/424584 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:04:10] (03CR) 10Ottomata: "> inside Prometheus' static_config yaml file" [puppet] - 10https://gerrit.wikimedia.org/r/423931 (owner: 10Ottomata) [13:04:35] godog: ^ [13:05:46] (03PS16) 10Elukey: [WIP] burrow: configuration upgrade to support 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/424557 [13:06:59] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#4111935 (10ema) >>! In T187014#4111691, @mbaluta wrote: > If you provided IP address of our server, we could at least tell whether it is comi... 
[13:08:00] (03CR) 10Ottomata: "Prometheus is using to pull together all these yaml files." [puppet] - 10https://gerrit.wikimedia.org/r/423931 (owner: 10Ottomata) [13:08:41] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [13:12:07] (03CR) 10Ottomata: [C: 031] ":)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/424557 (owner: 10Elukey) [13:12:28] ottomata: D: [13:12:30] (03PS17) 10Elukey: [WIP] burrow: configuration upgrade to support 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/424557 [13:12:59] ottomata: the 1.0 tag does not support more than 0.11.0, the devs suggest using 0.11 for the moment, which should be ok [13:13:14] ah ok [13:13:15] cool [13:13:33] (03PS1) 10Gehel: wdqs: new wdqs-internal service [dns] - 10https://gerrit.wikimedia.org/r/424587 (https://phabricator.wikimedia.org/T187766) [13:13:47] I am checking all the newly generated config files but I should be almost done :) [13:14:07] !log installing apache security updates on thorium (running several analytics web services) [13:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:16] (03PS18) 10Elukey: burrow: configuration upgrade to support 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/424557 (https://phabricator.wikimedia.org/T188719) [13:24:09] (03CR) 10Muehlenhoff: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/424584 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:24:20] (03CR) 10Elukey: "Pcc: https://puppet-compiler.wmflabs.org/compiler02/10846/" [puppet] - 10https://gerrit.wikimedia.org/r/424557 (https://phabricator.wikimedia.org/T188719) (owner: 10Elukey) [13:24:40] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for atftpd [puppet] - 10https://gerrit.wikimedia.org/r/424584 (https://phabricator.wikimedia.org/T135991) (owner: 
10Muehlenhoff) [13:33:00] * moritzm sighs at our puppet CI tests [13:33:11] RECOVERY - Check systemd state on labtestweb2001 is OK: OK - running: The system is fully operational [13:36:12] PROBLEM - Check systemd state on labtestweb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:43:49] [Wsd5hwpAAEAAAGzVEucAAABY] 2018-04-06 13:43:35: Fatal exception of type "Wikimedia\Assert\ParameterTypeException" [13:43:56] (03PS4) 10EddieGP: wwwportals: De-duplicate apache vhost code [puppet] - 10https://gerrit.wikimedia.org/r/397770 [13:53:47] (03PS1) 10Andrew Bogott: striker: mask out default uwsgi service [puppet] - 10https://gerrit.wikimedia.org/r/424591 [13:55:09] 10Operations, 10Stewards-and-global-tools, 10Wikimedia-IRC-RC-Server: Create RC feed for login.wikimedia - https://phabricator.wikimedia.org/T191625#4112075 (10MarcoAurelio) Well apparently `#login.wikipedia` exists. Maybe we should either rename the channel or set a redirect. [13:55:53] !log temporarily disabling puppet agents for apache security update on puppet masters [13:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:09] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#4112076 (10ema) @mbaluta: note that the problem I've mentioned in my comment above is probably unrelated to the stats issue discussed here (w... 
[14:00:47] (03PS1) 10Ottomata: Install the Spark 2 yarn shuffle service jar over Spark 1's [puppet] - 10https://gerrit.wikimedia.org/r/424593 (https://phabricator.wikimedia.org/T159962) [14:01:17] (03CR) 10Filippo Giunchedi: [C: 031] "> Patch Set 26:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/423931 (owner: 10Ottomata) [14:01:23] (03CR) 10jerkins-bot: [V: 04-1] Install the Spark 2 yarn shuffle service jar over Spark 1's [puppet] - 10https://gerrit.wikimedia.org/r/424593 (https://phabricator.wikimedia.org/T159962) (owner: 10Ottomata) [14:02:51] (03PS2) 10Ottomata: Install the Spark 2 yarn shuffle service jar over Spark 1's [puppet] - 10https://gerrit.wikimedia.org/r/424593 (https://phabricator.wikimedia.org/T159962) [14:03:30] !log apache updated on puppet masters — re-enabling puppet agents [14:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:01] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/10847/analytics1050.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/424593 (https://phabricator.wikimedia.org/T159962) (owner: 10Ottomata) [14:06:11] (03PS1) 10Muehlenhoff: Remove obsolete Hiera setting [puppet] - 10https://gerrit.wikimedia.org/r/424594 (https://phabricator.wikimedia.org/T188062) [14:06:15] (03PS1) 10Krinkle: navtiming: Remove broken 'rendering' and 'pageSpeed' metrics [puppet] - 10https://gerrit.wikimedia.org/r/424595 (https://phabricator.wikimedia.org/T104902) [14:06:32] (03CR) 10Ema: [C: 04-1] "imagescaler.yaml shouldn't be removed here (it's not varnish stuff)." 
[puppet] - 10https://gerrit.wikimedia.org/r/424552 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff) [14:07:05] !log Running populateArchiveRevId.php for group2 for T191307 [14:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:12] T191307: Run maintenance/populateArchiveRevId.php on all wikis - https://phabricator.wikimedia.org/T191307 [14:08:08] !log upgraded apache on fermium for security updates [14:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:04] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me, one nit." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/424591 (owner: 10Andrew Bogott) [14:10:24] 10Operations, 10Ops-Access-Requests, 10Analytics: Access to stat100x and notebook1003.eqiad.wmnet for Jonas Kress - https://phabricator.wikimedia.org/T191308#4112106 (10herron) [14:11:42] (03PS2) 10Andrew Bogott: striker: mask out default uwsgi service [puppet] - 10https://gerrit.wikimedia.org/r/424591 [14:12:50] (03CR) 10Muehlenhoff: [C: 031] striker: mask out default uwsgi service [puppet] - 10https://gerrit.wikimedia.org/r/424591 (owner: 10Andrew Bogott) [14:13:22] (03CR) 10Andrew Bogott: [C: 032] striker: mask out default uwsgi service [puppet] - 10https://gerrit.wikimedia.org/r/424591 (owner: 10Andrew Bogott) [14:16:22] (03PS1) 10Arturo Borrero Gonzalez: wmcs: monitoring: purge archived metrics every 90 days [puppet] - 10https://gerrit.wikimedia.org/r/424597 (https://phabricator.wikimedia.org/T190512) [14:17:22] (03PS2) 10Arturo Borrero Gonzalez: wmcs: monitoring: purge archived metrics every 90 days [puppet] - 10https://gerrit.wikimedia.org/r/424597 (https://phabricator.wikimedia.org/T190512) [14:18:10] (03CR) 10Arturo Borrero Gonzalez: [C: 032] wmcs: monitoring: purge archived metrics every 90 days [puppet] - 10https://gerrit.wikimedia.org/r/424597 (https://phabricator.wikimedia.org/T190512) (owner: 10Arturo Borrero Gonzalez) [14:18:15] (03CR) 10Imarlier: 
[C: 031] navtiming: Remove broken 'rendering' and 'pageSpeed' metrics [puppet] - 10https://gerrit.wikimedia.org/r/424595 (https://phabricator.wikimedia.org/T104902) (owner: 10Krinkle) [14:32:37] (03PS1) 10Gehel: wdqs: LVS and conftool configuration for new wdqs-internal service [puppet] - 10https://gerrit.wikimedia.org/r/424599 (https://phabricator.wikimedia.org/T187766) [14:32:41] (03PS1) 10Herron: puppetmaster: repool rhodium [puppet] - 10https://gerrit.wikimedia.org/r/424600 [14:34:10] (03CR) 10Herron: [C: 032] puppetmaster: repool rhodium [puppet] - 10https://gerrit.wikimedia.org/r/424600 (owner: 10Herron) [14:35:54] issues with replication? [14:37:01] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-ircd-exporter [puppet] - 10https://gerrit.wikimedia.org/r/424602 (https://phabricator.wikimedia.org/T135991) [14:37:13] any deployment ongoing? [14:37:29] !log repooled rhodium (puppet master backend) [14:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:16] lag start at 14:24 [14:38:32] traffic tripled [14:38:35] (writes) [14:38:38] (03PS1) 10Arturo Borrero Gonzalez: toollabs: apt_pinning: add apache2 package pinning [puppet] - 10https://gerrit.wikimedia.org/r/424603 (https://phabricator.wikimedia.org/T159254) [14:39:22] (03CR) 10Arturo Borrero Gonzalez: [C: 032] toollabs: apt_pinning: add apache2 package pinning [puppet] - 10https://gerrit.wikimedia.org/r/424603 (https://phabricator.wikimedia.org/T159254) (owner: 10Arturo Borrero Gonzalez) [14:42:07] updates on enwiki at 1000x [14:43:15] jynus: group2 wikis are being updated by a script iirc [14:43:49] Hauskatze: which one? 
[14:43:51] 'populateArchiveRevId.php for group2 for T191307' says SAL [14:43:52] T191307: Run maintenance/populateArchiveRevId.php on all wikis - https://phabricator.wikimedia.org/T191307 [14:43:53] " !--log Running populateArchiveRevId.php for group2 for T191307 [14:44:10] anomie: ^lag on codfw [14:44:11] hi eddiegp [14:44:18] hi Hauskatze :) [14:44:25] I trust anomie, then [14:44:35] don't we all? :) [14:44:44] if he is ok with degraded state on codfw, it is ok [14:44:49] but we mat get alerts [14:44:56] ping him about that [14:44:59] bye! [14:45:29] I don't have a say about that or anything [14:45:30] https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&from=1523022325566&to=1523025925566&var-dc=eqiad%20prometheus%2Fops&var-server=db1052&var-port=9104&panelId=33&fullscreen [14:45:42] *may get alerts [14:46:31] I don't like running those on Friday - no deployments are done for a reason [14:47:07] enwiki just finished, FYI. Now the script is doing eswiki. [14:47:08] my advice would be to tune down the alerts or include the waitforslaves to codfw (higher latency) [14:47:13] ^ [14:47:17] nothing else to add [14:47:29] (that only for non-essential writes) [14:47:37] see you! [14:47:54] (03PS19) 10Elukey: burrow: configuration upgrade to support 1.0 [puppet] - 10https://gerrit.wikimedia.org/r/424557 (https://phabricator.wikimedia.org/T188719) [14:48:29] The script does wfWaitForSlaves in between each batch of 100 rows processed. I suppose that probably doesn't wait for cross-DC replication though. 
[14:50:01] PROBLEM - MariaDB Slave Lag: s1 on db2048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.07 seconds [14:50:02] PROBLEM - MariaDB Slave Lag: s1 on db2072 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.14 seconds [14:50:12] PROBLEM - MariaDB Slave Lag: s1 on db2062 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.03 seconds [14:50:22] PROBLEM - MariaDB Slave Lag: s1 on db2069 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.94 seconds [14:50:32] PROBLEM - MariaDB Slave Lag: s1 on db2070 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.94 seconds [14:50:41] PROBLEM - MariaDB Slave Lag: s1 on db2088 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.01 seconds [14:50:46] uh oh [14:50:51] PROBLEM - MariaDB Slave Lag: s1 on db2055 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.32 seconds [14:50:53] as expected [14:50:58] yep [14:51:00] eswiki [14:51:01] PROBLEM - MariaDB Slave Lag: s1 on db2071 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.84 seconds [14:51:01] PROBLEM - MariaDB Slave Lag: s1 on db2085 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.92 seconds [14:51:12] (03PS1) 10Vgutierrez: varnish: varnishxcache post-removal cleanup [puppet] - 10https://gerrit.wikimedia.org/r/424611 (https://phabricator.wikimedia.org/T184942) [14:51:13] needed job, ignore them [14:51:21] The script is on frwiki now [14:51:38] (03CR) 10Vgutierrez: varnish: Remove varnishxcache python daemon (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/421925 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [14:53:21] RECOVERY - MariaDB Slave Lag: s1 on db2062 is OK: OK slave_sql_lag Replication lag: 57.02 seconds [14:53:31] RECOVERY - MariaDB Slave Lag: s1 on db2069 is OK: OK slave_sql_lag Replication lag: 30.46 seconds [14:53:32] RECOVERY - MariaDB Slave Lag: s1 on db2070 is OK: OK slave_sql_lag Replication lag: 16.00 seconds [14:53:41] RECOVERY - MariaDB Slave Lag: s1 on db2088 is 
OK: OK slave_sql_lag Replication lag: 0.00 seconds [14:53:51] RECOVERY - MariaDB Slave Lag: s1 on db2055 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [14:53:56] I don't think MediaWiki running from eqiad (terbium) even knows that the codfw servers exist to try to wait for their lag. [14:54:01] RECOVERY - MariaDB Slave Lag: s1 on db2071 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [14:54:01] RECOVERY - MariaDB Slave Lag: s1 on db2085 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [14:54:02] RECOVERY - MariaDB Slave Lag: s1 on db2048 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [14:54:11] RECOVERY - MariaDB Slave Lag: s1 on db2072 is OK: OK slave_sql_lag Replication lag: 0.23 seconds [14:54:31] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: pybal 1.15.2 dies with obscure errors without python-prometheus-client - https://phabricator.wikimedia.org/T190527#4112273 (10Vgutierrez) The issue has been solved on pybal 1.15.3 available for stretch [14:55:07] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: pybal 1.15.2 dies with obscure errors without python-prometheus-client - https://phabricator.wikimedia.org/T190527#4112274 (10Vgutierrez) 05Open>03Resolved a:03Vgutierrez [15:17:25] (03PS10) 10Rduran: Create tests skeleton [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/420746 [15:17:27] (03PS3) 10Rduran: Make WMFMariaDB.py flake8 compliant [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/424558 [15:18:13] 10Operations, 10Collaboration-Team-Triage, 10DBA, 10StructuredDiscussions, 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610#4112346 (10Anomie) This is probably repeating what everyone already knows, but just in case I believe... 
[15:19:18] (03PS1) 10Elukey: Release Burrow 1.0 [debs/burrow] (debian) - 10https://gerrit.wikimedia.org/r/424615 (https://phabricator.wikimedia.org/T188719) [15:21:00] (03PS10) 10Rduran: Refactor and test the main OSC run method [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/421340 [15:22:52] (03PS2) 10Rduran: Add tests for the argument parsing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/424560 [15:27:30] (03CR) 10Rduran: "This should be ready to be merged now. I added a README.md with instructions." [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/420746 (owner: 10Rduran) [15:28:33] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#4112360 (10Nuria) @ema: > I think we should try to debug the code that sets Country to "United States" for User-Agent: ~ "Opera Mini" and see... [15:39:38] (03PS1) 10Gilles: $wgLocalFileRepo definition is DC-dependent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424618 (https://phabricator.wikimedia.org/T191643) [15:46:56] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, to be merged/tested during an appropriate window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424618 (https://phabricator.wikimedia.org/T191643) (owner: 10Gilles) [15:47:22] 10Operations, 10ops-eqiad, 10User-Joe: Decommission mw1201-mw1220 - https://phabricator.wikimedia.org/T185004#4112413 (10RobH) [16:08:29] (03CR) 10Ottomata: "LGTM, but add a debian/README.Debian like https://github.com/wikimedia/operations-debs-spark2/blob/debian/debian/README.Debian with instru" [debs/burrow] (debian) - 10https://gerrit.wikimedia.org/r/424615 (https://phabricator.wikimedia.org/T188719) (owner: 10Elukey) [16:10:09] (03CR) 10Elukey: "> LGTM, but add a debian/README.Debian like https://github.com/wikimedia/operations-debs-spark2/blob/debian/debian/README.Debian" [debs/burrow] (debian) - 
10https://gerrit.wikimedia.org/r/424615 (https://phabricator.wikimedia.org/T188719) (owner: 10Elukey) [16:21:21] PROBLEM - High CPU load on API appserver on mw1282 is CRITICAL: CRITICAL - load average: 67.50, 24.83, 16.39 [16:22:08] (03PS1) 10Rush: openstack: linux bridge agent physical mappings [puppet] - 10https://gerrit.wikimedia.org/r/424621 (https://phabricator.wikimedia.org/T188266) [16:22:21] RECOVERY - High CPU load on API appserver on mw1282 is OK: OK - load average: 33.93, 23.30, 16.42 [16:22:28] (03PS2) 10Rush: openstack: linux bridge agent physical mappings [puppet] - 10https://gerrit.wikimedia.org/r/424621 (https://phabricator.wikimedia.org/T188266) [16:24:25] (03CR) 10Rush: "labtestvirt2003.codfw.wmnet,labtestneutron2001.codfw.wmnet,labtestmetal2001.codfw.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/424621 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [16:25:55] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#4112513 (10RobH) [16:26:03] commented on wrong task =P [16:26:16] 10Operations, 10ops-eqiad, 10User-Joe: Decommission mw1201-mw1220 - https://phabricator.wikimedia.org/T185004#3902917 (10RobH) port info: asw-a-eqiad: ge-7/0/33 up up mw1204 ge-7/0/34 up up mw1205 ge-7/0/35 up up mw1206 ge-7/0/36 up up mw1207 asw-b-eqiad: ge-7/0... 
[16:27:28] (03CR) 10Rush: [V: 032 C: 032] openstack: linux bridge agent physical mappings [puppet] - 10https://gerrit.wikimedia.org/r/424621 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [16:41:27] (03PS1) 10RobH: decom mw1201-1220 prod dns entries [dns] - 10https://gerrit.wikimedia.org/r/424623 (https://phabricator.wikimedia.org/T185004) [16:43:52] (03PS1) 10RobH: decom mw1201-1220 puppet repo entries [puppet] - 10https://gerrit.wikimedia.org/r/424624 (https://phabricator.wikimedia.org/T185004) [16:46:37] (03CR) 10RobH: [C: 032] decom mw1201-1220 prod dns entries [dns] - 10https://gerrit.wikimedia.org/r/424623 (https://phabricator.wikimedia.org/T185004) (owner: 10RobH) [16:47:02] (03CR) 10RobH: [C: 032] decom mw1201-1220 puppet repo entries [puppet] - 10https://gerrit.wikimedia.org/r/424624 (https://phabricator.wikimedia.org/T185004) (owner: 10RobH) [16:49:48] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Joe: Decommission mw1201-mw1220 - https://phabricator.wikimedia.org/T185004#4112583 (10RobH) [16:51:04] 10Operations, 10ops-eqiad, 10hardware-requests, 10User-Joe: Decommission mw1201-mw1220 - https://phabricator.wikimedia.org/T185004#3902917 (10RobH) So this was ignored for a very long time, due to the fact it lacked #hardware-requests. **ALL decom/reclaim tasks should have #hardware-requests as a project.** [16:52:14] 10Operations, 10ops-eqiad, 10hardware-requests, 10User-Joe: Decommission mw1201-mw1220 - https://phabricator.wikimedia.org/T185004#4112589 (10RobH) These are all now ready for disk wipe. 
[16:54:08] (03PS2) 10Elukey: Release Burrow 1.0 [debs/burrow] (debian) - 10https://gerrit.wikimedia.org/r/424615 (https://phabricator.wikimedia.org/T188719) [16:55:39] 10Operations, 10Ops-Access-Requests, 10Analytics: Access to stat100x and notebook1003.eqiad.wmnet for Jonas Kress - https://phabricator.wikimedia.org/T191308#4112615 (10herron) p:05Triage>03Normal [16:56:17] 10Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: grant thcipriani RelEng root on contint1001 - https://phabricator.wikimedia.org/T191453#4112621 (10herron) p:05Triage>03Normal [16:57:01] (03PS3) 10Nuria: Moving sqooping of mediawiki to the 5th of month [puppet] - 10https://gerrit.wikimedia.org/r/424473 [16:57:25] (03CR) 10jerkins-bot: [V: 04-1] Moving sqooping of mediawiki to the 5th of month [puppet] - 10https://gerrit.wikimedia.org/r/424473 (owner: 10Nuria) [16:57:31] (03CR) 10Nuria: "Added changes to alarms: https://gerrit.wikimedia.org/r/#/c/424626/" [puppet] - 10https://gerrit.wikimedia.org/r/424473 (owner: 10Nuria) [16:58:04] (03PS4) 10Nuria: Moving sqooping of mediawiki to the 5th of month [puppet] - 10https://gerrit.wikimedia.org/r/424473 [16:58:10] (03CR) 10jerkins-bot: [V: 04-1] Moving sqooping of mediawiki to the 5th of month [puppet] - 10https://gerrit.wikimedia.org/r/424473 (owner: 10Nuria) [17:32:25] (03PS3) 10Dzahn: misc_static_sites: temp disable bromine backend for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/423580 (https://phabricator.wikimedia.org/T18863) [17:34:09] 10Operations, 10Puppet: uwsgi::app sorts config keys, but the .ini file behavior depends on order - https://phabricator.wikimedia.org/T191648#4112722 (10Andrew) [17:37:55] 10Operations, 10Traffic, 10Patch-For-Review: Planning for phasing out non-Forward-Secret TLS ciphers - https://phabricator.wikimedia.org/T118181#4112738 (10MoritzMuehlenhoff) [17:38:00] 10Operations, 10Traffic: Remove 3DES patch from OpenSSL builds - 
https://phabricator.wikimedia.org/T180792#4112736 (10MoritzMuehlenhoff) 05Open>03Resolved This was resolved in the latest update of our OpenSSL 1.1 packages for jessie-wikimedia [17:42:06] 10Operations, 10Puppet: uwsgi::app sorts config keys, but the .ini file behavior depends on order - https://phabricator.wikimedia.org/T191648#4112743 (10Andrew) I can only hope that I'm misdiagnosing this... I'd love it if someone else would reproduce this. [17:44:54] (03CR) 10Bstorm: [C: 032] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/424166 (https://phabricator.wikimedia.org/T191387) (owner: 10BryanDavis) [17:45:06] (03PS3) 10Bstorm: wiki replicas: drop views with missing tables [puppet] - 10https://gerrit.wikimedia.org/r/424166 (https://phabricator.wikimedia.org/T191387) (owner: 10BryanDavis) [17:47:51] (03CR) 10Krinkle: Remove obsolete Hiera setting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/424594 (https://phabricator.wikimedia.org/T188062) (owner: 10Muehlenhoff) [17:52:39] (03PS1) 10Arturo Borrero Gonzalez: admin: add /home/aborrero/ personal config [puppet] - 10https://gerrit.wikimedia.org/r/424636 [17:55:01] (03CR) 10Arturo Borrero Gonzalez: [C: 032] admin: add /home/aborrero/ personal config [puppet] - 10https://gerrit.wikimedia.org/r/424636 (owner: 10Arturo Borrero Gonzalez) [17:55:59] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [17:58:39] PROBLEM - puppet last run on analytics1061 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): File[/home/aborrero/.bash_profile],File[/home/aborrero/.config/liquidpromptrc],File[/home/aborrero/.liquidprompt/liquidprompt] [17:59:03] ??? [17:59:09] PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/home/aborrero/.liquidprompt/liquidprompt] [17:59:29] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): File[/home/aborrero/.bash_profile],File[/home/aborrero/.bashrc],File[/home/aborrero/.config/liquidpromptrc] [17:59:34] wat [18:00:23] let me look at one of those [18:01:15] I'm at mw1319 [18:01:35] watching syslog there too and on stat1005 running puppet [18:01:54] it's creating some dot files for you and is fine [18:02:18] must have been some race that just happens on one in 500 [18:02:24] and is fixed on next run [18:02:50] stat1005 is the same.. it created the liquidpromptrc and finished without issues [18:02:55] 10Operations, 10Puppet: uwsgi::app sorts config keys, but the .ini file behavior depends on order - https://phabricator.wikimedia.org/T191648#4112722 (10bd808) I wonder if the specific ordering issue is the `callable` and `plugins` lines? [18:03:01] yeah [18:03:07] weird, isn't it? [18:03:38] yea, but it's also just the scale of it. 2 in > 2000 [18:04:04] we sometimes see it after changes that affect everything [18:04:09] RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:04:11] maybe when icinga checks while the puppet run is ongoing [18:04:29] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:04:53] well, not just icinga. 
I jumped to one server, and the file wasn't there (my bashrc complained), so half of the patch was applied :-P [18:09:40] (03PS1) 10Andrew Bogott: uwsgi: mangle .ini template to put plugin settings on the top [puppet] - 10https://gerrit.wikimedia.org/r/424638 (https://phabricator.wikimedia.org/T191648) [18:13:23] thanks mutante BTW :-P [18:14:15] (03PS2) 10Andrew Bogott: uwsgi: mangle .ini template to put plugin settings on the top [puppet] - 10https://gerrit.wikimedia.org/r/424638 (https://phabricator.wikimedia.org/T191648) [18:28:39] RECOVERY - puppet last run on analytics1061 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:36:09] (03PS1) 10Ayounsi: Logstash: Add initial network syslog parsing [puppet] - 10https://gerrit.wikimedia.org/r/424643 [18:38:19] PROBLEM - Apache HTTP on mw2144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:39:04] (03CR) 10Ayounsi: "Tested as much as possible with: https://grokconstructor.appspot.com/do/match" [puppet] - 10https://gerrit.wikimedia.org/r/424643 (owner: 10Ayounsi) [18:39:10] RECOVERY - Apache HTTP on mw2144 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.120 second response time [18:40:30] hey arturo out of curiosity which system did you run puppet-merge on for that bashrc change? [18:41:30] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge. 
[18:43:35] (03PS1) 10Chad: scap clean: Use --delete-excluded [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424645 (https://phabricator.wikimedia.org/T157030) [18:44:28] that unmerged changed on rhodium alert I think was related, but curious if there was any error syncing there during the merge or if it happened silently [18:45:55] herron: I think he is away for the weekend now [18:45:56] fyi [18:47:40] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.26 [keeping static files] (duration: 01m 51s) [18:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:00] !log demon@tin scap failed: average error rate on 5/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/2cc7028226a539553178454fc2f14459 for details) [18:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:18] thx chasemp. ok I’m going to depool rhodium again as a precaution. puppet-merge looked to have been working (at least from puppetmaster1001) but I don’t want to risk having it go out of sync again as we go into the weekend [18:50:07] thcipriani: Well crap ^^ [18:50:47] that's no good [18:50:49] PROBLEM - Nginx local proxy to apache on mw1279 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1884 bytes in 0.038 second response time [18:50:50] PROBLEM - Nginx local proxy to apache on mwdebug1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1889 bytes in 0.044 second response time [18:50:50] PROBLEM - Nginx local proxy to apache on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1884 bytes in 0.039 second response time [18:50:54] (03PS1) 10Herron: Revert "puppetmaster: repool rhodium" [puppet] - 10https://gerrit.wikimedia.org/r/424646 [18:50:59] PROBLEM - Nginx local proxy to apache on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1884 bytes in 0.037 second response time [18:50:59] PROBLEM - 
HHVM rendering on mw1262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1884 bytes in 0.036 second response time [18:51:00] PROBLEM - Apache HTTP on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1884 bytes in 0.077 second response time [18:51:05] That's the canaries ^ [18:51:09] PROBLEM - Apache HTTP on mw1264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1884 bytes in 0.029 second response time [18:51:09] PROBLEM - Apache HTTP on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1884 bytes in 0.033 second response time [18:51:09] PROBLEM - Apache HTTP on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1889 bytes in 0.036 second response time [18:51:09] PROBLEM - Apache HTTP on mwdebug1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1889 bytes in 0.048 second response time [18:51:09] PROBLEM - HHVM rendering on mwdebug1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1889 bytes in 0.057 second response time [18:51:09] PROBLEM - Apache HTTP on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1884 bytes in 0.034 second response time [18:51:10] PROBLEM - HHVM rendering on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1884 bytes in 0.032 second response time [18:51:10] PROBLEM - Nginx local proxy to apache on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1884 bytes in 0.045 second response time [18:51:10] phew, ok [18:51:11] PROBLEM - Apache HTTP on mw1262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1884 bytes in 0.035 second response time [18:51:12] (03PS2) 10Herron: Revert "puppetmaster: repool rhodium" [puppet] - 10https://gerrit.wikimedia.org/r/424646 [18:51:17] (03CR) 10jerkins-bot: [V: 04-1] Revert "puppetmaster: repool rhodium" [puppet] - 10https://gerrit.wikimedia.org/r/424646 (owner: 10Herron) [18:51:19] PROBLEM - 
HHVM rendering on mw1264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1884 bytes in 0.035 second response time [18:51:19] PROBLEM - HHVM rendering on mw1278 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1884 bytes in 0.031 second response time [18:51:19] PROBLEM - Apache HTTP on mw1278 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1884 bytes in 0.028 second response time [18:51:19] PROBLEM - Apache HTTP on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1884 bytes in 0.035 second response time [18:51:20] PROBLEM - Nginx local proxy to apache on mw1264 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1884 bytes in 0.041 second response time [18:51:29] PROBLEM - Nginx local proxy to apache on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1884 bytes in 0.092 second response time [18:51:29] !log demon@tin Started scap: Forcing full scap, removed clean plugin updates [18:51:29] PROBLEM - Nginx local proxy to apache on mw1277 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1884 bytes in 0.041 second response time [18:51:29] PROBLEM - Nginx local proxy to apache on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1889 bytes in 0.044 second response time [18:51:29] PROBLEM - HHVM rendering on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1884 bytes in 0.032 second response time [18:51:29] PROBLEM - HHVM rendering on mw1276 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1884 bytes in 0.035 second response time [18:51:29] PROBLEM - HHVM rendering on mw1279 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1884 bytes in 0.039 second response time [18:51:30] PROBLEM - HHVM rendering on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1884 bytes in 0.045 second response time [18:51:30] PROBLEM - Apache HTTP on mw1277 is CRITICAL: 
HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1884 bytes in 0.031 second response time [18:51:31] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1889 bytes in 0.038 second response time [18:51:31] PROBLEM - Nginx local proxy to apache on mw1262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1884 bytes in 0.043 second response time [18:51:32] PROBLEM - Apache HTTP on mw1279 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1884 bytes in 0.035 second response time [18:51:32] PROBLEM - Nginx local proxy to apache on mw1278 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1884 bytes in 0.044 second response time [18:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:39] (03CR) 10jerkins-bot: [V: 04-1] Revert "puppetmaster: repool rhodium" [puppet] - 10https://gerrit.wikimedia.org/r/424646 (owner: 10Herron) [18:51:40] PROBLEM - HHVM rendering on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1884 bytes in 0.031 second response time [18:52:10] thcipriani: I'm afraid of what I might have done to i18n cache. [18:52:18] Doing a full scap to be safe. [18:52:33] (03PS3) 10Herron: Revert "puppetmaster: repool rhodium" [puppet] - 10https://gerrit.wikimedia.org/r/424646 [18:53:16] makes sense [18:53:28] (03CR) 10Herron: [C: 032] Revert "puppetmaster: repool rhodium" [puppet] - 10https://gerrit.wikimedia.org/r/424646 (owner: 10Herron) [18:54:29] Wow. Crap. It does indeed hose all of the *.cdb files [18:54:33] This ain't right at all! [18:54:57] demon@tin /srv/mediawiki/php/cache/l10n (master)$ ls [18:54:57] upstream/ [18:55:00] yeah, l10n cache on mwdebug1002 is sans cdb files [18:55:07] I just noticed [18:55:28] demon@mw1262 /srv/mediawiki/php-1.31.0-wmf.28/cache/l10n$ ls [18:55:29] upstream/ [18:55:32] Same. 
[18:55:37] Yeah, this is /wrong/ [18:55:39] (03CR) 10Dzahn: [C: 032] misc_static_sites: temp disable bromine backend for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/423580 (https://phabricator.wikimedia.org/T18863) (owner: 10Dzahn) [18:55:41] I'm not landing this in master. [18:56:09] (03PS4) 10Dzahn: misc_static_sites: temp disable bromine backend for reinstall [puppet] - 10https://gerrit.wikimedia.org/r/423580 (https://phabricator.wikimedia.org/T18863) [18:56:13] /15/14 [18:56:30] (03CR) 10Chad: [C: 04-2] "Broke everything. *.cdb files hosed on all targets including the non-specified branches." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424645 (https://phabricator.wikimedia.org/T157030) (owner: 10Chad) [18:57:11] thcipriani: Actually, I might have to do a scap pull on each of the affected hosts....we won't get to the scap-rebuild-cdbs stage [18:57:20] so the .cdb files are somehow on the excluded list? [18:57:27] i see the --delete-excluded part [18:57:47] yeah, cdb files are excluded from rsync since they don't rsync well [18:57:48] They're supposed to be on the excluded list [18:57:57] But here, we want to delete them after all! [18:58:12] But there's no such thing as --delete-excluded-except-this-no-really-exclude-this [18:58:22] (03CR) 10Krinkle: [C: 04-1] $wgLocalFileRepo definition is DC-dependent (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424618 (https://phabricator.wikimedia.org/T191643) (owner: 10Gilles) [18:58:26] (03CR) 10Krinkle: [C: 04-1] "The loop in which the code is placed must only be used to add config, not set config. The current patch appears to overwrite LocalFileRepo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424618 (https://phabricator.wikimedia.org/T191643) (owner: 10Gilles) [18:58:28] hmm.. 
ack [18:58:48] (03CR) 10Krinkle: [C: 04-1] $wgLocalFileRepo definition is DC-dependent (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424618 (https://phabricator.wikimedia.org/T191643) (owner: 10Gilles) [18:58:49] I guess I need to target --include a little narrower for clean? I was kinda just relying on a "sync everything" but I can see that as problematic for reasons that are now all too obvious. [18:59:50] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:00:19] ^ alert reacts to canary issues but maybe should not? [19:00:21] Yes yes, I'm aware icinga [19:00:30] It should, canaries get real traffic too [19:00:38] !log depooled rhodium (puppet master backend) again https://gerrit.wikimedia.org/r/#/c/424646/ [19:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:36] Yep, and if all canaries are 500'ing, that's globally sufficient to trigger the mw fatal graphite metric alerting. [19:01:59] * Krinkle runs [19:02:32] !log demon@tin scap aborted: Forcing full scap, removed clean plugin updates (duration: 11m 03s) [19:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:22] why aborted? 
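(Editor's note on the `--delete-excluded` exchange above: with rsync's plain `--delete`, receiver-side files matching an exclude pattern are protected from deletion, while `--delete-excluded` removes them too. The Python sketch below is only a simplified model of that rule, not scap's code. The function name, the file names and the `*.cdb` pattern are illustrative assumptions, and real rsync filter rules are richer than `fnmatch`.)

```python
from fnmatch import fnmatch

def files_deleted_on_receiver(sender, receiver, excludes, delete_excluded=False):
    """Return receiver-side files a simplified rsync --delete run would remove."""
    def excluded(path):
        # Simplification: match exclude patterns against the basename only.
        name = path.rsplit("/", 1)[-1]
        return any(fnmatch(name, pat) for pat in excludes)

    # Excluded files are never transferred, so for deletion purposes the
    # sender behaves as if it did not have them.
    transferred = {p for p in sender if not excluded(p)}
    deleted = set()
    for path in receiver:
        if path in transferred:
            continue
        if excluded(path) and not delete_excluded:
            continue  # plain --delete leaves excluded files alone
        deleted.add(path)
    return deleted

# Hypothetical file sets: the .cdb file exists only on the receiver.
sender = {"cache/l10n/upstream/l10n_cache-en.cdb.json"}
receiver = sender | {"cache/l10n/l10n_cache-en.cdb", "stale.txt"}

print(sorted(files_deleted_on_receiver(sender, receiver, ["*.cdb"])))
print(sorted(files_deleted_on_receiver(sender, receiver, ["*.cdb"], delete_excluded=True)))
```

In this model the first call removes only `stale.txt`, while the second also removes the locally built `.cdb` file, which matches what was seen on the canaries.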
[19:03:49] RECOVERY - HHVM rendering on mw1263 is OK: HTTP OK: HTTP/1.1 200 OK - 80865 bytes in 0.314 second response time [19:03:50] RECOVERY - Nginx local proxy to apache on mwdebug1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 622 bytes in 0.069 second response time [19:03:59] RECOVERY - Nginx local proxy to apache on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.026 second response time [19:03:59] RECOVERY - Nginx local proxy to apache on mw1265 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.044 second response time [19:03:59] RECOVERY - HHVM rendering on mw1262 is OK: HTTP OK: HTTP/1.1 200 OK - 80863 bytes in 0.089 second response time [19:04:09] RECOVERY - Apache HTTP on mw1264 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.031 second response time [19:04:09] RECOVERY - Apache HTTP on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.028 second response time [19:04:09] RECOVERY - Apache HTTP on mwdebug1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 621 bytes in 0.056 second response time [19:04:09] RECOVERY - HHVM rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 200 OK - 80875 bytes in 0.148 second response time [19:04:10] RECOVERY - Apache HTTP on mw1265 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.038 second response time [19:04:10] RECOVERY - Nginx local proxy to apache on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.039 second response time [19:04:19] RECOVERY - Apache HTTP on mw1262 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.033 second response time [19:04:19] RECOVERY - HHVM rendering on mw1264 is OK: HTTP OK: HTTP/1.1 200 OK - 80864 bytes in 0.102 second response time [19:04:19] RECOVERY - Apache HTTP on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.044 second response time [19:04:29] RECOVERY - Nginx local proxy to apache on mw1264 is OK: HTTP OK: HTTP/1.1 301 Moved 
Permanently - 617 bytes in 0.041 second response time [19:04:30] RECOVERY - HHVM rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 80864 bytes in 0.104 second response time [19:04:30] RECOVERY - HHVM rendering on mw1265 is OK: HTTP OK: HTTP/1.1 200 OK - 80863 bytes in 0.086 second response time [19:04:30] RECOVERY - Nginx local proxy to apache on mw1262 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.027 second response time [19:04:50] thcipriani: Well it wasn't going to succeed anyway [19:05:01] Because we sync with --no-update-l10n [19:05:05] Then we rebuild at the end [19:05:09] oh right [19:05:29] RECOVERY - Nginx local proxy to apache on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 622 bytes in 0.060 second response time [19:05:30] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 80875 bytes in 0.174 second response time [19:05:58] so for clean I think we just need to set self.include to the relatvie path of the branch being cleaned [19:06:09] RECOVERY - Apache HTTP on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 621 bytes in 0.052 second response time [19:07:13] although that theory bears some testing first [19:08:02] Totally unrelated, but it bugged me finally: https://phabricator.wikimedia.org/D1024 [19:08:56] ah yeah, nice [19:09:09] RECOVERY - Apache HTTP on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.057 second response time [19:09:20] RECOVERY - HHVM rendering on mw1278 is OK: HTTP OK: HTTP/1.1 200 OK - 80865 bytes in 0.187 second response time [19:09:20] RECOVERY - Apache HTTP on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.162 second response time [19:09:25] idk if we should reuse that timer name or not, but *shrug* [19:09:27] Otherwise trivial [19:09:29] RECOVERY - Nginx local proxy to apache on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.037 second response time [19:09:29] RECOVERY - Nginx 
local proxy to apache on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.028 second response time [19:09:30] RECOVERY - HHVM rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 80863 bytes in 0.084 second response time [19:09:30] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 80864 bytes in 0.106 second response time [19:09:30] RECOVERY - Apache HTTP on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.029 second response time [19:09:39] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.036 second response time [19:09:39] RECOVERY - Nginx local proxy to apache on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.028 second response time [19:09:50] RECOVERY - Nginx local proxy to apache on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.045 second response time [19:10:10] RECOVERY - HHVM rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 80863 bytes in 0.096 second response time [19:11:52] !log demon@tin Started scap: Forcing full scap. Mostly no-op, consistency, paranoia, that sort of thing [19:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:07] !log wiki replicas: ran maintain-views --database mediawikiwiki --clean on labsdb10{09,10,11} for T191387 [19:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:13] T191387: Drop unused views for: flaggedrevs tables from mediawikiwiki_p - https://phabricator.wikimedia.org/T191387 [19:14:59] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:19:25] (03CR) 10Dzahn: [C: 032] "linked to wrong ticket. 
correct is Bug: T188163" [puppet] - 10https://gerrit.wikimedia.org/r/423580 (https://phabricator.wikimedia.org/T18863) (owner: 10Dzahn) [19:20:35] 10Operations, 10Availability, 10Patch-For-Review: create codfw-equivalent of bromine, make webserver_misc_static active/active in misc varnish - https://phabricator.wikimedia.org/T188163#4112998 (10Dzahn) https://gerrit.wikimedia.org/r/#/c/423580/ [19:21:05] (03PS1) 10Dzahn: rsync bugzilla-static content from bromine to vega [puppet] - 10https://gerrit.wikimedia.org/r/424657 (https://phabricator.wikimedia.org/T188163) [19:21:28] 10Operations, 10ops-codfw, 10Traffic: cp2008 memory replacement - https://phabricator.wikimedia.org/T191224#4113004 (10Papaul) Hello Papaul, Thank you for sharing the log. I am currently in Training, however I got a chance to look at the TSR and analyzed it. We do see that the firmware on the ser... [19:23:44] !log demon@tin Finished scap: Forcing full scap. Mostly no-op, consistency, paranoia, that sort of thing (duration: 11m 51s) [19:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:43] (03PS2) 10Bstorm: wiki replicas: Remove localisation and localisation_file_hash views [puppet] - 10https://gerrit.wikimedia.org/r/424168 (https://phabricator.wikimedia.org/T119811) (owner: 10BryanDavis) [19:28:15] (03CR) 10Bstorm: [C: 032] wiki replicas: Remove localisation and localisation_file_hash views [puppet] - 10https://gerrit.wikimedia.org/r/424168 (https://phabricator.wikimedia.org/T119811) (owner: 10BryanDavis) [19:30:00] (03PS27) 10Ottomata: Set jmx exporter instance common labels at host level [puppet] - 10https://gerrit.wikimedia.org/r/423931 [19:31:43] (03CR) 10Ottomata: "Ok I've removed host and process_number for now, would like to talk to you more about that though!" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/423931 (owner: 10Ottomata) [19:31:59] (03PS28) 10Ottomata: Set jmx exporter instance common labels at host level [puppet] - 10https://gerrit.wikimedia.org/r/423931 [19:33:01] (03CR) 10Ottomata: [C: 032] Set jmx exporter instance common labels at host level [puppet] - 10https://gerrit.wikimedia.org/r/423931 (owner: 10Ottomata) [19:33:14] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#4113030 (10Nuria) >number of Opera Mini users in US is far far below India, Indonesia and Nigeria. Note these are "pageviews", not users. @... [19:36:01] 10Puppet, 10Beta-Cluster-Infrastructure: deployment-etcd-01 puppet errors - https://phabricator.wikimedia.org/T191107#4113032 (10EddieGP) @Joe, as the creator of this instance, do you know (a) whether it's still needed and (b) if yes, what value should be set here? [19:38:19] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [19:39:23] hmm [19:39:30] i saw that when i puppet-merged, wasn't sure what it was talking about [19:39:32] looking [19:40:51] hey bd808 [19:40:54] yt? [19:41:03] did our patches collide somehow? [19:41:40] ottomata: bstorm_ was merging for me if there was a puppet repo problem [19:42:05] it looks like your wiki replicas: Remove localisation patch was submitted/merged in gerrit, right after i did puppet-merge [19:42:13] so, on puppetmaster1001, it was fine [19:42:31] but as all the subordinate puppet masters then had puppet-merge triggered [19:42:33] yours was pulled in [19:42:39] and failed because it was from multiple committers [19:43:03] I actually don't know too much about the other puppetmaster setups [19:43:09] is it safe to run puppet merge on rhodiuM? [19:43:11] i'm going to... 
[19:43:37] !log running puppet-merge on rhodium after clash between puppet-merge and new patch submitted [19:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:49] hm nope [19:45:53] herron: yt? [19:46:14] Heh... [19:46:23] Author: Herron [19:46:23] Date: Fri Apr 6 18:50:53 2018 +0000 [19:46:23] Revert "puppetmaster: repool rhodium" [19:46:26] somethign else is going on there [19:46:44] I don't actually know how to fix that sort of a mess. [19:46:48] https://gerrit.wikimedia.org/r/#/c/424646/ [19:47:00] it looks like something else is going on, we might not have caused it [19:47:09] but puppet-merge might be a little broken atm [19:47:35] Fair [19:47:50] Good to know [19:48:33] ottomata: hey what’s up? [19:50:42] this happened earlier today so I depooled rhodium to keep it from serving an out of sync puppet repo [19:50:55] herron: i did a regular puppet-merge [19:50:59] near the end i got [19:51:09] (03PS1) 10Ottomata: Sort jmx exporter targets and labels [puppet] - 10https://gerrit.wikimedia.org/r/424661 [19:51:33] WARNING: Revision range includes commits from multiple committers! [19:51:34] Merging ceaed7e8173d6e6b86da638c34303db9f311656b... [19:51:34] git merge --ff-only ceaed7e8173d6e6b86da638c34303db9f311656b [19:51:34] Updating 3aa44a123f..ceaed7e817 [19:51:34] error: unable to unlink old 'modules/profile/templates/kafka/mirror_maker_prometheus_jmx_exporter.yaml.erb': Permission denied [19:51:34] error: unable to unlink old 'modules/profile/templates/labs/db/views/maintain-views.yaml': Permission denied [19:51:34] Connection to rhodium.eqiad.wmnet closed. [19:51:38] 10Operations, 10Beta-Cluster-Infrastructure, 10HHVM: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#4113102 (10EddieGP) Also (labeled at "UNKNOWN" in openstack browser, but logging in there and looking at /etc/os-release) these are still trusty: - deployment-urldownloader... 
[19:51:47] those files are from my puppet commit, and from bstorm_'s [19:51:50] I ended up with the same, naturally [19:51:56] when I ran puppet-merge, it did not tel me about multiple committers [19:52:07] 10Operations, 10ChangeProp, 10Parsing-Team, 10Parsoid, and 6 others: Check concurrency/retry/timeout limits and syncronize those between services - https://phabricator.wikimedia.org/T152073#4113104 (10ssastry) >>! In T152073#3451813, @fgiunchedi wrote: > Is there anything left to be done here? I had the s... [19:52:13] so, i assume that bstorm's was merged right after I started running puppet-merge [19:52:17] but before it ran on rhodium [19:52:42] not sure though, since it looks like you were having some other (maybe related?) trouble with rhodium? [19:53:03] I had it error on: [19:53:10] https://www.irccloud.com/pastebin/9yidYvic/ [19:53:11] (03CR) 10Ottomata: [C: 032] Sort jmx exporter targets and labels [puppet] - 10https://gerrit.wikimedia.org/r/424661 (owner: 10Ottomata) [19:53:23] In case that helps [19:54:04] 10Operations, 10Parsing-Team, 10HHVM, 10Release-Engineering-Team (Watching / External), 10Wikimedia-Incident: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#4113126 (10Pchelolo) [19:54:10] 10Operations, 10ChangeProp, 10Parsing-Team, 10Parsoid, and 6 others: Check concurrency/retry/timeout limits and syncronize those between services - https://phabricator.wikimedia.org/T152073#4113122 (10Pchelolo) 05Open>03Resolved a:03Pchelolo I guess it can be closed now, there's been no activity here... [19:55:12] huh, yeah for some reason rhodium is not keeping in sync. but not every merge fails [19:55:20] that is helpful thanks [19:55:39] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge. 
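(Editor's note: the out-of-sync condition discussed above comes down to one backend's checkout sitting on a different commit than the others. Below is a minimal sketch of such a comparison, assuming the HEAD of each master has already been collected by some other means. The function name is hypothetical, the sample data reuses the abbreviated SHAs from the merge output above, and this is not how puppet-merge or the Icinga check is actually implemented.)

```python
def out_of_sync(heads, reference):
    """Return the hosts whose HEAD differs from the reference host's HEAD."""
    expected = heads[reference]
    return sorted(host for host, sha in heads.items() if sha != expected)

# Sample data (assumed): rhodium's fast-forward failed, so it stays on the old commit.
heads = {
    "puppetmaster1001": "ceaed7e817",
    "puppetmaster1002": "ceaed7e817",
    "rhodium": "3aa44a123f",
}

print(out_of_sync(heads, "puppetmaster1001"))
```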
[19:55:57] 10Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: grant thcipriani RelEng root on contint1001 - https://phabricator.wikimedia.org/T191453#4113132 (10hashar) @RobH indeed we are looking at adding Tyler to the `contint-roots`group. That grant root acces... [19:56:10] herron: this might be blcoking puppet merges to other hosts [19:56:13] i think rhodium happens early on [19:56:16] and the whole thing fails [19:56:35] the latest puppet-merge I did looks like it only got to puppetmaster1001 and 1002 [19:56:39] rhodium was next, and it failed [19:57:26] hmm I would think those hosts would alert in that case but will double check [19:57:38] 10Operations, 10Puppet, 10Patch-For-Review: uwsgi::app sorts config keys, but the .ini file behavior depends on order - https://phabricator.wikimedia.org/T191648#4113134 (10Andrew) >>! In T191648#4112786, @bd808 wrote: > I wonder if the specific ordering issue is the `callable` and `plugins` lines? I though... [19:59:11] !log moved rhodium:/var/lib/git/operations/puppet away and triggered puppet agent run to re-create [19:59:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:42] 10Operations, 10ops-codfw, 10Traffic: cp2008 memory replacement - https://phabricator.wikimedia.org/T191224#4113142 (10RobH) @Papaul: Please advise to Dell that we saw the error in the logs we provided, and we aren't willing to use the faulty hardware in production without replacement of the memory modules a... [20:01:44] ottomata: you’re right and looks like alerting may be blind to that condition [20:03:08] I ran puppet-merge from an out of master host (after refreshing the repo on rhodium) and didn’t see any errors, and all masters are in sync now [20:12:14] ok great [20:12:15] thanks herron [20:14:05] (03CR) 10Ottomata: "Luca, I merged https://gerrit.wikimedia.org/r/#/c/423931/, so I think you can abandon this." 
[puppet] - 10https://gerrit.wikimedia.org/r/423851 (owner: 10Elukey) [20:16:28] (03PS1) 10Ottomata: Set zookeeper_cluster label for Zookeeper server prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/424664 [20:16:52] (03CR) 10Ottomata: "Merge at will elukey! Or modify! Do yo thang! :)" [puppet] - 10https://gerrit.wikimedia.org/r/424664 (owner: 10Ottomata) [20:39:19] 10Operations, 10Parsing-Team, 10HHVM, 10Release-Engineering-Team (Watching / External), 10Wikimedia-Incident: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#4113237 (10ssastry) Looks like this one can be closed as well since the subtasks as well as parent tasks are resolved. Anythi... [20:40:13] 10Operations, 10Parsing-Team, 10HHVM, 10Release-Engineering-Team (Watching / External), 10Wikimedia-Incident: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#4113250 (10ssastry) 05Open>03Resolved a:03ssastry Feel free to re-open / create a new ticket with anything else left to... [20:56:21] 10Operations, 10Cassandra, 10RESTBase-Cassandra, 10Services (next), 10User-Eevans: Configure a threshold for earlier notification of /srv/cassandra/instance-data - https://phabricator.wikimedia.org/T191659#4113284 (10mobrovac) p:05Low>03High Re-prioritising due to the recent occurrence of this. [20:56:43] bblack: yt? 
[20:57:42] 10Operations, 10Cassandra, 10Discovery, 10Maps, 10User-Eevans: Remove Cassandra 2.2.6 packages from jessie-wikimedia/thirdparty apt repo - https://phabricator.wikimedia.org/T191627#4113287 (10mobrovac) [20:57:59] 10Operations, 10Cassandra, 10Discovery, 10Maps, and 2 others: Remove Cassandra 2.2.6 packages from jessie-wikimedia/thirdparty apt repo - https://phabricator.wikimedia.org/T191627#4111904 (10mobrovac) [21:01:41] 10Operations, 10netops: Juniper HA audit - https://phabricator.wikimedia.org/T191667#4113291 (10ayounsi) p:05Triage>03Normal [21:13:40] 10Operations, 10hardware-requests: Reclaim/Decommission eventlog1001 - https://phabricator.wikimedia.org/T189566#4113314 (10RobH) [21:13:53] 10Operations, 10hardware-requests: Decommission eventlog1001 - https://phabricator.wikimedia.org/T189566#4045706 (10RobH) [21:16:42] 10Operations, 10hardware-requests: Decommission eventlog1001 - https://phabricator.wikimedia.org/T189566#4113327 (10RobH) [21:19:07] (03PS1) 10RobH: decom eventlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/424703 (https://phabricator.wikimedia.org/T189566) [21:20:31] (03PS1) 10RobH: decom eventlog1001 [dns] - 10https://gerrit.wikimedia.org/r/424705 (https://phabricator.wikimedia.org/T189566) [21:20:38] (03CR) 10RobH: [C: 032] decom eventlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/424703 (https://phabricator.wikimedia.org/T189566) (owner: 10RobH) [21:20:58] (03CR) 10jerkins-bot: [V: 04-1] decom eventlog1001 [dns] - 10https://gerrit.wikimedia.org/r/424705 (https://phabricator.wikimedia.org/T189566) (owner: 10RobH) [21:21:41] ahhh eventlogging.eqiad.wmnet points to eventlog1001 =P [21:21:45] ill just point that to 1002 [21:22:18] (03PS2) 10RobH: decom eventlog1001 [dns] - 10https://gerrit.wikimedia.org/r/424705 (https://phabricator.wikimedia.org/T189566) [21:22:25] <3 ci tests. 
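(Editor's note: the CI failure above came from a CNAME still pointing at the host record being removed. Below is a rough sketch of that kind of check, under the assumption that a zone can be reduced to two plain dictionaries. It is not the actual operations/dns CI code; the function name, the fully qualified host names and the IP address are illustrative.)

```python
def dangling_cnames(a_records, cnames):
    """Return CNAMEs whose target is neither an A record nor another CNAME."""
    known = set(a_records) | set(cnames)
    return {alias: target for alias, target in cnames.items() if target not in known}

# Assumed state after removing eventlog1001's A record but before fixing the alias.
a_records = {"eventlog1002.eqiad.wmnet": "192.0.2.2"}  # documentation-range IP, made up
cnames = {"eventlogging.eqiad.wmnet": "eventlog1001.eqiad.wmnet"}

print(dangling_cnames(a_records, cnames))
```

Repointing the alias at a name that still has an A record, as was done in the second patch set, makes the function return an empty dict.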
[21:22:58] 10Operations, 10ops-eqiad, 10hardware-requests: Decommission eventlog1001 - https://phabricator.wikimedia.org/T189566#4113345 (10RobH) a:05RobH>03Cmjohnson [21:23:26] (03CR) 10RobH: [C: 032] decom eventlog1001 [dns] - 10https://gerrit.wikimedia.org/r/424705 (https://phabricator.wikimedia.org/T189566) (owner: 10RobH) [21:29:05] robh it does?!?! [21:29:12] not anymore [21:29:18] robh i'd just remove it, i've never heard of that before! [21:29:18] since 1001 already had all servicces halted [21:29:19] oh well! [21:29:23] it was just a bad reference [21:29:40] oh, well, if you wanna pull it you know better for that perhaps? i just didnt wanna yank without oversight [21:29:41] heh [21:29:49] i just set it to 1002 instead to be safe [21:29:52] yeah, i've never ever heard of it, i can't think of anyting that would use it [21:29:55] ok [21:30:00] though whoever was using it was using a broken fqdn for weeks [21:30:28] i only noticed because our CI checks specifically for that kind of thing (ensuring A references are not removed when there are cnames that point to them) [21:31:31] PROBLEM - HHVM rendering on mw2213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:32:21] RECOVERY - HHVM rendering on mw2213 is OK: HTTP OK: HTTP/1.1 200 OK - 80853 bytes in 0.290 second response time [21:32:39] (03PS1) 10EddieGP: mediawiki: Move www.wikimedia.org to wwwportals.conf [puppet] - 10https://gerrit.wikimedia.org/r/424707 [21:34:55] (03PS1) 10MusikAnimal: Enable PageAssessments on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424709 (https://phabricator.wikimedia.org/T185023) [21:37:58] (03Draft1) 10Paladox: Gerrit: Add url for avatars [puppet] - 10https://gerrit.wikimedia.org/r/424708 (https://phabricator.wikimedia.org/T191183) [21:38:03] (03PS2) 10Paladox: Gerrit: Add url for avatars [puppet] - 10https://gerrit.wikimedia.org/r/424708 (https://phabricator.wikimedia.org/T191183) [21:38:06] no_justification ^^ [21:41:02] (03CR) 10EddieGP: "Pushing 
this mostly to make the wikimedia-portal work in beta. I'm totally unsure on how best to proceed for prod and the *.wikimedia.org " [puppet] - https://gerrit.wikimedia.org/r/424707 (owner: EddieGP)
[21:47:21] (PS1) Paladox: add plugin avatars-external [software/gerrit/gerrit] (wmf/stable-2.14) - https://gerrit.wikimedia.org/r/424710
[21:47:29] Reception123 ^^
[21:48:09] (PS2) Paladox: add plugin avatars-external [software/gerrit/gerrit] (wmf/stable-2.14) - https://gerrit.wikimedia.org/r/424710
[21:49:08] Oh woops
[21:49:14] wrong ping i meant no_justification ^^
[22:03:48] Operations: New ssh key for production - https://phabricator.wikimedia.org/T191673#4113459 (Sharvaniharan)
[22:04:40] (PS3) Paladox: Gerrit: Add url for avatars [puppet] - https://gerrit.wikimedia.org/r/424708 (https://phabricator.wikimedia.org/T191183)
[22:10:25] (PS1) BryanDavis: mwvagrant: add lxc-update-config to sudoer commands [puppet] - https://gerrit.wikimedia.org/r/424712
[22:13:04] (PS2) Dzahn: rsync bugzilla-static content from bromine to vega [puppet] - https://gerrit.wikimedia.org/r/424657 (https://phabricator.wikimedia.org/T188163)
[22:19:38] (CR) Paladox: [C: 1] rsync bugzilla-static content from bromine to vega [puppet] - https://gerrit.wikimedia.org/r/424657 (https://phabricator.wikimedia.org/T188163) (owner: Dzahn)
[22:19:49] (PS3) EBernhardson: Drop query_clicks partitions after 90 days [puppet] - https://gerrit.wikimedia.org/r/419954 (https://phabricator.wikimedia.org/T189845)
[22:20:07] (CR) Dzahn: [C: 2] rsync bugzilla-static content from bromine to vega [puppet] - https://gerrit.wikimedia.org/r/424657 (https://phabricator.wikimedia.org/T188163) (owner: Dzahn)
[22:20:48] robh: are you debugging a DHCP thing? eventlog1001 removal on puppetmaster
[22:21:55] i merged it. 
worst case it will prevent reinstall
[22:22:13] and it said decom
[22:27:01] mutante: it was decom
[22:27:09] so normal, thx for merge
[22:27:13] cool, yw
[22:28:21] PROBLEM - HHVM rendering on mw2214 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:29:11] RECOVERY - HHVM rendering on mw2214 is OK: HTTP OK: HTTP/1.1 200 OK - 80833 bytes in 0.306 second response time
[22:35:57] !log rsyncing bugzilla-static raw html from eqiad to codfw VM
[22:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:52:11] (PS2) Bstorm: mwvagrant: add lxc-update-config to sudoer commands [puppet] - https://gerrit.wikimedia.org/r/424712 (owner: BryanDavis)
[22:53:21] (CR) Bstorm: [C: 2] mwvagrant: add lxc-update-config to sudoer commands [puppet] - https://gerrit.wikimedia.org/r/424712 (owner: BryanDavis)
[22:59:04] (PS1) Dzahn: DHCP: upgrade bromine from jessie to stretch [puppet] - https://gerrit.wikimedia.org/r/424716 (https://phabricator.wikimedia.org/T188163)
[23:00:37] (PS2) Dzahn: DHCP: upgrade bromine from jessie to stretch [puppet] - https://gerrit.wikimedia.org/r/424716 (https://phabricator.wikimedia.org/T188163)
[23:00:48] (CR) Dzahn: [C: 2] DHCP: upgrade bromine from jessie to stretch [puppet] - https://gerrit.wikimedia.org/r/424716 (https://phabricator.wikimedia.org/T188163) (owner: Dzahn)
[23:08:37] (CR) Dzahn: "i'd like to hear Giuseppe's opinion on this. I would say rather use mwdebug1001 for it than have the canary servers different from regular" [puppet] - https://gerrit.wikimedia.org/r/391045 (owner: Hoo man)
[23:09:33] (CR) Dzahn: "i can bring it up in a meeting.. 
until then keep using mwdebug" [puppet] - https://gerrit.wikimedia.org/r/391045 (owner: Hoo man)
[23:11:26] (PS3) Dzahn: Gerrit: symlink in motd.config from deploy repo [puppet] - https://gerrit.wikimedia.org/r/423735 (owner: Chad)
[23:13:24] (CR) Dzahn: [C: 2] "ah, so the motd that is shown in git actions.. not web UI.. gotcha (talked to Paladox). nice" [puppet] - https://gerrit.wikimedia.org/r/423735 (owner: Chad)
[23:16:27] (Draft1) Paladox: Gerrit: change /var/lib/gerrit2/review_site/etc to /var/lib/gerrit2/review_site [puppet] - https://gerrit.wikimedia.org/r/424719
[23:16:29] (Draft2) Paladox: Gerrit: change /var/lib/gerrit2/review_site/etc to /var/lib/gerrit2/review_site [puppet] - https://gerrit.wikimedia.org/r/424719
[23:18:42] (PS3) Dzahn: Gerrit: fix invalid relationship for motd file [puppet] - https://gerrit.wikimedia.org/r/424719 (owner: Paladox)
[23:18:59] (CR) Dzahn: [C: 2] Gerrit: fix invalid relationship for motd file [puppet] - https://gerrit.wikimedia.org/r/424719 (owner: Paladox)
[23:19:15] (CR) Dzahn: [C: 2] "follow-up https://gerrit.wikimedia.org/r/#/c/424719/" [puppet] - https://gerrit.wikimedia.org/r/423735 (owner: Chad)
[23:19:21] PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:21:50] (Draft1) Paladox: Gerrit: Fix invalid relationship [puppet] - https://gerrit.wikimedia.org/r/424720
[23:21:53] (PS2) Paladox: Gerrit: Fix invalid relationship [puppet] - https://gerrit.wikimedia.org/r/424720
[23:23:58] paladox:
[23:24:01] we have: file { '/var/lib/gerrit2':
[23:24:01] If anyone has time to puppet-merge a one-line beta-only change just for the sake of not having it as cherry-pick but properly merged in beta, it'd be appreciated. 
:) https://gerrit.wikimedia.org/r/c/424361/
[23:24:08] yep
[23:24:14] tested https://gerrit.wikimedia.org/r/424720
[23:24:18] which fixed the problem
[23:24:22] Notice: /Stage[main]/Gerrit::Jetty/File[/var/lib/gerrit2/review_site/etc/motd.config]/ensure: created
[23:24:23] paladox: we also have file { '/var/lib/gerrit2/review_site/bin':
[23:24:41] paladox: but we do NOT have /var/lib/gerrit2/review_site
[23:24:48] so the dir above it and the one below it
[23:24:51] but not that
[23:24:52] yep
[23:24:55] that seems wrong
[23:25:07] that was removed in the refactor
[23:25:23] but how about adding that
[23:25:31] instead of changing the relationship
[23:26:11] hmm
[23:26:52] ok
[23:28:17] can we try that on your instance?
[23:28:40] like add another file{} for that and fix the relationship to use that
[23:28:57] and make /var/lib/gerrit2/review_site depend on /var/lib/gerrit2
[23:29:12] (PS3) Paladox: Gerrit: Fix invalid relationship [puppet] - https://gerrit.wikimedia.org/r/424720
[23:29:13] mutante done
[23:29:41] (PS2) EddieGP: mediawiki: Move www.wikimedia.org to wwwportals.conf [puppet] - https://gerrit.wikimedia.org/r/424707 (https://phabricator.wikimedia.org/T173887)
[23:29:52] ACKNOWLEDGEMENT - puppet last run on cobalt is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues daniel_zahn invalid puppet relationship. fixing soon
[23:30:05] tested
[23:30:26] paladox: nice :)
[23:30:31] :)
[23:30:57] eddiegp: you say already cherry-picked.. 
doing it
[23:31:03] (PS3) Dzahn: Beta: Unbreak wwwportals [puppet] - https://gerrit.wikimedia.org/r/424361 (https://phabricator.wikimedia.org/T173887) (owner: EddieGP)
[23:31:19] mutante: Thanks :)
[23:32:05] (CR) Dzahn: [C: 2] Beta: Unbreak wwwportals [puppet] - https://gerrit.wikimedia.org/r/424361 (https://phabricator.wikimedia.org/T173887) (owner: EddieGP)
[23:32:45] eddiegp: thanks for fixing deployment-prep things
[23:33:07] (PS4) Dzahn: Gerrit: Fix invalid relationship [puppet] - https://gerrit.wikimedia.org/r/424720 (owner: Paladox)
[23:33:11] It's fun ;)
[23:35:46] mutante: Do you know if _j.oe_ is on vacation or something? Haven't read anything from him here for a few days, I wanted to annoy him with some code review. ;-)
[23:36:45] eddiegp: yes, i think he is on Easter vacation
[23:37:13] (PS5) Paladox: Gerrit: Fix invalid relationship [puppet] - https://gerrit.wikimedia.org/r/424720
[23:37:40] Ah, okay. So I guess I can try again next week then.
[23:38:35] (PS6) Dzahn: Gerrit: Add missing resource /var/lib/gerrit2/review_site [puppet] - https://gerrit.wikimedia.org/r/424720 (owner: Paladox)
[23:39:41] (PS7) Dzahn: Gerrit: Add missing resource /var/lib/gerrit2/review_site [puppet] - https://gerrit.wikimedia.org/r/424720 (owner: Paladox)
[23:42:19] eddiegp: yep
[23:42:40] paladox: i'll merge like that and we can do another follow-up if we want
[23:42:44] but it'll fix it first
[23:42:53] and thanks
[23:42:56] (CR) Dzahn: [C: 2] Gerrit: Add missing resource /var/lib/gerrit2/review_site [puppet] - https://gerrit.wikimedia.org/r/424720 (owner: Paladox)
[23:43:09] ok
[23:43:20] confirmed the permissions in prod
[23:43:46] :)
[23:43:52] that other thing i mentioned might just be an issue if you apply this on a fresh install and on the first run
[23:44:30] Notice: /Stage[main]/Gerrit::Jetty/File[/var/lib/gerrit2/review_site/etc/motd.config]/ensure: created
[23:44:39] if you wanna test the git clone .. 
[23:44:45] but i know it's an empty file
[23:46:26] git clone is fine
[23:46:28] heh
[23:46:29] ack
[23:46:41] git clone i mean git pull
[23:46:42] :)
[23:47:02] puppet is also running, good for now
[23:47:19] :)
[23:47:28] and there weren't any changes on permissions
[23:49:21] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
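
Editor's note: the fix discussed between 23:24 and 23:43 (add a file{} resource for /var/lib/gerrit2/review_site, make it depend on /var/lib/gerrit2, and point the motd.config relationship at it) might look roughly like the sketch below. This is not the merged change from https://gerrit.wikimedia.org/r/424720. The owner, group, and mode values and the symlink target are assumptions made for illustration; only the three resource paths come from the log.

```puppet
# Illustrative sketch only. owner/group/mode and the link target are
# hypothetical; the real manifest may differ.

# Already present according to the log ("we have: file { '/var/lib/gerrit2':").
file { '/var/lib/gerrit2':
    ensure => directory,
    owner  => 'gerrit2',   # assumed
    group  => 'gerrit2',   # assumed
    mode   => '0755',      # assumed
}

# The resource the log says was missing after the refactor.
file { '/var/lib/gerrit2/review_site':
    ensure  => directory,
    owner   => 'gerrit2',  # assumed
    group   => 'gerrit2',  # assumed
    mode    => '0755',     # assumed
    require => File['/var/lib/gerrit2'],
}

# The motd file now requires a resource that exists in the catalog,
# so compilation no longer fails on an invalid relationship.
file { '/var/lib/gerrit2/review_site/etc/motd.config':
    ensure  => link,
    target  => '/path/to/deploy/repo/etc/motd.config',  # hypothetical path
    require => File['/var/lib/gerrit2/review_site'],
}
```

A `require` that names a resource absent from the catalog makes compilation fail, which matches the "Catalog fetch fail" alert on cobalt at 23:19:21. Puppet also autorequires a managed parent directory for file resources, so the explicit `require` on `/var/lib/gerrit2` is redundant but harmless and keeps the intent visible.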