[00:04:26] PROBLEM - HHVM rendering on mw2219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:05:16] RECOVERY - HHVM rendering on mw2219 is OK: HTTP OK: HTTP/1.1 200 OK - 77790 bytes in 0.301 second response time [01:48:26] PROBLEM - HHVM rendering on mw2223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:49:16] RECOVERY - HHVM rendering on mw2223 is OK: HTTP OK: HTTP/1.1 200 OK - 77662 bytes in 0.514 second response time [02:32:26] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.10) (duration: 06m 28s) [02:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:37:46] 10Operations, 10ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3807707 (10Cmjohnson) I’m not sure if this will help but found this. https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1007082 [03:24:05] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 758.22 seconds [03:56:15] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 143.15 seconds [04:45:31] (03PS1) 10Legoktm: Remove manual firejailing of Score binaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394913 (https://phabricator.wikimedia.org/T181535) [04:47:29] (03CR) 10Legoktm: [C: 04-1] "Not to be merged until Thursday December 7 / when d333ae89df is deployed everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394913 (https://phabricator.wikimedia.org/T181535) (owner: 10Legoktm) [05:02:55] (03PS1) 10Legoktm: mediawiki: Remove Score firejail wrappers [puppet] - 10https://gerrit.wikimedia.org/r/394914 [05:03:35] (03PS2) 10Legoktm: mediawiki: Remove Score firejail wrappers [puppet] - 10https://gerrit.wikimedia.org/r/394914 (https://phabricator.wikimedia.org/T181535) [05:04:19] (03CR) 10Legoktm: [C: 04-1] "Not to be merged until Change-Id: I481d7918d53569d47e30f169a2ad72ed29fb3b87 is deployed" [puppet] - 
10https://gerrit.wikimedia.org/r/394914 (https://phabricator.wikimedia.org/T181535) (owner: 10Legoktm) [05:12:26] PROBLEM - HHVM rendering on mw1294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:13:25] RECOVERY - HHVM rendering on mw1294 is OK: HTTP OK: HTTP/1.1 200 OK - 77746 bytes in 0.161 second response time [05:55:26] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [05:55:36] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [06:08:36] PROBLEM - Check Varnish expiry mailbox lag on cp4021 is CRITICAL: CRITICAL: expiry mailbox lag is 2025739 [06:21:27] !log Deploy alter table on s3 master (db1075) without replication - T174569 [06:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:39] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [06:27:34] (03PS1) 10Marostegui: db-eqiad.php: Depool db1098 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394916 (https://phabricator.wikimedia.org/T178359) [06:28:15] PROBLEM - puppet last run on cp2023 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/etc/rsyslog.d/60-zerofetch.conf] [06:29:04] 10Operations: etcd-mirror failure - https://phabricator.wikimedia.org/T181920#3807837 (10Joe) This is a reoccurrence of an old bug (T162013), basically the issue is we have the md resync happening every first Sunday of the month on the conf2* cluster, and that sometimes causes consensus issues. I assumed that s... [06:30:16] PROBLEM - puppet last run on db2090 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/DigiCert_High_Assurance_CA-3.crt] [06:30:18] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1098 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394916 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [06:31:16] PROBLEM - puppet last run on mw2161 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/furl] [06:32:43] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1098 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394916 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [06:32:59] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1098 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394916 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [06:34:03] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1098 - T178359 (duration: 00m 46s) [06:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:13] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [06:34:39] (03PS2) 10Marostegui: mariadb: Decommission db1044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393747 (https://phabricator.wikimedia.org/T148078) (owner: 10Jcrespo) [06:36:15] (03CR) 10Marostegui: [C: 032] mariadb: Decommission db1044 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/393747 (https://phabricator.wikimedia.org/T148078) (owner: 10Jcrespo) [06:37:32] (03Merged) 10jenkins-bot: mariadb: Decommission db1044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393747 (https://phabricator.wikimedia.org/T148078) (owner: 10Jcrespo) [06:38:48] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Remove db1044 from config as it will be decommissioned - T181696 (duration: 00m 45s) [06:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:58] T181696: Decommission db1044 - https://phabricator.wikimedia.org/T181696 [06:39:46] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Remove db1044 from config as it will be decommissioned - T181696 (duration: 00m 45s) [06:39:48] 10Operations, 10DBA: Decommission db1044 - https://phabricator.wikimedia.org/T181696#3807851 (10Marostegui) [06:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:03] (03CR) 10jenkins-bot: mariadb: Decommission db1044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393747 (https://phabricator.wikimedia.org/T148078) (owner: 10Jcrespo) [06:40:24] !log Stop MySQL on db1098 to clone db1096.s6 - T178359 [06:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:33] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [06:45:29] PROBLEM - mysqld processes on db1098 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [06:50:38] gah! [06:50:40] my fault! 
[06:50:47] I didn't downtime that check, only the replication ones [06:50:48] sorry [06:55:41] (03PS1) 10Marostegui: mariadb: Decommission db1044 [puppet] - 10https://gerrit.wikimedia.org/r/394919 (https://phabricator.wikimedia.org/T181696) [06:56:08] (03PS1) 10Marostegui: s3.hosts: Remove db1044 [software] - 10https://gerrit.wikimedia.org/r/394920 (https://phabricator.wikimedia.org/T181696) [06:56:15] RECOVERY - puppet last run on mw2161 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [06:58:03] (03CR) 10Marostegui: [C: 032] s3.hosts: Remove db1044 [software] - 10https://gerrit.wikimedia.org/r/394920 (https://phabricator.wikimedia.org/T181696) (owner: 10Marostegui) [06:58:15] RECOVERY - puppet last run on cp2023 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [06:58:44] (03Merged) 10jenkins-bot: s3.hosts: Remove db1044 [software] - 10https://gerrit.wikimedia.org/r/394920 (https://phabricator.wikimedia.org/T181696) (owner: 10Marostegui) [06:59:09] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler02/9126/" [puppet] - 10https://gerrit.wikimedia.org/r/394919 (https://phabricator.wikimedia.org/T181696) (owner: 10Marostegui) [06:59:11] (03CR) 10Marostegui: [C: 032] mariadb: Decommission db1044 [puppet] - 10https://gerrit.wikimedia.org/r/394919 (https://phabricator.wikimedia.org/T181696) (owner: 10Marostegui) [07:00:25] RECOVERY - puppet last run on db2090 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:05:55] <_joe_> !log playing with puppetdb status for ores2003 (deactivating/reactivating node) [07:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:23] 10Operations, 10DBA, 10Patch-For-Review: Decommission db1044 - https://phabricator.wikimedia.org/T181696#3807873 (10Marostegui) [07:08:38] !log Stop MySQL on db1044 as it will be decommissioned - T181696 [07:08:47] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:47] T181696: Decommission db1044 - https://phabricator.wikimedia.org/T181696 [07:09:40] (03PS1) 10Giuseppe Lavagetto: ores2003: disable notifications temporarily for puppetdb testing [puppet] - 10https://gerrit.wikimedia.org/r/394921 [07:10:21] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1044 - https://phabricator.wikimedia.org/T181696#3807875 (10Marostegui) a:05Marostegui>03Cmjohnson This host is now fully ready to be decommissioned. [07:13:11] PROBLEM - trendingedits endpoints health on scb1001 is CRITICAL: /_info/name (retrieve service name) timed out before a response was received: /robots.txt (robots.txt check) timed out before a response was received: / (root with no query params) timed out before a response was received: / (spec from root) timed out before a response was received: / (root with wrong query param) timed out before a response was received: /_info/h [07:13:11] e home page) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /_info/version (retrieve service version) timed out before a response was received: /{domain}/v1/feed/trending-edits{/period} (retrieve trending articles within the last hour) timed out before a response was received [07:13:28] <_joe_> oh seriously, trendingedits again [07:14:11] (03CR) 10Giuseppe Lavagetto: [C: 032] ores2003: disable notifications temporarily for puppetdb testing [puppet] - 10https://gerrit.wikimedia.org/r/394921 (owner: 10Giuseppe Lavagetto) [07:17:24] !log Compress s1 on db1099 - T178359 [07:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:34] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [07:20:02] RECOVERY - trendingedits endpoints health on scb1001 is OK: All endpoints are healthy [07:25:11] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: 
CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [07:34:12] (03PS1) 10Giuseppe Lavagetto: ores2003: temporarily disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/394922 [07:36:23] (03CR) 10Giuseppe Lavagetto: [C: 032] ores2003: temporarily disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/394922 (owner: 10Giuseppe Lavagetto) [07:36:51] RECOVERY - Check Varnish expiry mailbox lag on cp4026 is OK: OK: expiry mailbox lag is 0 [07:41:11] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [07:43:26] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [07:45:19] (03PS1) 10Giuseppe Lavagetto: ores2003: remove hiera file as tests are over [puppet] - 10https://gerrit.wikimedia.org/r/394924 [07:47:02] (03PS1) 10Marostegui: s6.hosts: Add db1096:3316 [software] - 10https://gerrit.wikimedia.org/r/394925 (https://phabricator.wikimedia.org/T178359) [07:48:21] (03CR) 10Marostegui: [C: 032] s6.hosts: Add db1096:3316 [software] - 10https://gerrit.wikimedia.org/r/394925 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [07:48:41] (03CR) 10Giuseppe Lavagetto: [C: 032] ores2003: remove hiera file as tests are over [puppet] - 10https://gerrit.wikimedia.org/r/394924 (owner: 10Giuseppe Lavagetto) [07:49:06] (03Merged) 10jenkins-bot: s6.hosts: Add db1096:3316 [software] - 10https://gerrit.wikimedia.org/r/394925
(https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [07:52:35] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Pool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394926 (https://phabricator.wikimedia.org/T178359) [07:53:58] !log installing curl security updates [07:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:34] (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw.php: Pool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394926 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [08:09:48] (03PS1) 10Muehlenhoff: Remove firejail wrappers for timidity, lilypond and abc2ly [puppet] - 10https://gerrit.wikimedia.org/r/394927 [08:09:52] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Pool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394926 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [08:10:04] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Pool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394926 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [08:11:31] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Pool db1096:3315 - T178359 (duration: 00m 45s) [08:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:41] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [08:12:12] (03PS1) 10Marostegui: install_server: Allow reimage of db1098 [puppet] - 10https://gerrit.wikimedia.org/r/394928 (https://phabricator.wikimedia.org/T178359) [08:12:23] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Pool db1096:3315 - T178359 (duration: 00m 44s) [08:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:55] (03CR) 10Marostegui: [C: 032] install_server: Allow reimage of db1098 [puppet] - 10https://gerrit.wikimedia.org/r/394928 (https://phabricator.wikimedia.org/T178359) (owner: 
10Marostegui) [08:20:23] PROBLEM - trendingedits endpoints health on scb1001 is CRITICAL: /_info/name (retrieve service name) timed out before a response was received [08:21:03] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1096:331{5,6} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394929 (https://phabricator.wikimedia.org/T178359) [08:24:53] (03PS2) 10Marostegui: db-eqiad.php: Increase traffic for db1096:331{5,6} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394929 (https://phabricator.wikimedia.org/T178359) [08:26:14] (03CR) 10jerkins-bot: [V: 04-1] db-eqiad.php: Increase traffic for db1096:331{5,6} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394929 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [08:26:16] (03PS3) 10Marostegui: db-eqiad.php: Increase traffic for db1096:331{5,6} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394929 (https://phabricator.wikimedia.org/T178359) [08:27:55] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1096:331{5,6} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394929 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [08:29:13] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1096:331{5,6} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394929 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [08:29:23] RECOVERY - trendingedits endpoints health on scb1001 is OK: All endpoints are healthy [08:29:27] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1096:331{5,6} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394929 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [08:30:32] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1096:3315 and pool db1096:3316 - T178359 (duration: 00m 45s) [08:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:41] T178359: Support multi-instance on core hosts - 
https://phabricator.wikimedia.org/T178359 [08:37:06] (03CR) 10Hashar: [C: 031] labnodepool: move standard/firewall includes to role [puppet] - 10https://gerrit.wikimedia.org/r/392769 (owner: 10Dzahn) [08:38:14] (03PS1) 10Marostegui: db-eqiad.php: Increase db1096:3316 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394931 [08:41:38] !log updating tor packages to 0.3.1.9 [08:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:26] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase db1096:3316 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394931 (owner: 10Marostegui) [08:43:52] (03Merged) 10jenkins-bot: db-eqiad.php: Increase db1096:3316 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394931 (owner: 10Marostegui) [08:44:46] !log updating tor on radium to 0.3.1.9 [08:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:54] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1096:3316 - T178359 (duration: 00m 45s) [08:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:03] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [08:46:52] (03CR) 10jenkins-bot: db-eqiad.php: Increase db1096:3316 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394931 (owner: 10Marostegui) [08:49:30] (03Abandoned) 10Giuseppe Lavagetto: monitoring: use service_checker for mobileapps LVS [puppet] - 10https://gerrit.wikimedia.org/r/287908 (https://phabricator.wikimedia.org/T134551) (owner: 10Giuseppe Lavagetto) [08:53:21] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1096:3315,6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394939 [08:55:09] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1096:3315,6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394939 (owner: 10Marostegui) [08:56:37] (03Merged) 10jenkins-bot: 
db-eqiad.php: Increase traffic for db1096:3315,6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394939 (owner: 10Marostegui) [08:57:10] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1096:3315,6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394939 (owner: 10Marostegui) [08:57:37] (03Abandoned) 10Giuseppe Lavagetto: Use internal url for Ores, move to ProductionServices.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316317 (owner: 10Giuseppe Lavagetto) [08:57:50] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1096:3315 and 3316 - T178359 (duration: 00m 45s) [08:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:01] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [08:58:09] (03Abandoned) 10Giuseppe Lavagetto: Use discovery url for Ores as well [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345510 (owner: 10Giuseppe Lavagetto) [08:58:36] (03Abandoned) 10Giuseppe Lavagetto: swift: refactor to role/profile pattern, part 1 [puppet] - 10https://gerrit.wikimedia.org/r/314829 (https://phabricator.wikimedia.org/T147718) (owner: 10Giuseppe Lavagetto) [09:00:05] (03PS1) 10ArielGlenn: actually remove old misc dumps content [puppet] - 10https://gerrit.wikimedia.org/r/394943 [09:01:38] (03PS3) 10Jcrespo: mariadb: Reenable notifications on db2085 after s1 reimport [puppet] - 10https://gerrit.wikimedia.org/r/394622 (https://phabricator.wikimedia.org/T178359) [09:02:03] (03PS1) 10Marostegui: db-eqiad.php: Remove db1099 from s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394944 (https://phabricator.wikimedia.org/T178359) [09:02:27] (03CR) 10Jcrespo: [C: 032] mariadb: Reenable notifications on db2085 after s1 reimport [puppet] - 10https://gerrit.wikimedia.org/r/394622 (https://phabricator.wikimedia.org/T178359) (owner: 10Jcrespo) [09:03:21] (03CR) 10ArielGlenn: [C: 032] actually remove old misc dumps content [puppet] - 
10https://gerrit.wikimedia.org/r/394943 (owner: 10ArielGlenn) [09:03:36] (03PS2) 10ArielGlenn: actually remove old misc dumps content [puppet] - 10https://gerrit.wikimedia.org/r/394943 [09:10:39] (03CR) 10Jcrespo: [C: 031] db-eqiad.php: Remove db1099 from s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394944 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [09:12:05] !log reimaging mw1259 (video scaler) to stretch, will be kept disabled initially (some controlled live tests following) [09:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:35] (03Abandoned) 10Giuseppe Lavagetto: profile::redis::multidc_instance: separate concerns with confd [puppet] - 10https://gerrit.wikimedia.org/r/350404 (owner: 10Giuseppe Lavagetto) [09:17:11] 10Operations, 10Datasets-General-or-Unknown, 10Dumps-Generation, 10hardware-requests: Give misc dump crons their own host - https://phabricator.wikimedia.org/T181936#3807983 (10ArielGlenn) [09:19:18] !log rebooting mariadb at labsdb1005 [09:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:26] !log reboot analytics104* (hadoop worker nodes) for kernel+jvm updates - T179943 [09:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:35] T179943: Restart Analytics JVM daemons for open-jdk security updates - https://phabricator.wikimedia.org/T179943 [09:24:55] RECOVERY - puppet last run on ms-be1039 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [09:25:25] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1039 - https://phabricator.wikimedia.org/T181028#3807994 (10fgiunchedi) 05Open>03Resolved @Cmjohnson thanks! disk is rebuilding. 
[09:26:09] ah no I am wrong, I need to start from 105* [09:26:15] 104* already done [09:29:38] (03PS1) 10Muehlenhoff: Install mw1259 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/394946 [09:29:47] (03PS2) 10Giuseppe Lavagetto: puppet: disable hiera autolookup [puppet] - 10https://gerrit.wikimedia.org/r/380304 [09:30:44] RECOVERY - Disk space on graphite1003 is OK: DISK OK [09:30:58] (03CR) 10Muehlenhoff: [C: 032] Install mw1259 with stretch [puppet] - 10https://gerrit.wikimedia.org/r/394946 (owner: 10Muehlenhoff) [09:31:52] \o/ [09:32:07] !log clear erroneous table metrics from graphite1003 / graphite2002 - T181689 [09:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:19] T181689: New RESTBase Cassandra cluster has legacy tables - https://phabricator.wikimedia.org/T181689 [09:35:30] (03PS1) 10Gehel: maps: Bump maximum zoom to 19 [puppet] - 10https://gerrit.wikimedia.org/r/394948 (https://phabricator.wikimedia.org/T180907) [09:48:05] (03PS1) 10Filippo Giunchedi: hieradata: enable restbase1012-c [puppet] - 10https://gerrit.wikimedia.org/r/394951 (https://phabricator.wikimedia.org/T179422) [09:49:29] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: enable restbase1012-c [puppet] - 10https://gerrit.wikimedia.org/r/394951 (https://phabricator.wikimedia.org/T179422) (owner: 10Filippo Giunchedi) [09:51:06] (03CR) 10Filippo Giunchedi: "I didn't spotted this and merged https://gerrit.wikimedia.org/r/#/c/394951/ already" [puppet] - 10https://gerrit.wikimedia.org/r/394603 (https://phabricator.wikimedia.org/T179422) (owner: 10Eevans) [09:51:34] !log bootstrap restbase1012-c - T179422 [09:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:44] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [09:53:02] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Remove db1099 from s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394944 
(https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [09:54:43] (03Merged) 10jenkins-bot: db-eqiad.php: Remove db1099 from s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394944 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [09:55:52] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Remove db1099:3318 from s5 (duration: 00m 44s) [09:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:06] (03CR) 10jenkins-bot: db-eqiad.php: Remove db1099 from s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394944 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [09:58:11] 10Operations, 10Puppet, 10User-Joe: Update puppet code to conform to puppet 4.x and later standards - https://phabricator.wikimedia.org/T181967#3808087 (10Joe) [10:00:41] (03PS1) 10Marostegui: db-eqiad.php: Fully pool db1096:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394954 (https://phabricator.wikimedia.org/T178359) [10:03:54] (03PS1) 10Marostegui: mariadb: Conver db1098 to multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/394955 (https://phabricator.wikimedia.org/T178359) [10:03:56] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully pool db1096:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394954 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [10:05:48] (03Merged) 10jenkins-bot: db-eqiad.php: Fully pool db1096:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394954 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [10:05:55] (03PS2) 10Marostegui: mariadb: Convert db1098 to multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/394955 (https://phabricator.wikimedia.org/T178359) [10:06:52] (03CR) 10jenkins-bot: db-eqiad.php: Fully pool db1096:3316 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394954 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [10:06:56] !log marostegui@tin Synchronized 
wmf-config/db-eqiad.php: Fully pool db1096:3316 - T178359wq! (duration: 00m 45s) [10:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:18] (03PS1) 10ArielGlenn: make config subdirs for dumps on web servers owned by root [puppet] - 10https://gerrit.wikimedia.org/r/394956 [10:09:19] (03CR) 10ArielGlenn: [C: 032] make config subdirs for dumps on web servers owned by root [puppet] - 10https://gerrit.wikimedia.org/r/394956 (owner: 10ArielGlenn) [10:09:41] (03PS1) 10Ema: mtail: add 'ensure' parameter, remove 'enabled' [puppet] - 10https://gerrit.wikimedia.org/r/394957 [10:10:11] (03PS2) 10Ema: mtail: add 'ensure' parameter, remove 'enabled' [puppet] - 10https://gerrit.wikimedia.org/r/394957 [10:11:54] (03CR) 10Filippo Giunchedi: Add a Prometheus exporter for PDNS recursor (033 comments) [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/394557 (owner: 10Muehlenhoff) [10:12:12] (03CR) 10Filippo Giunchedi: [C: 031] Grant prometheus user to run rec_control on DNS recursors [puppet] - 10https://gerrit.wikimedia.org/r/394554 (https://phabricator.wikimedia.org/T181620) (owner: 10Muehlenhoff) [10:12:28] (03PS1) 10ArielGlenn: add dumpsgen user to labstore1006/7 [puppet] - 10https://gerrit.wikimedia.org/r/394958 [10:12:46] (03CR) 10Filippo Giunchedi: Add Prometheus exporter to DNS recursors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/394562 (https://phabricator.wikimedia.org/T181620) (owner: 10Muehlenhoff) [10:13:13] (03CR) 10Filippo Giunchedi: [C: 031] Add pdns rec exporters to Prometheus scraper config [puppet] - 10https://gerrit.wikimedia.org/r/394564 (https://phabricator.wikimedia.org/T181620) (owner: 10Muehlenhoff) [10:14:42] PROBLEM - MD RAID on mw1259 is CRITICAL: Return code of 255 is out of bounds [10:15:55] 10Operations, 10ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3808161 (10MoritzMuehlenhoff) No, I think the Launchpad bug is 
unrelated: It seems an older kernel in the 3.2 days had a bug which exposed the same fallout, but our current kernel... [10:16:17] (03CR) 10ArielGlenn: [C: 032] add dumpsgen user to labstore1006/7 [puppet] - 10https://gerrit.wikimedia.org/r/394958 (owner: 10ArielGlenn) [10:17:53] (03PS3) 10Marostegui: mariadb: Convert db1098 to multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/394955 (https://phabricator.wikimedia.org/T178359) [10:19:00] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler02/9130/" [puppet] - 10https://gerrit.wikimedia.org/r/394955 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [10:19:40] (03CR) 10Marostegui: [C: 032] mariadb: Convert db1098 to multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/394955 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [10:21:37] (03PS2) 10ArielGlenn: move various misc dump cron jobs to new nfs server [puppet] - 10https://gerrit.wikimedia.org/r/394852 [10:22:11] (03CR) 10jerkins-bot: [V: 04-1] move various misc dump cron jobs to new nfs server [puppet] - 10https://gerrit.wikimedia.org/r/394852 (owner: 10ArielGlenn) [10:23:03] (03CR) 10Muehlenhoff: Add Prometheus exporter to DNS recursors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/394562 (https://phabricator.wikimedia.org/T181620) (owner: 10Muehlenhoff) [10:23:06] (03PS3) 10ArielGlenn: move various misc dump cron jobs to new nfs server [puppet] - 10https://gerrit.wikimedia.org/r/394852 (https://phabricator.wikimedia.org/T179942) [10:23:17] (03PS2) 10Muehlenhoff: Add Prometheus exporter to DNS recursors [puppet] - 10https://gerrit.wikimedia.org/r/394562 (https://phabricator.wikimedia.org/T181620) [10:23:56] (03PS3) 10Ema: mtail: add 'ensure' parameter, remove 'enabled' [puppet] - 10https://gerrit.wikimedia.org/r/394957 [10:25:03] PROBLEM - SSH on ms-be1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:25:46] (03CR) 10Ema: 
"https://puppet-compiler.wmflabs.org/compiler02/9133/mx1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/394957 (owner: 10Ema) [10:27:01] RECOVERY - SSH on ms-be1039 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u1 (protocol 2.0) [10:27:12] PROBLEM - very high load average likely xfs on ms-be1039 is CRITICAL: CRITICAL - load average: 126.33, 102.04, 78.49 [10:27:23] (03PS8) 10Ema: vcl: distinguish between hfp and hfm [puppet] - 10https://gerrit.wikimedia.org/r/391171 (https://phabricator.wikimedia.org/T180434) [10:27:57] (03CR) 10Ema: [V: 032 C: 032] vcl: distinguish between hfp and hfm [puppet] - 10https://gerrit.wikimedia.org/r/391171 (https://phabricator.wikimedia.org/T180434) (owner: 10Ema) [10:28:10] uh uh, 1039 had one of its disks replaced earlier [10:28:14] I'll take a look [10:29:00] (03CR) 10Jcrespo: [C: 032] mariadb: Pool db2085:3311 (s1) after being moved from db2092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394615 (https://phabricator.wikimedia.org/T178359) (owner: 10Jcrespo) [10:29:12] (03PS2) 10Jcrespo: mariadb: Pool db2085:3311 (s1) after being moved from db2092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394615 (https://phabricator.wikimedia.org/T178359) [10:32:04] ok I'm not sure of ms-be1039 status, I'll reboot [10:32:12] RECOVERY - very high load average likely xfs on ms-be1039 is OK: OK - load average: 64.48, 79.84, 75.46 [10:34:01] RECOVERY - MD RAID on mw1259 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [10:34:08] 10Operations, 10Ops-Access-Requests, 10AICaptcha, 10WMF-NDA-Requests: Requesting access to EventLogging data for Vinitha - https://phabricator.wikimedia.org/T181952#3808244 (10Aklapper) @Groovier should be able to access and sign L2 now, as per https://wikitech.wikimedia.org/wiki/Volunteer_NDA [10:35:29] (03CR) 10Filippo Giunchedi: [C: 031] mtail: add 'ensure' parameter, remove 'enabled' [puppet] - 10https://gerrit.wikimedia.org/r/394957 (owner: 10Ema) [10:35:46] (03PS2) 10Muehlenhoff: 
Grant prometheus user to run rec_control on DNS recursors [puppet] - 10https://gerrit.wikimedia.org/r/394554 (https://phabricator.wikimedia.org/T181620) [10:36:50] (03CR) 10jenkins-bot: mariadb: Pool db2085:3311 (s1) after being moved from db2092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394615 (https://phabricator.wikimedia.org/T178359) (owner: 10Jcrespo) [10:37:30] (03CR) 10Filippo Giunchedi: [C: 031] Add Prometheus exporter to DNS recursors [puppet] - 10https://gerrit.wikimedia.org/r/394562 (https://phabricator.wikimedia.org/T181620) (owner: 10Muehlenhoff) [10:37:33] (03PS4) 10Ema: mtail: add 'ensure' parameter, remove 'enabled' [puppet] - 10https://gerrit.wikimedia.org/r/394957 [10:37:42] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2085 (duration: 00m 43s) [10:37:42] (03CR) 10Ema: [V: 032 C: 032] mtail: add 'ensure' parameter, remove 'enabled' [puppet] - 10https://gerrit.wikimedia.org/r/394957 (owner: 10Ema) [10:37:44] (03PS11) 10Jcrespo: mariadb: Remove mariadb.pp and move some old roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/394541 (https://phabricator.wikimedia.org/T150850) [10:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:33] (03CR) 10Filippo Giunchedi: mediawiki/hhvm: Move fatal-error.php to Puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/379953 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [10:41:43] (03PS1) 10Muehlenhoff: Install lilypond from stretch-backports [puppet] - 10https://gerrit.wikimedia.org/r/394963 [10:44:46] (03PS11) 10Ema: varnish: prometheus equivalent of statsd metrics daemons [puppet] - 10https://gerrit.wikimedia.org/r/394543 (https://phabricator.wikimedia.org/T177199) [10:47:03] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/394543 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [10:47:08] (03PS3) 10Muehlenhoff: Grant prometheus user to run rec_control on DNS 
recursors [puppet] - 10https://gerrit.wikimedia.org/r/394554 (https://phabricator.wikimedia.org/T181620) [10:47:32] 10Operations, 10Puppet, 10User-Joe: Disable hiera autolookups - https://phabricator.wikimedia.org/T181971#3808267 (10Joe) [10:47:59] (03CR) 10Muehlenhoff: [C: 032] Grant prometheus user to run rec_control on DNS recursors [puppet] - 10https://gerrit.wikimedia.org/r/394554 (https://phabricator.wikimedia.org/T181620) (owner: 10Muehlenhoff) [10:48:06] (03CR) 10Ema: "pcc output lgtm https://puppet-compiler.wmflabs.org/compiler02/9134/" [puppet] - 10https://gerrit.wikimedia.org/r/394543 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [10:48:59] (03PS12) 10Ema: varnish: prometheus equivalent of statsd metrics daemons [puppet] - 10https://gerrit.wikimedia.org/r/394543 (https://phabricator.wikimedia.org/T177199) [10:49:20] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3808286 (10fgiunchedi) [10:50:17] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394964 (https://phabricator.wikimedia.org/T128546) [10:53:21] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3808306 (10fgiunchedi) [10:53:47] (03PS1) 10Marostegui: s6.hosts: db1098 will replicate on 3316 [software] - 10https://gerrit.wikimedia.org/r/394965 (https://phabricator.wikimedia.org/T178359) [10:56:06] (03CR) 10Marostegui: [C: 032] s6.hosts: db1098 will replicate on 3316 [software] - 10https://gerrit.wikimedia.org/r/394965 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [10:56:43] (03CR) 10Jcrespo: [C: 031] "https://puppet-compiler.wmflabs.org/compiler02/9136/ I think there are some firewall applications I am not sure about on misc services, 
ot" [puppet] - 10https://gerrit.wikimedia.org/r/394541 (https://phabricator.wikimedia.org/T150850) (owner: 10Jcrespo) [10:56:49] (03Merged) 10jenkins-bot: s6.hosts: db1098 will replicate on 3316 [software] - 10https://gerrit.wikimedia.org/r/394965 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [11:00:05] jan_drewniak: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171204T1100). [11:00:05] No GERRIT patches in the queue for this window AFAICS. [11:01:05] (03CR) 10Jdrewniak: [C: 032] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394964 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:02:26] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394964 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:02:41] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394964 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:05:36] !log jdrewniak@tin Synchronized portals/prod/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:394964|Bumping portals to master (T128546)]] (duration: 00m 45s) [11:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:48] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [11:06:22] !log jdrewniak@tin Synchronized portals: Wikimedia Portals Update: [[gerrit:394964|Bumping portals to master (T128546)]] (duration: 00m 45s) [11:06:31] (03PS1) 10Elukey: [WIP] role::puppetmaster::puppetdb: add Prometheus monitoring for puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/394966 [11:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:26] (03PS2) 10Elukey: [WIP] 
role::puppetmaster::puppetdb: add Prometheus monitoring for puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/394966 [11:11:02] (03PS3) 10Elukey: [WIP] role::puppetmaster::puppetdb: add Prometheus monitoring for puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/394966 [11:11:54] (03PS13) 10Ema: varnish: prometheus equivalent of statsd metrics daemons [puppet] - 10https://gerrit.wikimedia.org/r/394543 (https://phabricator.wikimedia.org/T177199) [11:13:01] (03PS4) 10Elukey: [WIP] role::puppetmaster::puppetdb: add Prometheus monitoring for puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/394966 [11:14:33] PROBLEM - Hadoop NodeManager on analytics1056 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [11:14:39] (03CR) 10Elukey: "pcc https://puppet-compiler.wmflabs.org/compiler02/9140/nitrogen.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/394966 (owner: 10Elukey) [11:14:54] PROBLEM - Hadoop NodeManager on analytics1057 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [11:14:58] 1056 is me! 
I am draining it and it is taking a long time [11:15:10] adding downtime [11:21:38] (03CR) 10Ema: [C: 032] varnish: prometheus equivalent of statsd metrics daemons [puppet] - 10https://gerrit.wikimedia.org/r/394543 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [11:24:24] PROBLEM - cassandra-c CQL 10.64.32.204:9042 on restbase1012 is CRITICAL: connect to address 10.64.32.204 and port 9042: Connection refused [11:25:03] RECOVERY - Hadoop NodeManager on analytics1057 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [11:25:34] RECOVERY - Hadoop NodeManager on analytics1056 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [11:30:34] (03PS1) 10KartikMistry: hfst: New upstream release [debs/contenttranslation/hfst] - 10https://gerrit.wikimedia.org/r/394967 (https://phabricator.wikimedia.org/T181463) [11:45:09] (03CR) 10KartikMistry: [C: 04-1] "-1 until some more testing is done." [debs/contenttranslation/hfst] - 10https://gerrit.wikimedia.org/r/394967 (https://phabricator.wikimedia.org/T181463) (owner: 10KartikMistry) [11:51:16] 10Operations, 10ops-eqiad, 10Analytics-Kanban: kafka1018 fails to boot - https://phabricator.wikimedia.org/T181518#3808499 (10mark) >>! In T181518#3800473, @elukey wrote: > Just spoke to Chris and Faidon on IRC, and with my team. The best option seems to be repurpose notebook1002.eqiad.wmnet to kafka1023.eqi... [11:52:49] (03PS4) 10ArielGlenn: move various misc dump cron jobs to new nfs server [puppet] - 10https://gerrit.wikimedia.org/r/394852 (https://phabricator.wikimedia.org/T179942) [12:09:12] (03CR) 10MarcoAurelio: "If there are no concerns with this being added I'll try to find a suitable deployment window to merge this patch." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/394379 (https://phabricator.wikimedia.org/T181713) (owner: 10MarcoAurelio) [12:13:31] (03PS5) 10ArielGlenn: move various misc dump cron jobs to new nfs server [puppet] - 10https://gerrit.wikimedia.org/r/394852 (https://phabricator.wikimedia.org/T179942) [12:16:55] (03CR) 10ArielGlenn: [C: 032] move various misc dump cron jobs to new nfs server [puppet] - 10https://gerrit.wikimedia.org/r/394852 (https://phabricator.wikimedia.org/T179942) (owner: 10ArielGlenn) [12:17:28] PROBLEM - HHVM rendering on mw2221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:18:18] RECOVERY - HHVM rendering on mw2221 is OK: HTTP OK: HTTP/1.1 200 OK - 77705 bytes in 0.305 second response time [12:28:43] 10Operations: etcd-mirror failure - https://phabricator.wikimedia.org/T181920#3808641 (10akosiaris) I think that the problem for 'NoneType' + 'int' is worth looking into a bit more in any case. I am also utterly unsure why the processes logged so many times the "Can't stop reactor" messages, was it trying to cle... [12:31:44] (03PS1) 10ArielGlenn: make misc dump cron scripts owned by root [puppet] - 10https://gerrit.wikimedia.org/r/394976 [12:32:07] PROBLEM - Check systemd state on ms-be1039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:32:30] (03PS2) 10ArielGlenn: make misc dump cron scripts owned by root [puppet] - 10https://gerrit.wikimedia.org/r/394976 [12:33:07] (03CR) 10ArielGlenn: [C: 032] make misc dump cron scripts owned by root [puppet] - 10https://gerrit.wikimedia.org/r/394976 (owner: 10ArielGlenn) [12:41:45] (03PS1) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 [12:42:10] (03CR) 10ArielGlenn: "As promised, here is some copy-pasta to get started on these." 
[puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [12:42:15] (03CR) 10jerkins-bot: [V: 04-1] [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [12:43:48] !log gehel@tin Started deploy [kartotherian/deploy@e166d87]: testing new kartotherian packaging on maps-test2003 [12:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:10] !log gehel@tin Finished deploy [kartotherian/deploy@e166d87]: testing new kartotherian packaging on maps-test2003 (duration: 00m 22s) [12:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:16] (03CR) 10Alexandros Kosiaris: "I did some extra search on this one, the file is on all puppetmasters with" [puppet] - 10https://gerrit.wikimedia.org/r/394585 (owner: 10Alexandros Kosiaris) [13:05:49] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scoring-platform-team (Current): Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3808791 (10akosiaris) >>! In T181661#3804344, @awight wrote: > I just ran scap with `-l "ores1001.*" and deployment went smoothly.... [13:05:52] !log Compress s6 on db1098 - T178359 [13:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:03] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [13:06:46] 10Operations, 10Mail: set up DMARC aggregate report collection into a database for research and reporting - https://phabricator.wikimedia.org/T86209#963609 (10akosiaris) What's the status of this ? 
[13:07:01] (03CR) 10Alexandros Kosiaris: [C: 031] diadem/dysprosium: introduce skeleton role [puppet] - 10https://gerrit.wikimedia.org/r/394624 (https://phabricator.wikimedia.org/T169566) (owner: 10Dzahn) [13:08:22] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: Investigate and improve memory allocation rates of WDQS - https://phabricator.wikimedia.org/T181988#3808796 (10Gehel) [13:09:01] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: Investigate and improve memory allocation rates of WDQS - https://phabricator.wikimedia.org/T181988#3808811 (10Gehel) [13:10:20] (03PS1) 10Giuseppe Lavagetto: [WiP] Puppet 4 compatibility [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/394981 [13:10:42] (03CR) 10jerkins-bot: [V: 04-1] [WiP] Puppet 4 compatibility [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/394981 (owner: 10Giuseppe Lavagetto) [13:10:56] (03PS1) 10Muehlenhoff: Add a Prometheus exporter for PDNS recursor [debs/prometheus-pdns-rec-exporter] - 10https://gerrit.wikimedia.org/r/394982 [13:13:17] PROBLEM - SSH on ms-be1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:17:35] (03PS1) 10Marostegui: db-eqiad.php: Depool db1105:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394984 (https://phabricator.wikimedia.org/T174569) [13:18:02] (03CR) 10Elukey: "Added more people to the code review. Let me know if you like the idea or not!" 
[puppet] - 10https://gerrit.wikimedia.org/r/394966 (owner: 10Elukey) [13:19:27] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1105:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394984 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [13:20:40] !log Deploy schema change on db1105:3312 (s2) - T174569 [13:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:48] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [13:21:06] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1105:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394984 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [13:21:17] RECOVERY - SSH on ms-be1039 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u1 (protocol 2.0) [13:21:17] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1105:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394984 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [13:21:27] PROBLEM - Check systemd state on ms-be1039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[13:22:09] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1105:3312 - T174569 (duration: 00m 45s) [13:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:26] (03PS1) 10Elukey: site.pp: set notebook1002 as spare::system [puppet] - 10https://gerrit.wikimedia.org/r/394985 (https://phabricator.wikimedia.org/T181518) [13:24:38] PROBLEM - very high load average likely xfs on ms-be1039 is CRITICAL: CRITICAL - load average: 146.56, 106.92, 82.71 [13:24:40] !log upgrading bast2001 to stretch [13:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:51] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: Icinga check for WDQS should do an actual query - https://phabricator.wikimedia.org/T181989#3808841 (10Gehel) [13:30:11] (03PS1) 10Gehel: wdqs: use the /readiness-probe in WDQS icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/394987 (https://phabricator.wikimedia.org/T181989) [13:32:57] PROBLEM - DPKG on bast2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:33:58] PROBLEM - Check whether ferm is active by checking the default input chain on bast2001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [13:36:38] PROBLEM - puppet last run on bast2001 is CRITICAL: Return code of 255 is out of bounds [13:37:07] PROBLEM - configured eth on bast2001 is CRITICAL: Return code of 255 is out of bounds [13:37:17] PROBLEM - Disk space on bast2001 is CRITICAL: Return code of 255 is out of bounds [13:37:17] PROBLEM - CPU frequency on bast2001 is CRITICAL: Return code of 255 is out of bounds [13:37:18] PROBLEM - Check size of conntrack table on bast2001 is CRITICAL: Return code of 255 is out of bounds [13:37:28] PROBLEM - Confd template for /etc/dsh/group/mediawiki-installation on bast2001 is CRITICAL: Return code of 255 is out of bounds [13:37:28] PROBLEM - confd service 
on bast2001 is CRITICAL: Return code of 255 is out of bounds [13:37:38] PROBLEM - MD RAID on bast2001 is CRITICAL: Return code of 255 is out of bounds [13:37:47] PROBLEM - Confd template for /etc/dsh/group/parsoid on bast2001 is CRITICAL: Return code of 255 is out of bounds [13:37:47] PROBLEM - Check systemd state on bast2001 is CRITICAL: Return code of 255 is out of bounds [13:37:57] PROBLEM - Confd template for /etc/dsh/group/jobrunner on bast2001 is CRITICAL: Return code of 255 is out of bounds [13:37:59] PROBLEM - Confd template for /etc/dsh/group/maps on bast2001 is CRITICAL: Return code of 255 is out of bounds [13:37:59] PROBLEM - dhclient process on bast2001 is CRITICAL: Return code of 255 is out of bounds [13:37:59] PROBLEM - Confd template for /etc/dsh/group/cassandra on bast2001 is CRITICAL: Return code of 255 is out of bounds [13:39:57] RECOVERY - Confd template for /etc/dsh/group/jobrunner on bast2001 is OK: No errors detected [13:39:58] RECOVERY - Confd template for /etc/dsh/group/maps on bast2001 is OK: No errors detected [13:40:08] RECOVERY - dhclient process on bast2001 is OK: PROCS OK: 0 processes with command name dhclient [13:40:08] RECOVERY - Confd template for /etc/dsh/group/cassandra on bast2001 is OK: No errors detected [13:40:08] RECOVERY - configured eth on bast2001 is OK: OK - interfaces up [13:40:17] RECOVERY - Disk space on bast2001 is OK: DISK OK [13:40:18] RECOVERY - CPU frequency on bast2001 is OK: OK: CPU frequency is = 600 MHz (1202 MHz) [13:40:18] RECOVERY - Check size of conntrack table on bast2001 is OK: OK: nf_conntrack is 0 % full [13:40:35] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10Patch-For-Review: kafka1018 fails to boot - https://phabricator.wikimedia.org/T181518#3808893 (10elukey) My idea is to: 1) set notebook1002 as spare::system and clean it up from running services. 2) use wmf-auto-reimage to rename the host to kafka1023 if po... 
[13:40:37] RECOVERY - Confd template for /etc/dsh/group/mediawiki-installation on bast2001 is OK: No errors detected [13:40:37] RECOVERY - confd service on bast2001 is OK: OK - confd is active [13:40:47] RECOVERY - MD RAID on bast2001 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [13:40:47] RECOVERY - Confd template for /etc/dsh/group/parsoid on bast2001 is OK: No errors detected [13:41:07] RECOVERY - Check whether ferm is active by checking the default input chain on bast2001 is OK: OK ferm input default policy is set [13:42:01] (03PS4) 10Gehel: udp2log: use LVS endpoint for logstash [puppet] - 10https://gerrit.wikimedia.org/r/388426 (https://phabricator.wikimedia.org/T175242) [13:42:48] RECOVERY - Check systemd state on bast2001 is OK: OK - running: The system is fully operational [13:43:33] (03PS3) 10Volans: First working version [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394620 (https://phabricator.wikimedia.org/T167504) [13:43:35] (03PS3) 10Volans: Add basic test coverage [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504) [13:43:37] (03PS1) 10Volans: Add CLI script to be installed in the target hosts [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394990 (https://phabricator.wikimedia.org/T167504) [13:43:47] RECOVERY - very high load average likely xfs on ms-be1039 is OK: OK - load average: 52.91, 66.61, 78.36 [13:44:07] RECOVERY - DPKG on bast2001 is OK: All packages OK [13:48:21] (03PS2) 10ArielGlenn: ability to do xmlpageslogging several pieces at a time in parallel [dumps] - 10https://gerrit.wikimedia.org/r/394857 (https://phabricator.wikimedia.org/T181935) [13:51:37] RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:54:37] PROBLEM - Host bast2001 is DOWN: PING CRITICAL - Packet loss = 100% [13:54:43] (that's me) [13:55:18] RECOVERY - cassandra-c CQL 10.64.32.204:9042 on restbase1012 is OK: TCP OK - 
0.000 second response time on 10.64.32.204 port 9042 [13:55:27] RECOVERY - Host bast2001 is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms [14:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171204T1400). Please do the needful. [14:00:04] Jhs: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:15] I can SWAT today [14:00:16] i'm here! [14:00:25] (03PS5) 10Gehel: udp2log: use LVS endpoint for logstash [puppet] - 10https://gerrit.wikimedia.org/r/388426 (https://phabricator.wikimedia.org/T175242) [14:01:16] Jhs: so I merge the commit, deploy to mwdebug1002, run the script and let you know so you can test? if all good full deploy? [14:01:19] sounds good? 
[14:01:32] zeljkof, sounds good yeah [14:01:37] reboot analytics106* (hadoop worker nodes) for kernel+jvm updates - T179943 [14:01:38] T179943: Restart Analytics JVM daemons for open-jdk security updates - https://phabricator.wikimedia.org/T179943 [14:01:42] ufff [14:01:52] !log reboot analytics106* (hadoop worker nodes) for kernel+jvm updates - T179943 [14:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:02] all right [14:02:47] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394771 (https://phabricator.wikimedia.org/T181782) (owner: 10Jon Harald Søby) [14:03:46] (03CR) 10Gehel: "Puppet compiler looks happy" [puppet] - 10https://gerrit.wikimedia.org/r/388426 (https://phabricator.wikimedia.org/T175242) (owner: 10Gehel) [14:03:48] (03PS1) 10Marostegui: mariadb: Enable barracuda on the last roles [puppet] - 10https://gerrit.wikimedia.org/r/394994 (https://phabricator.wikimedia.org/T150949) [14:04:13] (03Merged) 10jenkins-bot: Localize sitename and meta NS for wawiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394771 (https://phabricator.wikimedia.org/T181782) (owner: 10Jon Harald Søby) [14:05:05] (03CR) 10Marostegui: [C: 032] mariadb: Enable barracuda on the last roles [puppet] - 10https://gerrit.wikimedia.org/r/394994 (https://phabricator.wikimedia.org/T150949) (owner: 10Marostegui) [14:07:01] (03CR) 10jenkins-bot: Localize sitename and meta NS for wawiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394771 (https://phabricator.wikimedia.org/T181782) (owner: 10Jon Harald Søby) [14:07:11] Jhs: `scap pull` is taking forever at mwdebug1002 :/ [14:07:23] ok, done, running the script... 
[14:07:57] (Y) [14:09:48] (03PS18) 10TerraCodes: $wmf* -> $wmg* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) [14:09:58] (03CR) 10Zfilipin: "zfilipin@terbium:~$ mwscript namespaceDupes.php wawiktionary --fix --move-talk" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394771 (https://phabricator.wikimedia.org/T181782) (owner: 10Jon Harald Søby) [14:10:00] (03PS6) 10TerraCodes: Remove single editor tab for plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393121 (https://phabricator.wikimedia.org/T181045) [14:10:27] Jhs: its at mwdebug1002, the script has finished, please test and let me know if I can deploy [14:11:04] zeljkof, i think something is wrong [14:11:13] uh oh [14:11:16] on https://wa.wiktionary.org/w/index.php?title=Sipeci%C3%A5s%3AIndecse+pa+betchete&prefix=Wiccionaire&namespace=0 there are plenty of pages, but when you click them it says they don't exist [14:11:45] those are pages that already had the prefix [14:12:56] maybe the script is not correct? https://gerrit.wikimedia.org/r/#/c/394771/3 [14:13:06] it only found 1 page [14:13:09] zeljkof, perhaps try to run the script again with these variables: php namespaceDupes.php wawiktionary --fix --source-pseudo-namespace=Wiccionaire --dest-namespace=Wiccionaire [14:13:22] not sure about the dest-namespace part, if it should be the name or the numeric ID [14:13:32] let me try... [14:14:59] sorry zeljkof it should be number [14:15:08] ok, did not run it yet [14:15:14] so, what is the correct script? [14:16:11] php namespaceDupes.php --fix --source-pseudo-namespace=Wiccionaire --dest-namespace=4 --move-talk [14:17:00] zeljkof, php namespaceDupes.php wawiktionary --fix --source-pseudo-namespace=Wiccionaire --dest-namespace=4 --move-talk [14:17:06] forgot the wawiktionary part [14:18:45] ok, executed, found more stuff [14:18:56] but the script ended in "Oh noeees"? 
[14:18:58] will paste [14:19:34] i think that is because of this page: https://wa.wiktionary.org/wiki/Wiccionaire:Babel [14:19:47] it existed both as "Wiccionaire:Babel" and "Wiktionary:Babel" [14:20:10] the former was just a redirect to the latter, so it could just be deleted, but how? [14:20:28] 10Operations, 10Goal, 10User-fgiunchedi: Port nutcracker statistics to Prometheus - https://phabricator.wikimedia.org/T181995#3809040 (10fgiunchedi) [14:20:38] Jhs: https://phabricator.wikimedia.org/T181782#3809054 [14:20:51] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3650139 (10fgiunchedi) [14:22:35] Jhs: https://phabricator.wikimedia.org/T181782#3809074 [14:23:15] zeljkof, seems my assumption was correct. so if you try to run the script one last time, with the variable --add-prefix=Moved [14:23:20] 10Operations, 10Mail: set up DMARC aggregate report collection into a database for research and reporting - https://phabricator.wikimedia.org/T86209#3809083 (10herron) 05stalled>03Resolved a:03herron Stalled. Going to close this as the dmarcian account is working. If down the road that changes we can re-... [14:23:51] Jhs: this? mwscript namespaceDupes.php wawiktionary --fix --source-pseudo-namespace=Wiccionaire --dest-namespace=4 --move-talk --add-prefix=Moved [14:24:01] zeljkof: I have one thing for SWAT [14:24:04] zeljkof, yes [14:24:08] let me make a patch [14:24:22] Amir1: we should be done soon [14:24:31] cool [14:25:17] Jhs: all good now? 
[14:25:21] (03CR) 10Gehel: "Puppet compiler output: https://puppet-compiler.wmflabs.org/compiler02/9142/" [puppet] - 10https://gerrit.wikimedia.org/r/388426 (https://phabricator.wikimedia.org/T175242) (owner: 10Gehel) [14:26:32] zeljkof, no… [14:26:40] oh noes :( [14:26:45] zeljkof, the namespace now seems to be named "Wiktionary", not "Wiccionaire", as if the patch wasn't applied [14:27:00] oooops [14:27:01] sorry [14:27:12] wrong script? [14:27:22] i deactivated the Chrome extension to show 1002 somehow [14:27:27] when i reactivated it looks correct [14:27:30] oh [14:27:38] ok, good to deploy? [14:27:43] yes [14:27:44] finally :) [14:27:47] deploying :) [14:28:36] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:394771|Localize sitename and meta NS for wawiktionary (T181782)]] (duration: 00m 46s) [14:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:46] T181782: Change project namespace name on Walloon Wiktionary to "Wiccionaire" - https://phabricator.wikimedia.org/T181782 [14:29:08] Jhs: deployed, please check and thanks for releasing with #releng ;) [14:29:21] Amir1: I am done, want to deploy your patch yourself? 
[14:29:34] if that's fine for you [14:29:37] sure [14:29:43] go ahead, swat is yours [14:30:04] Thanks [14:30:19] !log reboot druid100[23] for kernel updates [14:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:57] zeljkof, all looks good yeah :) thanks for the help & patience again :) [14:31:37] Jhs: I am glad I could help, both of us have learned something today :) [14:34:06] (03CR) 10Filippo Giunchedi: [C: 031] udp2log: use LVS endpoint for logstash [puppet] - 10https://gerrit.wikimedia.org/r/388426 (https://phabricator.wikimedia.org/T175242) (owner: 10Gehel) [14:34:29] (03PS6) 10Gehel: udp2log: use LVS endpoint for logstash [puppet] - 10https://gerrit.wikimedia.org/r/388426 (https://phabricator.wikimedia.org/T175242) [14:35:29] (03CR) 10Gehel: [C: 032] udp2log: use LVS endpoint for logstash [puppet] - 10https://gerrit.wikimedia.org/r/388426 (https://phabricator.wikimedia.org/T175242) (owner: 10Gehel) [14:35:44] zeljkof, uh oh https://wa.wiktionary.org/w/index.php?title=Sipeci%C3%A5s%3AIndecse+pa+betchete&prefix=&namespace=5 the last two pages give MWException :\ [14:35:56] 10Operations, 10ops-eqiad: Disconnect flerovium's disk shelves - https://phabricator.wikimedia.org/T181724#3809120 (10Cmjohnson) Yes, that is confirmed...i removed the disk shelves, rebooted the server w/out them attached....powered off re-attached and powered on. That is the current state the server is in [14:36:27] Jhs: :( [14:36:44] Amir1: do you know what is the problem? 
^ [14:37:15] !log gehel@tin Started deploy [kartotherian/deploy@e166d87]: dummy kartotherian deployment to test udp2log config change - T175242 [14:37:19] !log gehel@tin Finished deploy [kartotherian/deploy@e166d87]: dummy kartotherian deployment to test udp2log config change - T175242 (duration: 00m 03s) [14:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:28] T175242: all log producers need to use the logstash LVS endpoint - https://phabricator.wikimedia.org/T175242 [14:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:39] no, I can't see any issue [14:38:43] 10Operations, 10ops-codfw: Disconnect furud's disk shelves - https://phabricator.wikimedia.org/T181725#3809125 (10faidon) 05stalled>03Open Per T181724, let's proceed with the original plan -- cc @Papaul :) [14:38:58] zeljkof: [WiVd5gpAICoAACB3hhEAAAAJ] 2017-12-04 14:38:30: Erreur fatale de type « MWException » [14:39:38] Amir1: clicking the last two links at this page https://wa.wiktionary.org/w/index.php?title=Sipeci%C3%A5s%3AIndecse+pa+betchete&prefix=&namespace=5 [14:40:20] that's horrible [14:40:35] I will look into it ASAP [14:40:57] their histories work, it's just the latest revisions that give the exception [14:41:18] Amir1: this is what I did (and below) https://phabricator.wikimedia.org/T181782#3809033 [14:43:12] (03PS1) 10Herron: puppet: cut over all puppet service records to codfw puppet 4 masters [dns] - 10https://gerrit.wikimedia.org/r/395003 (https://phabricator.wikimedia.org/T177254) [14:43:24] cool [14:46:52] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Patch-For-Review: all log producers need to use the logstash LVS endpoint - https://phabricator.wikimedia.org/T175242#3809153 (10Gehel) All references to logstash100[123] have been removed from puppet. I'll still do a check that no tra... 
[14:48:14] forgot the justification :( [14:48:23] Jhs, Amir1: I am not sure what to do, let me know if I can help [14:48:36] I will take care of the rest [14:48:57] Amir1: thanks! [14:49:02] (03PS3) 10Filippo Giunchedi: cassandra: reprovision restbase1014 with cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/394343 (https://phabricator.wikimedia.org/T179422) [14:49:19] let me know if I did something wrong, so I know for the next time [14:49:41] !log ladsgroup@tin Synchronized php-1.31.0-wmf.10/extensions/Wikibase: (no justification provided) (duration: 01m 41s) [14:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:33] !log deployed backward compatibility of entity compact diff transmit T113468 [14:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:41] T113468: [Task] Use compact representation of diffs in EntityChange. - https://phabricator.wikimedia.org/T113468 [14:50:42] (03PS4) 10Filippo Giunchedi: cassandra: reprovision restbase1014 with cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/394343 (https://phabricator.wikimedia.org/T179422) [14:51:32] !log cutting over all production puppet agents to codfw puppet 4 masters via dns [14:51:39] (03CR) 10Herron: [C: 032] puppet: cut over all puppet service records to codfw puppet 4 masters [dns] - 10https://gerrit.wikimedia.org/r/395003 (https://phabricator.wikimedia.org/T177254) (owner: 10Herron) [14:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:58] (03CR) 10Filippo Giunchedi: [C: 032] cassandra: reprovision restbase1014 with cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/394343 (https://phabricator.wikimedia.org/T179422) (owner: 10Filippo Giunchedi) [14:56:37] PROBLEM - puppet last run on mw1259 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 16 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Service[jobchron],Service[jobrunner] [14:59:32] (03PS2) 10Chad: Remove AdvancedSearch inclusion in beta, it's in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394504 [14:59:38] (03PS3) 10Ema: mtail: add varnishmtail tests [puppet] - 10https://gerrit.wikimedia.org/r/394597 (https://phabricator.wikimedia.org/T177199) [15:01:46] !log reimage restbase1014 - T179422 [15:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:57] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [15:03:08] RECOVERY - Check systemd state on ms-be1039 is OK: OK - running: The system is fully operational [15:05:55] (03Abandoned) 10Eevans: hieradata: enable Cassandra instance: restbase1012-c [puppet] - 10https://gerrit.wikimedia.org/r/394603 (https://phabricator.wikimedia.org/T179422) (owner: 10Eevans) [15:09:52] (03CR) 10jenkins-bot: Remove AdvancedSearch inclusion in beta, it's in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394504 (owner: 10Chad) [15:13:06] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table wikidatawiki.page_props: Cant find record in page_props, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1070-bin.001580, end_log_pos 311464580 [15:13:31] arg [15:18:12] !log demon@tin Synchronized wmf-config/CommonSettings-labs.php: no-op (duration: 00m 45s) [15:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:13] (03PS1) 10Subramanya Sastry: Enable RemexHTML on itwiki and dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395009 (https://phabricator.wikimedia.org/T181188) [15:21:45] PROBLEM - mediawiki-installation DSH group on mw1259 is CRITICAL: Host mw1259 is not in mediawiki-installation dsh group [15:23:14] !log jmm@puppetmaster1001 conftool 
action : set/pooled=yes; selector: mw1259.eqiad.wmnet [15:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:15] <_joe_> !log restarting puppetdb on nihal, will cause puppet failures [15:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:26] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 888.97 seconds [15:25:56] PROBLEM - puppet last run on relforge1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:25:56] PROBLEM - puppet last run on wtp2016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:26:06] PROBLEM - puppet last run on thumbor1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:26:06] PROBLEM - puppet last run on labtestvirt2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:26:06] PROBLEM - puppet last run on mw2241 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:26:16] PROBLEM - puppet last run on mw1207 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:26:16] PROBLEM - puppet last run on mx2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:26:35] PROBLEM - puppet last run on elastic2021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:26:36] PROBLEM - puppet last run on aqs1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:26:36] PROBLEM - puppet last run on hydrogen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:26:36] PROBLEM - puppet last run on lawrencium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:26:36] PROBLEM - puppet last run on labtestneutron2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:26:36] PROBLEM - puppet last run on mw2179 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:26:36] PROBLEM - puppet last run on mw2151 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:26:37] PROBLEM - puppet last run on ms-be1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:26:37] PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:26:45] PROBLEM - puppet last run on mw2197 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:26:46] PROBLEM - puppet last run on mw1229 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:26:56] PROBLEM - puppet last run on mw1258 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:26:56] PROBLEM - puppet last run on mw1195 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:26:56] PROBLEM - puppet last run on mw1308 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:27:05] PROBLEM - puppet last run on kubernetes1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:27:06] PROBLEM - puppet last run on maps2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:27:06] PROBLEM - puppet last run on mw2229 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:27:06] PROBLEM - puppet last run on cp4022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:27:16] PROBLEM - puppet last run on mw1222 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:27:16] PROBLEM - puppet last run on mw1219 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:27:16] PROBLEM - puppet last run on wtp1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:27:16] PROBLEM - puppet last run on cp2010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:27:16] PROBLEM - puppet last run on cp2022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:27:25] PROBLEM - puppet last run on analytics1061 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:27:26] PROBLEM - puppet last run on db1100 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:27:26] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:27:26] PROBLEM - puppet last run on labtestmetal2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:27:35] PROBLEM - puppet last run on mw2166 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:27:35] PROBLEM - puppet last run on mw2116 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:27:35] PROBLEM - puppet last run on labtestnet2002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:27:36] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:27:36] <_joe_> heh it's gonna be more [15:27:44] <_joe_> herron: can you kill ircecho for now? [15:28:45] PROBLEM - puppet last run on db1086 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:28:45] PROBLEM - puppet last run on dbproxy1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:28:45] PROBLEM - puppet last run on labnodepool1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:28:45] PROBLEM - puppet last run on kubestagetcd1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:28:45] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:28:45] PROBLEM - puppet last run on restbase-dev1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:06] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:15] PROBLEM - puppet last run on elastic1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:16] PROBLEM - puppet last run on ganeti1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:16] PROBLEM - puppet last run on es1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:25] PROBLEM - puppet last run on mw1204 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:29:25] PROBLEM - puppet last run on mc2019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:25] PROBLEM - puppet last run on mw2168 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:26] PROBLEM - puppet last run on mw1316 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:26] PROBLEM - puppet last run on elastic1045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:26] PROBLEM - puppet last run on rdb1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:26] PROBLEM - puppet last run on releases1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:26] PROBLEM - puppet last run on restbase2012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:27] checking dbstore1002, AGAIN [15:29:27] PROBLEM - puppet last run on mw2221 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:36] PROBLEM - puppet last run on analytics1044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:36] PROBLEM - puppet last run on ores2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:36] PROBLEM - puppet last run on mc1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:46] PROBLEM - puppet last run on mw1318 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:46] PROBLEM - puppet last run on mw1323 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:29:47] !log disabling ircecho temporarily [15:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:55] PROBLEM - puppet last run on wtp2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:55] PROBLEM - puppet last run on db2063 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:55] PROBLEM - puppet last run on conf2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:55] PROBLEM - puppet last run on mw2165 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:56] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:56] PROBLEM - puppet last run on elastic1050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:56] PROBLEM - puppet last run on db2050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:56] PROBLEM - puppet last run on cp4025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:29:56] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:31:27] !log temporarily disabling puppet agents in eqiad and codfw while puppetdb catches up with command queue [15:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:41] (03PS5) 10Elukey: [WIP] role::puppetmaster::puppetdb: add Prometheus monitoring for puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/394966 [15:46:25] !log demon@tin Started scap: Revert "Special:Preferences: Use OOjs UI" and follow-ups [15:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:14] hi bawolff [15:47:34] hello [15:49:02] (03CR) 10Elukey: "Tested on af-puppetdb01 in labs (thanks to Volans!) and this is what changes:" [puppet] - 10https://gerrit.wikimedia.org/r/394966 (owner: 10Elukey) [15:54:56] (03PS6) 10Elukey: [WIP] role::puppetmaster::puppetdb: add Prometheus monitoring for puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/394966 [15:59:55] thanks no_justification btw [16:01:08] <_joe_> elukey: uh cool, basically all the data of the dashboard [16:01:56] MatmaRex: yw [16:02:05] No point in waiting til thursday! 
[16:02:11] _joe_ some metrics have super weird names with the default config, those might need some refinement :) [16:02:52] <_joe_> whatever :P [16:03:36] but the really good thing is that the jmx exporter is able to reload its config when the config file changes [16:03:39] without daemon restart [16:04:34] (03PS1) 10Herron: puppet: change codfw puppet masters to use eqiad puppetdb server [puppet] - 10https://gerrit.wikimedia.org/r/395028 (https://phabricator.wikimedia.org/T177254) [16:05:11] !log re-enabling puppet agents in codfw [16:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:44] !log demon@tin Finished scap: Revert "Special:Preferences: Use OOjs UI" and follow-ups (duration: 21m 19s) [16:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:40] !log re-enabling puppet agents in eqiad [16:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:01] (03CR) 10Alexandros Kosiaris: [C: 031] "I guess +1 ?" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394618 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [16:18:01] MatmaRex: Why was OOUI for Special:Preferences patch reverted? [16:18:45] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T181779#3809495 (10Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below. Your reque... [16:18:46] Niharika: see the commit message https://gerrit.wikimedia.org/r/#/c/394804/ [16:20:33] MatmaRex: Ah, thanks. It'd be helpful if you can also copy that in the Phab task. Lots of watchers. 
[16:21:13] hm, good point [16:21:48] 10Puppet, 10Wikimedia-Apache-configuration, 10Wikimedia-Language-setup, 10Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3809510 (10StevenJ81) Once again, let me emphasize what LangCom decided and the Board approved: We want the five redirects shown in the Descripti... [16:21:59] akosiaris: you guess +1? :D [16:22:11] (03PS4) 10Ema: mtail: add varnishmtail tests [puppet] - 10https://gerrit.wikimedia.org/r/394597 (https://phabricator.wikimedia.org/T177199) [16:23:04] zeljkof: around? [16:23:15] Amir1: yes [16:23:20] zeljkof: https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2017.12.04/mediawiki?id=AWAiTdINgaOKEclNVIIP&_g=h@44136fa [16:23:40] The bad news is the change is super complex to revert [16:23:45] RECOVERY - puppet last run on dbproxy1009 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [16:23:49] and I don't have any good news [16:23:57] Amir1: :( [16:24:16] RECOVERY - puppet last run on es1019 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [16:24:22] zeljkof: but anomie needs to look into this, he made the first part and he knows better how to fix it [16:24:25] RECOVERY - puppet last run on mc2019 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [16:24:27] RECOVERY - puppet last run on releases1001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [16:24:27] RECOVERY - puppet last run on restbase2012 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [16:24:30] anomie: https://wa.wiktionary.org/wiki/Wiccionaire_copene:arabe_marokin [16:24:35] RECOVERY - puppet last run on analytics1044 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [16:24:35] RECOVERY - puppet last run on ores2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:24:38] and 
https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2017.12.04/mediawiki?id=AWAiTdINgaOKEclNVIIP&_g=h@44136fa [16:24:55] RECOVERY - puppet last run on mw1318 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [16:24:55] RECOVERY - puppet last run on wtp2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:24:56] RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [16:25:05] RECOVERY - puppet last run on puppetcompiler1001 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [16:25:08] Amir1, anomie: I'm in meetings for the next 90 minutes, but let me know if I can do anything to fix the problem [16:25:15] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [16:25:15] RECOVERY - puppet last run on wtp2001 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [16:25:15] RECOVERY - puppet last run on mw2108 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:25:16] RECOVERY - puppet last run on labvirt1017 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [16:25:16] RECOVERY - puppet last run on mw1189 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [16:25:25] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [16:25:26] RECOVERY - puppet last run on mw1230 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:25:26] RECOVERY - puppet last run on analytics1045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:25:26] RECOVERY - puppet last run on dbproxy1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:25:45] RECOVERY - puppet last run on wtp2009 is OK: 
OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:25:45] RECOVERY - puppet last run on mw2217 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:25:45] RECOVERY - puppet last run on ms-be2028 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [16:25:46] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [16:25:46] RECOVERY - puppet last run on mw1287 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [16:25:46] RECOVERY - puppet last run on dumpsdata1002 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [16:25:46] RECOVERY - puppet last run on ms-be1027 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [16:25:47] RECOVERY - puppet last run on ms-be2030 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [16:25:51] zeljkof: i'd suggest filing a UBN bug because this might spread really fast and add anomie, legoktm, and if I'm not wrong Krinkle to it [16:25:55] RECOVERY - puppet last run on mw1234 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [16:26:05] RECOVERY - puppet last run on relforge1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:26:05] RECOVERY - puppet last run on wtp2016 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:26:05] RECOVERY - puppet last run on analytics1028 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [16:26:05] RECOVERY - puppet last run on rdb1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:26:06] RECOVERY - puppet last run on thumbor1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:26:06] RECOVERY - puppet last run on labtestvirt2001 is OK: OK: Puppet is currently 
enabled, last run 2 minutes ago with 0 failures [16:26:15] RECOVERY - puppet last run on mw1288 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [16:26:15] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [16:26:15] RECOVERY - puppet last run on mw1289 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [16:26:15] RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:26:15] RECOVERY - puppet last run on oresrdb1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:26:16] RECOVERY - puppet last run on mw2241 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:26:16] RECOVERY - puppet last run on mw2173 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:26:17] RECOVERY - puppet last run on ms-be2014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:26:17] RECOVERY - puppet last run on sarin is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [16:26:18] RECOVERY - puppet last run on mw1207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:26:25] RECOVERY - puppet last run on labcontrol1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:26:26] RECOVERY - puppet last run on mc1029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:26:26] RECOVERY - puppet last run on mx2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:26:26] RECOVERY - puppet last run on es1012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:26:26] RECOVERY - puppet last run on logstash1006 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [16:26:26] RECOVERY - puppet 
last run on dbproxy1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:26:26] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:26:27] RECOVERY - puppet last run on mw2154 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:26:27] RECOVERY - puppet last run on mw2127 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [16:26:35] RECOVERY - puppet last run on db1077 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:26:35] RECOVERY - puppet last run on mw2222 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [16:26:35] RECOVERY - puppet last run on elastic2021 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:26:36] RECOVERY - puppet last run on hydrogen is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:26:36] RECOVERY - puppet last run on aqs1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:26:36] RECOVERY - puppet last run on lawrencium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:26:36] RECOVERY - puppet last run on labtestneutron2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:26:37] RECOVERY - puppet last run on mw2151 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:26:37] RECOVERY - puppet last run on mw2179 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:26:37] (03PS1) 10Chad: Remove some old symlinks from noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395033 [16:26:39] (03CR) 10Chad: [C: 032] Remove some old symlinks from noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395033 (owner: 10Chad) [16:26:45] RECOVERY - puppet last run on db1092 is OK: OK: Puppet is currently 
enabled, last run 1 minute ago with 0 failures [16:26:45] RECOVERY - puppet last run on mc1027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:26:45] RECOVERY - puppet last run on cp2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:26:45] RECOVERY - puppet last run on ms-be1035 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:26:45] RECOVERY - puppet last run on mw2197 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:26:46] RECOVERY - puppet last run on mw1229 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:26:56] RECOVERY - puppet last run on mw1258 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:26:56] RECOVERY - puppet last run on mw1308 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:26:56] RECOVERY - puppet last run on mw1195 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:27:05] RECOVERY - puppet last run on mc1033 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [16:27:05] RECOVERY - puppet last run on kubernetes1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:27:06] RECOVERY - puppet last run on maps2003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:27:06] RECOVERY - puppet last run on mw2229 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:27:13] I'm calling it a day [16:27:15] RECOVERY - puppet last run on druid1004 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [16:27:16] o/ [16:27:16] RECOVERY - puppet last run on mw1222 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:27:16] RECOVERY - puppet last run on mw1219 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 
failures [16:27:16] RECOVERY - puppet last run on wtp1030 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:27:16] RECOVERY - puppet last run on cp2022 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:27:17] RECOVERY - puppet last run on cp2010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:27:25] RECOVERY - puppet last run on mw1311 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [16:27:25] RECOVERY - puppet last run on druid1003 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [16:27:25] RECOVERY - puppet last run on analytics1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:27:26] RECOVERY - puppet last run on db1100 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:27:26] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:27:26] RECOVERY - puppet last run on labtestmetal2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:27:34] volans: what did you want to hear ? 
it's the output of startproject :P [16:27:35] RECOVERY - puppet last run on mw2166 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:27:35] RECOVERY - puppet last run on mw2116 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:27:36] RECOVERY - puppet last run on labtestnet2002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:27:36] RECOVERY - puppet last run on elastic2007 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [16:27:45] RECOVERY - puppet last run on conf2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:27:45] RECOVERY - puppet last run on db2075 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [16:27:53] the output of startapp is obviously more interesting [16:27:57] akosiaris: lol [16:27:58] (03Merged) 10jenkins-bot: Remove some old symlinks from noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395033 (owner: 10Chad) [16:28:08] (03CR) 10jenkins-bot: Remove some old symlinks from noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395033 (owner: 10Chad) [16:28:15] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [16:28:27] Amir1: thanks [16:28:35] !log demon@tin Started scap: docroot/noc/conf/ drop some dangling symlinks [16:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:45] RECOVERY - puppet last run on db1086 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:28:45] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:28:45] RECOVERY - puppet last run on db1094 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [16:28:45] RECOVERY - puppet last run on labnodepool1002 is OK: OK: Puppet is currently 
enabled, last run 4 minutes ago with 0 failures [16:28:45] RECOVERY - puppet last run on conf1002 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [16:28:46] RECOVERY - puppet last run on kubestagetcd1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:28:46] RECOVERY - puppet last run on labstore2003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:28:47] RECOVERY - puppet last run on restbase-dev1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:28:47] RECOVERY - puppet last run on pc2004 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [16:28:55] RECOVERY - puppet last run on analytics1032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:29:06] RECOVERY - puppet last run on aqs1006 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [16:29:15] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [16:29:15] RECOVERY - puppet last run on db2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:29:15] RECOVERY - puppet last run on elastic1032 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:29:16] RECOVERY - puppet last run on mw1194 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:29:16] RECOVERY - puppet last run on ganeti1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:29:25] RECOVERY - puppet last run on mw1204 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:29:25] RECOVERY - puppet last run on mw2168 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:29:26] RECOVERY - puppet last run on mw1315 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 
failures [16:29:26] RECOVERY - puppet last run on ores2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:29:26] RECOVERY - puppet last run on elastic1045 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:29:26] RECOVERY - puppet last run on mw1316 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:29:26] RECOVERY - puppet last run on rdb1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:29:27] RECOVERY - puppet last run on mw2221 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:29:35] RECOVERY - puppet last run on labvirt1012 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [16:29:36] RECOVERY - puppet last run on mc1022 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:29:36] RECOVERY - puppet last run on wdqs1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:29:36] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:29:46] RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:29:50] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1105:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395034 [16:29:53] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1105:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395034 [16:29:55] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:29:55] RECOVERY - puppet last run on mw1323 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:29:55] RECOVERY - puppet last run on db2063 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:29:55] RECOVERY - puppet last run on 
conf2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:29:55] RECOVERY - puppet last run on oresrdb2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:29:56] RECOVERY - puppet last run on mw2165 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:29:56] RECOVERY - puppet last run on db2030 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:29:57] RECOVERY - puppet last run on elastic1050 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:30:05] RECOVERY - puppet last run on db2050 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:30:16] RECOVERY - puppet last run on wdqs2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:30:21] (03CR) 10Volans: "A working demo in labs is available upon request (still not puppetized, WIP)" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394620 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [16:30:25] RECOVERY - puppet last run on oresrdb2002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:30:25] RECOVERY - puppet last run on mw1278 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:30:25] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:30:25] RECOVERY - puppet last run on labtestneutron2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:30:26] RECOVERY - puppet last run on mw1239 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:30:26] RECOVERY - puppet last run on elastic1048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:30:26] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 
failures [16:30:45] RECOVERY - puppet last run on conf1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:30:45] RECOVERY - puppet last run on mw1312 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:30:45] RECOVERY - puppet last run on mw1187 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:30:45] RECOVERY - puppet last run on ms-be1030 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:30:46] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:30:56] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:30:56] RECOVERY - puppet last run on dbstore2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:31:05] RECOVERY - puppet last run on cerium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:31:05] RECOVERY - puppet last run on elastic1024 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:31:05] RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:31:05] RECOVERY - puppet last run on mw1306 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:31:15] RECOVERY - puppet last run on dbproxy1007 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [16:31:15] RECOVERY - puppet last run on ms-be1024 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:31:15] RECOVERY - puppet last run on maps2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:31:15] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:31:15] RECOVERY - puppet last run on wtp2011 
is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:31:16] RECOVERY - puppet last run on wtp2012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:31:16] RECOVERY - puppet last run on logstash1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:31:16] Jhs: could you create a phabricator task about the problem today during SWAT? I don't even know what went wrong :( [16:31:17] RECOVERY - puppet last run on es2013 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:31:17] RECOVERY - puppet last run on mw2183 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:31:18] RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:31:18] RECOVERY - puppet last run on ms-be2032 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:31:25] RECOVERY - puppet last run on db1106 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [16:31:25] RECOVERY - puppet last run on mw1186 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:31:25] RECOVERY - puppet last run on restbase1013 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:31:25] RECOVERY - puppet last run on wtp2007 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:31:25] RECOVERY - puppet last run on mw2152 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:31:25] (see Amir1's comments up) [16:31:45] RECOVERY - puppet last run on mw1265 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:31:46] RECOVERY - puppet last run on db1052 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:31:46] RECOVERY - puppet last run on nitrogen is OK: OK: Puppet is currently 
enabled, last run 3 minutes ago with 0 failures [16:31:46] RECOVERY - puppet last run on labvirt1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:31:46] RECOVERY - puppet last run on mw2125 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:31:46] RECOVERY - puppet last run on mw1233 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:31:46] RECOVERY - puppet last run on elastic1040 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:31:47] RECOVERY - puppet last run on mw1304 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:31:56] RECOVERY - puppet last run on scb2004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:32:02] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1105:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395034 (owner: 10Marostegui) [16:32:05] RECOVERY - puppet last run on mw1290 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:32:05] RECOVERY - puppet last run on mw1228 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:32:05] RECOVERY - puppet last run on boron is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:32:05] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:32:05] RECOVERY - puppet last run on wtp1045 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:32:06] RECOVERY - puppet last run on elastic2030 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:32:06] RECOVERY - puppet last run on labvirt1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:32:15] RECOVERY - puppet last run on wtp1047 is OK: OK: Puppet is currently enabled, last run 4 minutes ago 
with 0 failures [16:32:15] RECOVERY - puppet last run on db2042 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:32:15] RECOVERY - puppet last run on kubetcd2003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:32:15] RECOVERY - puppet last run on ganeti2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:32:27] Amir1, zeljkof: At first glance, it doesn't seem to be related to my change. What's causing that is that the parser output is containing a section edit link marker for "Copene:Wiccionaire:arabe marokin", which is an invalid title. [16:32:28] !log demon@tin Finished scap: docroot/noc/conf/ drop some dangling symlinks (duration: 03m 53s) [16:32:35] RECOVERY - puppet last run on es2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:32:36] RECOVERY - puppet last run on kafka1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:32:36] RECOVERY - puppet last run on db1072 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:32:36] RECOVERY - puppet last run on cp2012 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:45] RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:33:02] (03PS1) 10Chad: Adding new symlink for all dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395035 [16:33:04] (03CR) 10Chad: [C: 032] Adding new symlink for all dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395035 (owner: 10Chad) [16:33:05] RECOVERY - puppet last run on analytics1035 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:33:07] Amir1, zeljkof: My change didn't do anything to the *generation* of section edit markers, just 
added a new flag for when to turn them into links versus when to remove them. [16:33:13] anomie: could you please create a phab ticket so it doesn't get lost? [16:33:16] RECOVERY - puppet last run on analytics1060 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:33:37] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1105:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395034 (owner: 10Marostegui) [16:33:53] zeljkof: I could, but I don't know anything more that "Here's this weird error, it's not related to my patch". [16:34:00] s/that/than/ [16:34:08] anomie: it's more than I know :) [16:34:15] it would be a great start [16:34:28] (03Merged) 10jenkins-bot: Adding new symlink for all dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395035 (owner: 10Chad) [16:34:32] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1105:3312 - T174569 (duration: 00m 45s) [16:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:41] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [16:35:18] (03PS1) 10Marostegui: db-eqiad.php: Depool db1103:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395036 (https://phabricator.wikimedia.org/T174569) [16:35:20] zeljkof: A purge fixed it for that page, BTW. [16:35:59] anomie: there was another one, let me see... 
[16:36:25] RECOVERY - puppet last run on ms-be1039 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:36:49] anomie: this one is still broken https://wa.wiktionary.org/wiki/Wiccionaire_copene:Pronon%C3%A7aedje_zero-cnoxheu [16:36:55] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1105:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395034 (owner: 10Marostegui) [16:37:01] !log demon@tin Started scap: docroot/noc/conf/dblists [16:37:04] !log demon@tin scap aborted: docroot/noc/conf/dblists (duration: 00m 02s) [16:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:15] !log demon@tin Started scap: docroot/noc/conf/ Adding new dblist symlink [16:37:16] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1103:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395036 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [16:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:53] anomie: during swat, Jhs noticed last two pages here were broken https://wa.wiktionary.org/w/index.php?title=Sipeci%C3%A5s%3AIndecse+pa+betchete&prefix=&namespace=5 [16:38:34] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1103:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395036 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [16:38:44] zeljkof: Purging fixed that one too. This looks like a bug where pages are somehow being parsed with a Title "Wiccionaire_copene:Foobar" in NS_MAIN rather than "Foobar" in NS_PROJECT_TALK. [16:39:04] !log Deploy schema change on db1103:3312 - T174569 [16:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:28] anomie: thanks a lot, so immediate problem is resolved cc Jhs, Amir1 [16:39:43] anomie, purge as in ?action=purge ? 
[16:39:45] if you think there is a bug here, please create a task [16:39:51] Jhs: Yes [16:39:52] !log demon@tin Finished scap: docroot/noc/conf/ Adding new dblist symlink (duration: 02m 36s) [16:39:52] Jhs: yes [16:39:56] 10Operations, 10ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3809576 (10akosiaris) Pasting various finding from ganeti1006 to try and get a clearer picture ``` Dec 3 08:01:32 ganeti1006 kernel: [318591.369562] qemu-system-x86: page alloc... [16:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:18] something something simplest solutions … thanks anomie & zeljkof & Amir1 :) [16:40:32] (03CR) 10Volans: [V: 032 C: 032] Created Django project [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394618 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [16:40:40] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1103:3312 - T174569 (duration: 00m 45s) [16:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:49] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [16:41:48] !log Deploy schema change on db1053 (s2) - T174569 [16:41:52] zeljkof: I suspect it was the swatting of https://gerrit.wikimedia.org/r/#/c/394771/, somehow or other things got confused for pages parsed in the middle of that change. Maybe something was using a cached version with the old name and something else using the updated name. 
[16:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:02] !log demon@tin Synchronized docroot/noc/conf/dblists: double checking symlink move (duration: 00m 44s) [16:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:26] (03CR) 10Muehlenhoff: [WIP] php7 manifests for mediawiki on stretch (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [16:56:06] PROBLEM - Check systemd state on ms-be1039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:56:28] (03CR) 10Paladox: [WIP] php7 manifests for mediawiki on stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [16:57:25] PROBLEM - puppet last run on analytics1069 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:04:28] 10Operations: etcd-mirror failure - https://phabricator.wikimedia.org/T181920#3809730 (10Joe) It was trying to cleanup. There are issues with twisted + deferToThread I never got around solving. The 'NoneType' + int is a bug in logging that I should indeed fix but is not critical. [17:08:13] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scoring-platform-team (Current): Connection timeout from tin to new ores servers - https://phabricator.wikimedia.org/T181661#3809754 (10mmodell) >>! In T181661#3808791, @akosiaris wrote: > > So, scap closed the connection, 1m, 6s after the login. scap c... 
[17:08:58] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.5 (duration: 02m 58s) [17:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:42] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.6 (duration: 02m 29s) [17:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:41] 10Operations, 10Trending-Service, 10Reading-Infrastructure-Team-Backlog (Kanban), 10Services (blocked): Turn off Trending Service - https://phabricator.wikimedia.org/T180384#3809824 (10mobrovac) [17:16:56] 10Operations, 10Goal, 10Patch-For-Review: Provide dedicated database resources for wikidata - https://phabricator.wikimedia.org/T177208#3809842 (10jcrespo) [17:16:58] 10Operations, 10DBA, 10MediaWiki-Configuration, 10Wikidata: Test moving testwikidatawiki database to s8 replica set on Wikimedia - https://phabricator.wikimedia.org/T180694#3809840 (10jcrespo) 05Open>03declined We are happy with the configuration on both eqiad and codfw, we do not need to test testwiki... [17:21:35] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3809898 (10mobrovac) [17:23:12] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), and 2 others: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3809916 (10akosiaris) [17:23:14] 10Operations, 10ORES, 10Scoring-platform-team: Problem with Redis server configuration on new ORES cluster - https://phabricator.wikimedia.org/T181806#3809914 (10akosiaris) 05Open>03Resolved So the boxes were rebooted some 18 days ago, and given the stresstest was supposed to be not lasting that long, th... 
[17:23:45] (03CR) 10jenkins-bot: Adding new symlink for all dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395035 (owner: 10Chad) [17:23:47] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1103:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395036 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [17:27:25] RECOVERY - puppet last run on analytics1069 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [17:37:44] (03CR) 10Zoranzoki21: [C: 031] Enable RemexHTML on itwiki and dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395009 (https://phabricator.wikimedia.org/T181188) (owner: 10Subramanya Sastry) [17:38:03] 10Operations, 10Ops-Access-Requests, 10Discovery, 10Wikidata, and 2 others: Enable wdqs-admin's to control nginx - https://phabricator.wikimedia.org/T181540#3809947 (10Gehel) Has been approved during weekly Ops meeting [17:38:06] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to terbium/wasat for Trey Jones - https://phabricator.wikimedia.org/T181479#3809948 (10Gehel) Has been approved during weekly Ops meeting [17:40:04] (03CR) 10Dzahn: [C: 032] "has been approved in meeting" [puppet] - 10https://gerrit.wikimedia.org/r/393988 (https://phabricator.wikimedia.org/T181479) (owner: 10Dzahn) [17:41:00] !log demon@tin Synchronized php-1.31.0-wmf.10/extensions/CirrusSearch/maintenance/: fix some deprecated spam (duration: 00m 44s) [17:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:32] (03CR) 10Elukey: "Madhu let me also know if we need to backup data etc.. 
before proceeding :)" [puppet] - 10https://gerrit.wikimedia.org/r/394985 (https://phabricator.wikimedia.org/T181518) (owner: 10Elukey) [17:46:22] 10Operations, 10ORES, 10Scap, 10Scoring-platform-team, 10Release-Engineering-Team (Watching / External): scap support for git-lfs - https://phabricator.wikimedia.org/T181855#3809968 (10demon) > What will happen if we try to checkout a project with git-lfs-enabled submodules on tin? Scap does not speak g... [17:51:35] 10Operations, 10ORES, 10Scap, 10Scoring-platform-team, 10Release-Engineering-Team (Watching / External): scap support for git-lfs - https://phabricator.wikimedia.org/T181855#3809993 (10awight) >> And will scap be able to fetch and checkout on deployment targets? > > See above. The only caveat is that I'... [18:00:04] gehel: I, the Bot under the Fountain, allow thee, The Deployer, to do Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171204T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:00:23] jouncebot: GUI update is planned for wdqs [18:01:07] 10Operations, 10ORES, 10Scap, 10Scoring-platform-team, 10Release-Engineering-Team (Watching / External): scap support for git-lfs - https://phabricator.wikimedia.org/T181855#3810031 (10demon) Phab's already behind Varnish, Gerrit is not yet (cf some bug I don't have in front of me). But with objects this... 
[18:02:28] !log gehel@tin Started deploy [wdqs/wdqs@2873745]: wdqs GUI update [18:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:34] (03PS2) 10Dzahn: admins: add tjones to group 'restricted' [puppet] - 10https://gerrit.wikimedia.org/r/393988 (https://phabricator.wikimedia.org/T181479) [18:04:38] !log gehel@tin Finished deploy [wdqs/wdqs@2873745]: wdqs GUI update (duration: 02m 10s) [18:04:44] 10Operations, 10ORES, 10Scap, 10Scoring-platform-team, 10Release-Engineering-Team (Watching / External): scap support for git-lfs - https://phabricator.wikimedia.org/T181855#3810042 (10awight) Good point, that won't work at all. On the bright side, we know that the git-lfs load is actually smaller than... [18:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:55] SMalyshev: wdqs deployment completed, tests are green [18:05:11] gehel: great, thanks! [18:06:23] AaronSchulz: no_justification: greg-g: Can we backport https://gerrit.wikimedia.org/r/394772 in the SWAT in ~1h? [18:06:33] (03CR) 10Dzahn: "[terbium:~] $ id tjones" [puppet] - 10https://gerrit.wikimedia.org/r/393988 (https://phabricator.wikimedia.org/T181479) (owner: 10Dzahn) [18:06:51] (03PS4) 10Gehel: Enable wdqs-admins to restart nginx [puppet] - 10https://gerrit.wikimedia.org/r/393814 (https://phabricator.wikimedia.org/T181540) (owner: 10Smalyshev) [18:07:25] (03CR) 10Hoo man: "Please consider T179317#3802980 before merging this." [puppet] - 10https://gerrit.wikimedia.org/r/394102 (https://phabricator.wikimedia.org/T179317) (owner: 10Dzahn) [18:07:35] (03CR) 10Gehel: [C: 032] Enable wdqs-admins to restart nginx [puppet] - 10https://gerrit.wikimedia.org/r/393814 (https://phabricator.wikimedia.org/T181540) (owner: 10Smalyshev) [18:08:50] hoo: Why wait? [18:08:50] :) [18:10:02] (03PS1) 10Awight: Try simplewiki ORES on beta. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/395052 (https://phabricator.wikimedia.org/T181848) [18:10:07] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to terbium/wasat for Trey Jones - https://phabricator.wikimedia.org/T181479#3810059 (10Dzahn) 05stalled>03Resolved Hi @TJones your request has been approved and the Gerrit change is merged. Puppet created your user on terbium.eq... [18:11:02] PROBLEM - cassandra-a CQL 10.64.48.135:9042 on restbase1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:11:02] (03CR) 10jerkins-bot: [V: 04-1] Try simplewiki ORES on beta. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395052 (https://phabricator.wikimedia.org/T181848) (owner: 10Awight) [18:11:39] 10Operations, 10Ops-Access-Requests, 10Discovery, 10Wikidata, and 2 others: Enable wdqs-admin's to control nginx - https://phabricator.wikimedia.org/T181540#3810066 (10Dzahn) 05stalled>03Resolved This should be resolved now (https://gerrit.wikimedia.org/r/#/c/393814/4/modules/admin/data/data.yaml) Fee... [18:13:32] 10Operations, 10ORES, 10Scap, 10Scoring-platform-team, 10Release-Engineering-Team (Watching / External): scap support for git-lfs - https://phabricator.wikimedia.org/T181855#3810070 (10demon) Yep I think we're on the same page here. [18:14:22] (03PS1) 10Jdlrobson: Enable Page Previews EventLogging instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395053 (https://phabricator.wikimedia.org/T181493) [18:14:26] no_justification: If that's fine with you, I can also do it now [18:14:32] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:14:39] hoo: Go forth! 
[18:14:48] (03PS1) 10Elukey: role::hadoop::master|standby: add Prometheus JMX exporter config [puppet] - 10https://gerrit.wikimedia.org/r/395054 (https://phabricator.wikimedia.org/T177458) [18:14:53] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site [18:15:41] (03CR) 10jerkins-bot: [V: 04-1] role::hadoop::master|standby: add Prometheus JMX exporter config [puppet] - 10https://gerrit.wikimedia.org/r/395054 (https://phabricator.wikimedia.org/T177458) (owner: 10Elukey) [18:16:15] (03PS2) 10Awight: Try simplewiki ORES on beta. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395052 (https://phabricator.wikimedia.org/T181848) [18:16:23] PROBLEM - cassandra-a service on restbase1014 is CRITICAL: NRPE: Command check_cassandra-a-state not defined [18:16:30] that's me ^ silencing [18:16:48] is gerrit2001 downtime expired? [18:16:58] no_justification: let's make a ticket for that icinga-wm line, like "skip monitoring on inactive server" or "let gerrit process run on both" [18:17:01] elukey: yes [18:17:04] ^ [18:17:04] ah sorry :) [18:17:33] i was trying to find something to link to it though [18:17:44] (03CR) 10Zoranzoki21: [C: 031] Enable Page Previews EventLogging instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395053 (https://phabricator.wikimedia.org/T181493) (owner: 10Jdlrobson) [18:18:06] (03CR) 10Zoranzoki21: [C: 031] Try simplewiki ORES on beta. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395052 (https://phabricator.wikimedia.org/T181848) (owner: 10Awight) [18:18:22] it might be phab search or me or we didn't have one [18:18:25] mutante: That'd be nice. We silenced it over the weekend [18:18:47] no_justification: the "skip icinga if on warm-standby" ? 
[18:18:59] Well, ideally we'd want it actually warm [18:19:01] Right now it's kind of chilly :p [18:19:04] hehe, yea [18:19:07] So we'd want alerting [18:19:16] It's just kinda perma-broken right now pending the DB changes [18:19:17] (03PS2) 10Elukey: role::hadoop::master|standby: add Prometheus JMX exporter config [puppet] - 10https://gerrit.wikimedia.org/r/395054 (https://phabricator.wikimedia.org/T177458) [18:19:33] yes, i know, blocked on that [18:19:51] (03CR) 10Jdlrobson: [C: 031] Add converted copyright svg images as png files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394820 (https://phabricator.wikimedia.org/T166684) (owner: 10Divadsn) [18:19:52] hmm, but didn't we have the ticket for that.. i just cant find it i guess [18:21:29] (03PS3) 10Elukey: role::hadoop::master|standby: add Prometheus JMX exporter config [puppet] - 10https://gerrit.wikimedia.org/r/395054 (https://phabricator.wikimedia.org/T177458) [18:21:49] (03CR) 10Zoranzoki21: [C: 031] Add converted copyright svg images as png files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394820 (https://phabricator.wikimedia.org/T166684) (owner: 10Divadsn) [18:23:51] i put it in scheduled downtime until March [18:24:04] ACKNOWLEDGEMENT - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn blocked on codfw db access [18:24:04] ACKNOWLEDGEMENT - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site daniel_zahn blocked on codfw db access [18:24:44] (03PS1) 10ArielGlenn: clean up old slow parse log datasets via cron [puppet] - 10https://gerrit.wikimedia.org/r/395058 (https://phabricator.wikimedia.org/T174421) [18:24:48] greg-g: heads up, I’m planning to deploy a beta cluster config change in-between windows, at 18:30 UTC. 
low priority if that’s not a good time. [18:25:36] (03CR) 10jerkins-bot: [V: 04-1] clean up old slow parse log datasets via cron [puppet] - 10https://gerrit.wikimedia.org/r/395058 (https://phabricator.wikimedia.org/T174421) (owner: 10ArielGlenn) [18:26:17] awight: should be fine [18:26:24] great, ty [18:27:01] (03PS2) 10ArielGlenn: clean up old slow parse log datasets via cron [puppet] - 10https://gerrit.wikimedia.org/r/395058 (https://phabricator.wikimedia.org/T174421) [18:27:49] moritzm: on mw1259 service jobchron is masked. is that intentional? i see you pooled it earlier [18:28:01] 10Operations, 10ops-eqiad, 10DC-Ops: Decommission Vanadium - https://phabricator.wikimedia.org/T182015#3810108 (10Cmjohnson) [18:28:31] !log bootstrap restbase1014-a - T179422 [18:28:32] RECOVERY - cassandra-a service on restbase1014 is OK: OK - cassandra-a is active [18:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:43] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [18:29:01] urandom: ^ [18:31:01] (03PS1) 10Awight: Enable ORES on simplewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395059 (https://phabricator.wikimedia.org/T181848) [18:31:10] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site [18:31:29] (03CR) 10Awight: [C: 032] Try simplewiki ORES on beta. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/395052 (https://phabricator.wikimedia.org/T181848) (owner: 10Awight) [18:31:54] 10Operations, 10ops-eqiad, 10DC-Ops: Decommission server zinc - https://phabricator.wikimedia.org/T182016#3810128 (10Cmjohnson) [18:32:05] (03PS4) 10Elukey: role::hadoop::master|standby: add Prometheus JMX exporter config [puppet] - 10https://gerrit.wikimedia.org/r/395054 (https://phabricator.wikimedia.org/T177458) [18:33:38] 10Operations, 10Ops-Access-Requests: Requesting access to terbium.eqiad.wmnet for cparle - https://phabricator.wikimedia.org/T181626#3810144 (10Dzahn) a:03Dzahn [18:34:50] (03Merged) 10jenkins-bot: Try simplewiki ORES on beta. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395052 (https://phabricator.wikimedia.org/T181848) (owner: 10Awight) [18:35:07] (03PS1) 10Dzahn: admins: add cparle to group 'restricted' [puppet] - 10https://gerrit.wikimedia.org/r/395061 (https://phabricator.wikimedia.org/T181626) [18:35:16] !log hoo@tin Synchronized php-1.31.0-wmf.10/includes/objectcache/ObjectCache.php: Only send statsd data for WAN cache in non-CLI mode (T181385) (duration: 00m 44s) [18:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:26] T181385: Wikidata entity dumpers stuck with 100% CPU on snapshot1007 - https://phabricator.wikimedia.org/T181385 [18:35:53] (03PS2) 10Dzahn: admins: add cparle to group 'restricted' [puppet] - 10https://gerrit.wikimedia.org/r/395061 (https://phabricator.wikimedia.org/T181626) [18:36:45] (03CR) 10Dzahn: [C: 04-1] "was ldap-only user" [puppet] - 10https://gerrit.wikimedia.org/r/395061 (https://phabricator.wikimedia.org/T181626) (owner: 10Dzahn) [18:36:46] (03CR) 10Zoranzoki21: [C: 031] "Now should be ok" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395059 (https://phabricator.wikimedia.org/T181848) (owner: 10Awight) [18:37:11] (03CR) 10jenkins-bot: Try simplewiki ORES on beta. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/395052 (https://phabricator.wikimedia.org/T181848) (owner: 10Awight) [18:37:29] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational [18:39:44] The fix worked :) [18:39:48] I'm done backporting [18:40:40] !log Ran scap pull on snapshot1001 (T181385) [18:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:48] T181385: Wikidata entity dumpers stuck with 100% CPU on snapshot1007 - https://phabricator.wikimedia.org/T181385 [18:40:57] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Patch-For-Review: all log producers need to use the logstash LVS endpoint - https://phabricator.wikimedia.org/T175242#3810153 (10Gehel) Monitoring traffic for a few hours on logstash100[123] shows that nothing is coming into any of th... [18:43:06] (03PS5) 10Elukey: role::hadoop::master|standby: add Prometheus JMX exporter config [puppet] - 10https://gerrit.wikimedia.org/r/395054 (https://phabricator.wikimedia.org/T177458) [18:43:15] (03PS3) 10Dzahn: admins: add cparle to group 'restricted' [puppet] - 10https://gerrit.wikimedia.org/r/395061 (https://phabricator.wikimedia.org/T181626) [18:46:40] !log Started dumpwikidatajson.sh on snapshot1007 (T181385) [18:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:51] T181385: Wikidata entity dumpers stuck with 100% CPU on snapshot1007 - https://phabricator.wikimedia.org/T181385 [18:49:03] (03CR) 10Framawiki: robots.txt: Remove old and disabled archive.org_bot rule (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/358171 (https://phabricator.wikimedia.org/T7582) (owner: 10Framawiki) [18:50:05] (03PS1) 10Awight: Add ORES filter thresholds for simplewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395066 (https://phabricator.wikimedia.org/T181848) [18:50:44] godog: thanks! 
[18:50:52] (03CR) 10Awight: [C: 032] Add ORES filter thresholds for simplewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395066 (https://phabricator.wikimedia.org/T181848) (owner: 10Awight) [18:51:18] (03PS6) 10Elukey: role::hadoop::master|standby: add Prometheus JMX exporter config [puppet] - 10https://gerrit.wikimedia.org/r/395054 (https://phabricator.wikimedia.org/T177458) [18:54:29] (03CR) 10Dzahn: [C: 032] "already had approvals" [puppet] - 10https://gerrit.wikimedia.org/r/395061 (https://phabricator.wikimedia.org/T181626) (owner: 10Dzahn) [18:55:26] (03PS7) 10Elukey: role::hadoop::master|standby: add Prometheus JMX exporter config [puppet] - 10https://gerrit.wikimedia.org/r/395054 (https://phabricator.wikimedia.org/T177458) [18:57:46] awww.. i had to do that in 2 changes [18:57:58] knew but did it again anyways [18:58:09] (when you add a new user and also add it to a group) [18:59:12] (03PS3) 10ArielGlenn: clean up old slow parse log datasets via cron [puppet] - 10https://gerrit.wikimedia.org/r/395058 (https://phabricator.wikimedia.org/T174421) [19:00:03] (03CR) 10Phuedx: [C: 04-1] "Sorry. This is blocked on T180036. I think I should've moved T181493 to Blocked." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395053 (https://phabricator.wikimedia.org/T181493) (owner: 10Jdlrobson) [19:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Morning SWAT (Max 8 patches) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171204T1900). [19:00:07] Jdlrobson and legoktm: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[19:00:21] (03CR) 10ArielGlenn: [C: 032] clean up old slow parse log datasets via cron [puppet] - 10https://gerrit.wikimedia.org/r/395058 (https://phabricator.wikimedia.org/T174421) (owner: 10ArielGlenn) [19:00:29] (03PS1) 10Dzahn: admins: temp remove cparle from restricted group [puppet] - 10https://gerrit.wikimedia.org/r/395069 [19:00:51] (03PS2) 10Dzahn: admins: temp remove cparle from restricted group [puppet] - 10https://gerrit.wikimedia.org/r/395069 [19:01:06] hi [19:01:17] \o [19:02:53] (03PS3) 10Dzahn: admins: add missing command in restricted user list [puppet] - 10https://gerrit.wikimedia.org/r/395069 [19:03:09] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:03:13] FYI legoktm jdlrobson + unknown SWATter, I have an InitialiseSettings-labs.php lingering in the merge pipeline, I’ll coordinate again once/if it ever merges: https://gerrit.wikimedia.org/r/#/c/395066/ [19:03:18] (03PS4) 10Dzahn: admins: add missing command in restricted user list [puppet] - 10https://gerrit.wikimedia.org/r/395069 [19:03:28] mutante: Would you mind having a peek at https://gerrit.wikimedia.org/r/#/c/394203/? [19:03:32] (beta-only apache config) [19:03:35] (03CR) 10Dzahn: [V: 032 C: 032] admins: add missing command in restricted user list [puppet] - 10https://gerrit.wikimedia.org/r/395069 (owner: 10Dzahn) [19:04:07] jdlrobson: I guess I'll swat. Your first patch has a -1 on it from phuedx ? [19:04:24] awight: I can just sync it out since it's a no-op [19:04:29] (03Merged) 10jenkins-bot: Add ORES filter thresholds for simplewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395066 (https://phabricator.wikimedia.org/T181848) (owner: 10Awight) [19:04:39] legoktm: great, ty. Ah there it goes ^ [19:05:34] legoktm: yeh i removed it from calendar. 
News to me :) [19:05:41] (03CR) 10Legoktm: [C: 032] Add converted copyright svg images as png files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394820 (https://phabricator.wikimedia.org/T166684) (owner: 10Divadsn) [19:05:44] legoktm: so can skip that one, config only :) [19:05:47] (03CR) 10Legoktm: [C: 032] Set $wgRestrictionMethod = 'firejail'; everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393825 (https://phabricator.wikimedia.org/T173370) (owner: 10Legoktm) [19:06:04] (03CR) 10jenkins-bot: Add ORES filter thresholds for simplewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395066 (https://phabricator.wikimedia.org/T181848) (owner: 10Awight) [19:06:06] !log legoktm@tin Synchronized wmf-config/InitialiseSettings-labs.php: beta: Add ORES filter thresholds for simplewiki (duration: 00m 43s) [19:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:17] jdlrobson: is it ok if I just sync the images everywhere or do you want to test them on mwdebug first? [19:07:27] legoktm: fine for me yes [19:07:48] no mwdebug needed [19:08:09] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:08:09] (03PS8) 10Elukey: role::hadoop::master|standby: add Prometheus JMX exporter config [puppet] - 10https://gerrit.wikimedia.org/r/395054 (https://phabricator.wikimedia.org/T177458) [19:09:12] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to terbium.eqiad.wmnet for cparle - https://phabricator.wikimedia.org/T181626#3810314 (10Dzahn) 05Open>03Resolved Hi @cparle your request has been approved and the code change has been merged. Puppet created your user on terbium... 
[19:09:43] (03Merged) 10jenkins-bot: Add converted copyright svg images as png files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394820 (https://phabricator.wikimedia.org/T166684) (owner: 10Divadsn) [19:10:01] (03CR) 10jenkins-bot: Add converted copyright svg images as png files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394820 (https://phabricator.wikimedia.org/T166684) (owner: 10Divadsn) [19:11:22] !log legoktm@tin Synchronized static/images/mobile/copyright/: Add converted copyright svg images as png files - https://gerrit.wikimedia.org/r/#/c/394820/ (duration: 00m 43s) [19:11:28] jdlrobson: ^ [19:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:57] (03PS2) 10Legoktm: Set $wgRestrictionMethod = 'firejail'; everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393825 (https://phabricator.wikimedia.org/T173370) [19:12:10] thanks legoktm ! [19:12:13] np [19:12:21] (03CR) 10Legoktm: [C: 032] Set $wgRestrictionMethod = 'firejail'; everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393825 (https://phabricator.wikimedia.org/T173370) (owner: 10Legoktm) [19:12:42] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to terbium.eqiad.wmnet for cparle - https://phabricator.wikimedia.org/T181626#3810333 (10Dzahn) P.S. Like all shell users, this also gives you a home dir on https://wikitech.wikimedia.org/wiki/People.wikimedia.org so you can use http... [19:13:25] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to terbium/wasat for Trey Jones - https://phabricator.wikimedia.org/T181479#3810339 (10Dzahn) P.S. Like all shell users, this also gives you a home dir on https://wikitech.wikimedia.org/wiki/People.wikimedia.org so you can use https:... 
[19:14:18] (03Merged) 10jenkins-bot: Set $wgRestrictionMethod = 'firejail'; everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393825 (https://phabricator.wikimedia.org/T173370) (owner: 10Legoktm) [19:15:44] (03CR) 10jenkins-bot: Set $wgRestrictionMethod = 'firejail'; everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393825 (https://phabricator.wikimedia.org/T173370) (owner: 10Legoktm) [19:16:21] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review, 10Performance-Team (Radar): Varnish and Apache debug tools and logs for hoo - https://phabricator.wikimedia.org/T179317#3810343 (10Dzahn) [19:17:12] !log legoktm@tin Synchronized wmf-config/InitialiseSettings.php: Set $wgRestrictionMethod = 'firejail'; everywhere (T173370) (duration: 00m 43s) [19:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:22] T173370: Support restricted execution of external commands (via firejail) - https://phabricator.wikimedia.org/T173370 [19:18:12] !log gehel@tin Started deploy [kartotherian/deploy@e166d87]: testing new kartotherian packaging on maps-test2003 [19:18:15] !log gehel@tin Finished deploy [kartotherian/deploy@e166d87]: testing new kartotherian packaging on maps-test2003 (duration: 00m 03s) [19:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:07] (03PS9) 10Elukey: role::hadoop::master|standby: add Prometheus JMX exporter config [puppet] - 10https://gerrit.wikimedia.org/r/395054 (https://phabricator.wikimedia.org/T177458) [19:24:07] (03Abandoned) 10Eevans: cassandra: move machines from restbase to restbase_ng cluster [puppet] - 10https://gerrit.wikimedia.org/r/382506 (https://phabricator.wikimedia.org/T177501) (owner: 10Eevans) [19:24:52] 10Operations, 10Patch-For-Review, 10Services (doing): Prometheus cluster attribute for new RESTBase Cassandra cluster - 
https://phabricator.wikimedia.org/T177501#3810361 (10Eevans) 05Open>03Resolved p:05Triage>03Low a:03Eevans Doing this would cause a loss of metric history for the machines with a... [19:25:15] 10Operations, 10Patch-For-Review, 10Services (done): Prometheus cluster attribute for new RESTBase Cassandra cluster - https://phabricator.wikimedia.org/T177501#3810365 (10Eevans) [19:25:37] (03PS1) 10Anomie: Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395073 (https://phabricator.wikimedia.org/T166733) [19:25:38] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Migrate htmlCacheUpdate job to Kafka - https://phabricator.wikimedia.org/T182023#3810367 (10Pchelolo) [19:27:06] legoktm: Still SWATting? I'd like to add one more. https://gerrit.wikimedia.org/r/395073 [19:27:12] * anomie is adding to Deployments now [19:27:32] anomie: you're in luck, I didn't close my terminals yet :) [19:27:51] (03CR) 10Legoktm: [C: 032] Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395073 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [19:29:29] (03Merged) 10jenkins-bot: Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395073 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [19:29:40] (03CR) 10jenkins-bot: Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395073 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [19:31:55] !log legoktm@tin Synchronized wmf-config/InitialiseSettings.php: Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on test wikis (T166733) (duration: 00m 43s) [19:32:01] anomie: ^ [19:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:07] T166733: 
Deploy refactored comment storage - https://phabricator.wikimedia.org/T166733 [19:32:45] legoktm: Worked [19:32:54] :D [19:32:56] emojis for everyone [19:44:03] (03PS10) 10Elukey: role::hadoop::master|standby: add Prometheus JMX exporter config [puppet] - 10https://gerrit.wikimedia.org/r/395054 (https://phabricator.wikimedia.org/T177458) [19:48:29] (03PS11) 10Elukey: role::hadoop::master|standby: add Prometheus JMX exporter config [puppet] - 10https://gerrit.wikimedia.org/r/395054 (https://phabricator.wikimedia.org/T177458) [19:50:37] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/9155/" [puppet] - 10https://gerrit.wikimedia.org/r/395054 (https://phabricator.wikimedia.org/T177458) (owner: 10Elukey) [19:51:21] 10Operations, 10Operations-Software-Development: DNS repo: add CI checks for obvious configuration errors - https://phabricator.wikimedia.org/T182028#3810507 (10Volans) [20:01:38] (03CR) 10Smalyshev: [C: 031] wdqs: use the /readiness-probe in WDQS icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/394987 (https://phabricator.wikimedia.org/T181989) (owner: 10Gehel) [20:01:40] (03CR) 10Smalyshev: [C: 031] "Seems to be working for me." 
[puppet] - 10https://gerrit.wikimedia.org/r/394021 (https://phabricator.wikimedia.org/T173772) (owner: 10Gehel) [20:04:19] (03PS1) 10Anomie: Revert "Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on test wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395077 [20:04:21] (03CR) 10Anomie: [C: 032] Revert "Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on test wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395077 (owner: 10Anomie) [20:05:33] (03Merged) 10jenkins-bot: Revert "Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on test wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395077 (owner: 10Anomie) [20:05:55] (03CR) 10jenkins-bot: Revert "Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on test wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395077 (owner: 10Anomie) [20:07:08] !log anomie@tin Synchronized wmf-config/InitialiseSettings.php: Revert wgCommentTableSchemaMigrationStage change, breaks too much stuff (duration: 00m 43s) [20:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:36] HEH [20:08:41] (03PS3) 10Smalyshev: wdqs: cleanup JVM options for blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/388026 (https://phabricator.wikimedia.org/T175919) (owner: 10Gehel) [20:08:48] (capslokc) [20:15:50] wotcha, any operations people with OTRS access about? 
Got a report that the IP 91.198.174.192 is being flagged by a couple of security tools, and is preventing some people connecting - it looks like some malware has taken to calling `en.wikipedia.org` as a way of verifying an internet connection (https://www.virustotal.com/#/ip-address/91.198.174.192) [20:17:49] would appreciate an operations person to provide a second opinion on my proposed reply of, in not so many words, "not our end" [20:29:10] 10Operations, 10ops-eqiad, 10DC-Ops: Decommission osm-web100[1-4] - https://phabricator.wikimedia.org/T182033#3810615 (10Cmjohnson) [20:30:56] 10Operations, 10ops-eqiad, 10DC-Ops: Decommission osm-cp100[1-4] - https://phabricator.wikimedia.org/T182034#3810632 (10Cmjohnson) [20:32:04] TheresNoTime: bleugh [20:32:38] all the fun of the fair \o/ [20:32:57] I'd love to know what files it reckons it's scanned... [20:33:09] 2017-11-27 [20:33:09] 52/68 [20:33:09] Win32 EXE [20:33:09] svchost10.exe [20:33:47] phab? [20:33:57] yeah? [20:34:07] I don't think there's anything to be done about it though [20:35:10] https://www.virustotal.com/#/file/5fafd3e13e22829612f2324ed8159a89745b3fdf5bc77c6de4e109fad671a4e1/behavior for example literally just GETs `en.wikipedia.org`, I imagine just to check it's got a connection to the 'net [20:39:01] (03PS1) 10Herron: puppetdb: add command-processing threads setting to puppetdb::app [puppet] - 10https://gerrit.wikimedia.org/r/395080 (https://phabricator.wikimedia.org/T179722) [20:41:16] TheresNoTime, doesn't strictly need an opsen with OTRS access. you could probably forward the ticket to them [20:41:30] but yeah that's text-lb.esams, not a good IP to be blocking [20:44:05] Usually... We tell them to recheck.. And it goes through ok [20:45:11] I asked for which tools are flagging, things like Trend Micro etc, but on Trend Micro's site you can search an IP, and it's not listed there.. 
[20:45:22] Could've been cleared already [20:45:50] (03PS1) 10Smalyshev: Fix deleting old categories [puppet] - 10https://gerrit.wikimedia.org/r/395082 [20:47:01] I'll ask them to recheck, but is there a sorta-private ops list I could forward this on to? [20:47:43] File a security task in phab and tag it operations [20:47:52] Good call [20:47:57] Usually, there's little/nothing we can actually do [20:48:09] Because the sites blocking us won't tell us what's actually wrong [20:48:45] Yeah, I imagine there won't be :) I *think* it's more a "malware is calling back to that IP therefore that IP is doing C&C" [20:55:26] (03PS2) 10Addshore: Remove obsolete WikibaseQualityConstraints settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392449 (owner: 10Lucas Werkmeister (WMDE)) [20:56:10] (03CR) 10Addshore: "Well, the build is dead so we can probably move ahead with this now without worrying." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392449 (owner: 10Lucas Werkmeister (WMDE)) [20:57:11] 10Operations, 10Security: OTRS ticket: Site registering Malware command and control - https://phabricator.wikimedia.org/T182038#3810747 (10Samtar) [20:57:38] Reedy: ^ should I protect that? [20:57:50] You should've reported it as a security bug ;( [20:57:52] *;) [20:58:08] (03PS1) 10Eevans: hieradata: enabled restbase1014-b for bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/395085 (https://phabricator.wikimedia.org/T179422) [20:58:10] (03PS1) 10Eevans: hieradata: enable restbase1014-c for bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/395086 (https://phabricator.wikimedia.org/T179422) [20:58:28] (03CR) 10Eevans: [C: 04-1] "Not yet ready" [puppet] - 10https://gerrit.wikimedia.org/r/395085 (https://phabricator.wikimedia.org/T179422) (owner: 10Eevans) [20:58:30] mew mew, I couldn't tag Operations straight off as a security bug :P [20:58:39] (03CR) 10Eevans: [C: 04-1] "Not yet ready." 
[puppet] - 10https://gerrit.wikimedia.org/r/395086 (https://phabricator.wikimedia.org/T179422) (owner: 10Eevans) [20:58:51] * TheresNoTime will protect it /now/ [20:59:07] Yeah [20:59:13] Just edit after that [21:00:05] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171204T2100). [21:00:05] No GERRIT patches in the queue for this window AFAICS. [21:00:20] no parsoid deploy today [21:00:33] ouch that was terrible #bothumor [21:00:49] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds [21:00:50] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds [21:00:50] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds [21:01:29] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds [21:01:29] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds [21:01:40] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds [21:03:04] (03CR) 10Addshore: [C: 04-1] "Needs a rebase" [puppet] - 10https://gerrit.wikimedia.org/r/387282 (owner: 10Hoo man) [21:03:39] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds [21:07:11] Minor ORES upgrade en route... 
[21:07:18] !log gehel@tin Started deploy [kartotherian/deploy@6e223df]: testing new kartotherian packaging on maps-test2003 [21:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:38] !log gehel@tin Finished deploy [kartotherian/deploy@6e223df]: testing new kartotherian packaging on maps-test2003 (duration: 00m 20s) [21:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:45] I seem to have shot myself in the foot a few days ago: > 21:12:09 deploy failed: Failed to acquire lock "/var/lock/scap.ores_deploy.lock"; owner is "awight"; reason is "(non-production) Test ORES deployment to ores100*" [21:13:20] oh, I own it. [21:13:26] !log awight@tin Started deploy [ores/deploy@6baed71]: Update ORES to 6baed71 [21:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:33] awight: Users own their own lockfiles :) [21:16:38] (that would be super frustrating if not, heh) [21:17:31] (03PS4) 10Ottomata: Improvements for Kafka + SSL [puppet] - 10https://gerrit.wikimedia.org/r/394438 (https://phabricator.wikimedia.org/T167304) [21:18:11] (03CR) 10jerkins-bot: [V: 04-1] Improvements for Kafka + SSL [puppet] - 10https://gerrit.wikimedia.org/r/394438 (https://phabricator.wikimedia.org/T167304) (owner: 10Ottomata) [21:18:12] It would make it difficult to operate on my own foot [21:19:51] (03CR) 10Ottomata: "Luca, ok, I got rid of super.users. This should work. 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/394438 (https://phabricator.wikimedia.org/T167304) (owner: 10Ottomata) [21:22:50] (03PS1) 10Hashar: contint: a slave script will require 'jq' [puppet] - 10https://gerrit.wikimedia.org/r/395097 (https://phabricator.wikimedia.org/T181938) [21:24:19] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1005 is CRITICAL: Return code of 255 is out of bounds [21:26:20] !log awight@tin Finished deploy [ores/deploy@6baed71]: Update ORES to 6baed71 (duration: 12m 54s) [21:26:21] !log ppchelko@tin Started deploy [cpjobqueue/deploy@c8bea2e]: (no justification provided) [21:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:39] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up [21:26:40] RECOVERY - Disk space on stat1005 is OK: DISK OK [21:26:50] RECOVERY - DPKG on stat1005 is OK: All packages OK [21:26:59] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [21:26:59] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [21:27:00] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient [21:27:03] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@c8bea2e]: (no justification provided) (duration: 00m 42s) [21:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:40] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [21:33:08] (03CR) 10Reedy: [C: 031] contint: a slave script will require 'jq' [puppet] - 10https://gerrit.wikimedia.org/r/395097 (https://phabricator.wikimedia.org/T181938) (owner: 10Hashar) [21:34:58] 10Operations, 10ORES, 10Release-Engineering-Team, 10Scoring-platform-team (Current): Connection timeout from tin to new ores servers - 
https://phabricator.wikimedia.org/T181661#3810860 (10mmodell) [21:35:19] (03CR) 10Krinkle: [C: 031] contint: a slave script will require 'jq' [puppet] - 10https://gerrit.wikimedia.org/r/395097 (https://phabricator.wikimedia.org/T181938) (owner: 10Hashar) [21:40:57] mutante: yeah, that's intentional. this is used for controlled tests with the video scalers on stretch [21:41:57] moritzm: gotcha, thx [21:42:38] (03CR) 10Dzahn: [C: 032] contint: a slave script will require 'jq' [puppet] - 10https://gerrit.wikimedia.org/r/395097 (https://phabricator.wikimedia.org/T181938) (owner: 10Hashar) [21:46:21] (03PS3) 10Dzahn: diadem/dysprosium: introduce skeleton role [puppet] - 10https://gerrit.wikimedia.org/r/394624 (https://phabricator.wikimedia.org/T169566) [21:52:04] (03CR) 10Dzahn: [C: 032] diadem/dysprosium: introduce skeleton role [puppet] - 10https://gerrit.wikimedia.org/r/394624 (https://phabricator.wikimedia.org/T169566) (owner: 10Dzahn) [21:54:20] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1005 is OK: OK: synced at Mon 2017-12-04 21:54:16 UTC. [21:54:58] (03PS3) 10Dzahn: labnodepool: move standard/firewall includes to role [puppet] - 10https://gerrit.wikimedia.org/r/392769 [22:00:04] dapatrick, bawolff, and Reedy: Time to snap out of that daydream and deploy Weekly Security deployment window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171204T2200). [22:00:05] No GERRIT patches in the queue for this window AFAICS. [22:00:29] when did jouncebot start trying to be funny? [22:01:04] awhile ago now [22:01:28] (03PS4) 10Dzahn: labnodepool: move standard/firewall includes to role [puppet] - 10https://gerrit.wikimedia.org/r/392769 [22:02:00] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:02:25] (03CR) 10Dzahn: [C: 032] labnodepool: move standard/firewall includes to role [puppet] - 10https://gerrit.wikimedia.org/r/392769 (owner: 10Dzahn) [22:08:09] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:08:41] !log brion running requeueTranscodes.php on terbium to batch-run .mp3 output on Commons (T181749) [22:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:53] T181749: Batch-generate mp3 audio transcodes for existing ogg/opus/wav/flac files - https://phabricator.wikimedia.org/T181749 [22:09:01] \o/ [22:09:04] our bot overlords [22:09:45] (03CR) 10Dzahn: "no-op on labnodepool1001 (the only one)" [puppet] - 10https://gerrit.wikimedia.org/r/392769 (owner: 10Dzahn) [22:10:14] (03PS1) 10Dzahn: Revert "contint: a slave script will require 'jq'" [puppet] - 10https://gerrit.wikimedia.org/r/395118 [22:10:30] (03PS2) 10Dzahn: Revert "contint: a slave script will require 'jq'" [puppet] - 10https://gerrit.wikimedia.org/r/395118 [22:10:57] (03CR) 10Dzahn: [C: 032] "this breaks puppet on contint1001 because package jq is already defined in "standard" , duplicate declaration" [puppet] - 10https://gerrit.wikimedia.org/r/395118 (owner: 10Dzahn) [22:13:09] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [22:21:57] (03PS1) 10Krinkle: multiversion: Update docs for 'wfShellWikiCmd' hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395120 [22:27:31] (03PS1) 10Dzahn: lvs esams: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/395125 (https://phabricator.wikimedia.org/T177225) [22:30:29] (03PS2) 10Subramanya Sastry: Enable RemexHTML on itwiki and dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395009 (https://phabricator.wikimedia.org/T181188) [22:30:31] (03PS1) 10Subramanya Sastry: 
Enable RemexHTML on wikis with zero high priority linter errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395139 (https://phabricator.wikimedia.org/T182042) [22:32:00] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [22:37:46] (03PS2) 10Dzahn: lvs esams: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/395125 (https://phabricator.wikimedia.org/T177225) [22:49:01] (03CR) 10Dzahn: [C: 032] lvs esams: remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/395125 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [23:02:31] (03PS1) 10Dzahn: lvs esams: fix regex to remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/395144 [23:03:34] (03PS2) 10Dzahn: lvs esams: fix regex to remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/395144 [23:10:52] (03CR) 10Dzahn: [C: 032] lvs esams: fix regex to remove ganglia [puppet] - 10https://gerrit.wikimedia.org/r/395144 (owner: 10Dzahn) [23:19:05] (03PS1) 10Dzahn: remove esams ganglia aggregator [puppet] - 10https://gerrit.wikimedia.org/r/395147 (https://phabricator.wikimedia.org/T177225) [23:20:44] (03CR) 10Krinkle: [C: 032] multiversion: Update docs for 'wfShellWikiCmd' hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395120 (owner: 10Krinkle) [23:21:51] no_justification: Got a dblists/.htaccess file untracked :) [23:22:02] Oh whoops I was testing something [23:22:05] Nuke it [23:22:09] (03Merged) 10jenkins-bot: multiversion: Update docs for 'wfShellWikiCmd' hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395120 (owner: 10Krinkle) [23:22:24] (03CR) 10jenkins-bot: multiversion: Update docs for 'wfShellWikiCmd' hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395120 (owner: 10Krinkle) [23:22:26] k [23:22:36] rm'ed [23:24:27] anomie: Can you re-verify the revert regarding $wgCommentTableSchemaMigrationStage? 
[23:24:40] I did git pull just now on tin and that brought in the actual revert
[23:24:50] (PS2) Dzahn: remove esams ganglia aggregator [puppet] - https://gerrit.wikimedia.org/r/395147 (https://phabricator.wikimedia.org/T177225)
[23:24:54] which suggests it wasn't before, unless it was deployed from elsewhere or rewound on tin.
[23:25:26] terbium says testwiki still has pre-revert state of $wgCommentTableSchemaMigrationStage = 1 (MIGRATION_WRITE_BOTH)
[23:28:22] !log anomie@tin Synchronized wmf-config/InitialiseSettings.php: Revert wgCommentTableSchemaMigrationStage change, breaks too much stuff (duration: 00m 44s)
[23:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:28:43] (PS1) Krinkle: Revert "Revert "Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on test wikis"" [mediawiki-config] - https://gerrit.wikimedia.org/r/395149
[23:28:53] (CR) Krinkle: [V: 2 C: 2] "Re-reverting because patch wasn't deployed." [mediawiki-config] - https://gerrit.wikimedia.org/r/395149 (owner: Krinkle)
[23:29:09] anomie: Sorry, missed by a minute.
[23:29:31] (CR) jenkins-bot: Revert "Revert "Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on test wikis"" [mediawiki-config] - https://gerrit.wikimedia.org/r/395149 (owner: Krinkle)
[23:29:37] anomie: I'll re-revert. Didn't realise you were still online.
[23:29:39] Thanks for the sync.
[23:30:28] (PS1) Krinkle: Re-apply "Revert "Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on test wikis"" [mediawiki-config] - https://gerrit.wikimedia.org/r/395150
[23:30:34] (CR) Krinkle: [C: 2] Re-apply "Revert "Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on test wikis"" [mediawiki-config] - https://gerrit.wikimedia.org/r/395150 (owner: Krinkle)
[23:31:12] (CR) Krinkle: [C: 2] "Deployed by Anomie." [mediawiki-config] - https://gerrit.wikimedia.org/r/395150 (owner: Krinkle)
[23:31:23] Operations, Release-Engineering-Team (Watching / External), User-Joe: [DRAFT][RfC] Deployment of python applications in production - https://phabricator.wikimedia.org/T180023#3811081 (greg)
[23:31:28] (CR) Dzahn: [C: 2] remove esams ganglia aggregator [puppet] - https://gerrit.wikimedia.org/r/395147 (https://phabricator.wikimedia.org/T177225) (owner: Dzahn)
[23:32:00] (Merged) jenkins-bot: Re-apply "Revert "Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on test wikis"" [mediawiki-config] - https://gerrit.wikimedia.org/r/395150 (owner: Krinkle)
[23:32:09] (PS3) Dzahn: remove esams ganglia aggregator [puppet] - https://gerrit.wikimedia.org/r/395147 (https://phabricator.wikimedia.org/T177225)
[23:32:44] (CR) jenkins-bot: Re-apply "Revert "Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on test wikis"" [mediawiki-config] - https://gerrit.wikimedia.org/r/395150 (owner: Krinkle)
[23:40:20] PROBLEM - DPKG on bast3002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[23:41:33] ^ me
[23:41:42] killing the first aggregator
[23:43:20] RECOVERY - DPKG on bast3002 is OK: All packages OK
[23:44:13] !log bast3002 - killall -u ganglia to kill all aggregator procs, apt-get remove --purge ganglia-monitor, rm -rf /etc/ganglia, rm -rf /usr/lib/ganglia, apt-get autoremove
[23:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:39] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table wikidatawiki.page_props: Cant find record in page_props, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1070-bin.001583, end_log_pos 135615650
[23:58:39] RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes