[00:53:51] (Abandoned) Zoranzoki21: Edited syntax of the code where is the content for user rights for mlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/464485 (owner: Zoranzoki21)
[02:20:13] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.4444 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[02:21:13] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.1749 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[02:21:22] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[02:22:23] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[02:37:12] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.3974 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[02:38:12] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.2842 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[02:39:22] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[02:40:23] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[02:56:36] (PS1) KartikMistry: apertium-cat: New upstream release [debs/contenttranslation/apertium-cat] - https://gerrit.wikimedia.org/r/468909 (https://phabricator.wikimedia.org/T189076)
[02:57:08] (CR) jerkins-bot: [V: -1] apertium-cat: New upstream release [debs/contenttranslation/apertium-cat] - https://gerrit.wikimedia.org/r/468909 (https://phabricator.wikimedia.org/T189076) (owner: KartikMistry)
[03:09:13] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.1082 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[03:09:22] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.1211 ge
0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[03:13:43] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[03:13:52] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0.03924 https://grafana.wikimedia.org/dashboard/db/logstash
[03:31:13] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 880.48 seconds
[03:52:23] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 241.86 seconds
[04:04:02] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.4357 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[04:04:03] (PS1) KartikMistry: apertium-cat: New upstream release [debs/contenttranslation/apertium-fra] - https://gerrit.wikimedia.org/r/468913 (https://phabricator.wikimedia.org/T189076)
[04:04:05] (CR) jerkins-bot: [V: -1] apertium-cat: New upstream release [debs/contenttranslation/apertium-fra] - https://gerrit.wikimedia.org/r/468913 (https://phabricator.wikimedia.org/T189076) (owner: KartikMistry)
[04:05:12] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[04:23:09] the upload wizard on Commons needs update with more languages
[04:23:30] can this be done by admins?
[04:24:00] is there a tracking phab ticket about this?
[04:31:52] !log kartik@deploy1001 Started deploy [cxserver/deploy@904151f]: Update cxserver to eee8974 (T207070, T203077, T199529)
[04:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:31:59] T199529: Do not try to adapt the transclusion fragements - https://phabricator.wikimedia.org/T199529
[04:32:00] T207070: Link elements with data-mw attributes are adapted as Links instead of templates - https://phabricator.wikimedia.org/T207070
[04:32:00] T203077: Performance analysis for translate API - https://phabricator.wikimedia.org/T203077
[04:37:35] !log kartik@deploy1001 Finished deploy [cxserver/deploy@904151f]: Update cxserver to eee8974 (T207070, T203077, T199529) (duration: 05m 42s)
[04:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:37:40] T199529: Do not try to adapt the transclusion fragements - https://phabricator.wikimedia.org/T199529
[04:37:41] T207070: Link elements with data-mw attributes are adapted as Links instead of templates - https://phabricator.wikimedia.org/T207070
[04:37:41] T203077: Performance analysis for translate API - https://phabricator.wikimedia.org/T203077
[05:00:23] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[05:02:33] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[05:15:13] PROBLEM - Device not healthy -SMART- on db2061 is CRITICAL: cluster=mysql device=cciss,1 instance=db2061:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2061&var-datasource=codfw%2520prometheus%252Fops
[05:21:33] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2061 is CRITICAL:
cluster=mysql device=cciss,1 instance=db2061:9100 job=node site=codfw Marostegui lets wait for the disk to finally fail before replacing it - The acknowledgement expires at: 2018-11-13 05:21:08. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2061&var-datasource=codfw%2520prometheus%252Fops
[05:24:40] (PS1) Marostegui: db-codfw.php: Clarify db2033 BBU status [mediawiki-config] - https://gerrit.wikimedia.org/r/468914 (https://phabricator.wikimedia.org/T184888)
[05:26:56] (CR) Marostegui: [C: 2] db-codfw.php: Clarify db2033 BBU status [mediawiki-config] - https://gerrit.wikimedia.org/r/468914 (https://phabricator.wikimedia.org/T184888) (owner: Marostegui)
[05:28:10] (Merged) jenkins-bot: db-codfw.php: Clarify db2033 BBU status [mediawiki-config] - https://gerrit.wikimedia.org/r/468914 (https://phabricator.wikimedia.org/T184888) (owner: Marostegui)
[05:29:17] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Clarify db2033 BBU status (duration: 00m 49s)
[05:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:25] (CR) jenkins-bot: db-codfw.php: Clarify db2033 BBU status [mediawiki-config] - https://gerrit.wikimedia.org/r/468914 (https://phabricator.wikimedia.org/T184888) (owner: Marostegui)
[05:31:49] !log Deploy schema change on dbstore2002:3313 - T204006
[05:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:52] T204006: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006
[05:36:03] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on einsteinium is OK: (C)130 ge (W)110 ge 93.59 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[05:47:35] !log Deploy schema change on s3 db2074 (and db2094 sanitarium) - T204006
[05:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:38] T204006: Execute the schema
change for Partial Blocks - https://phabricator.wikimedia.org/T204006
[05:56:55] (PS1) Elukey: eventlogging: don't send logs to syslog [puppet] - https://gerrit.wikimedia.org/r/468915
[05:57:39] (PS2) Elukey: eventlogging: don't send logs to the syslog logfile [puppet] - https://gerrit.wikimedia.org/r/468915
[06:00:09] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/13123/" [puppet] - https://gerrit.wikimedia.org/r/468915 (owner: Elukey)
[06:00:28] !log Deploy schema change on db2057 - T204006
[06:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:31] T204006: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006
[06:11:41] !log Deploy schema change on db2050 - T204006
[06:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:44] T204006: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006
[06:29:02] PROBLEM - puppet last run on mw1308 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean]
[06:30:32] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh]
[06:31:33] PROBLEM - puppet last run on labmon1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/etc/logrotate.d/prometheus]
[06:34:44] !log Deploy schema change on db2036 - T204006
[06:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:47] T204006: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006
[06:39:36] <_joe_> uhm
[06:39:38] (PS3) Giuseppe Lavagetto: mediawiki::web::beta_sites: convert to using mediawiki::web::vhost [puppet] - https://gerrit.wikimedia.org/r/468583
[06:39:40] (PS1) Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikimania too [puppet] - https://gerrit.wikimedia.org/r/468929
[06:39:42] (PS1) Giuseppe Lavagetto: mediawiki: remove now unused common includes [puppet] - https://gerrit.wikimedia.org/r/468930
[06:39:44] (PS1) Giuseppe Lavagetto: [PoC] How I'd like the mediawiki vhosts to be [puppet] - https://gerrit.wikimedia.org/r/468931
[06:39:45] <_joe_> wikibugs has lag
[06:39:48] <_joe_> elukey: ^^
[06:40:13] ah the code review, I was puzzled for a second :D
[06:40:14] (CR) jerkins-bot: [V: -1] mediawiki::web::prod_sites: convert wikimania too [puppet] - https://gerrit.wikimedia.org/r/468929 (owner: Giuseppe Lavagetto)
[06:40:55] (CR) jerkins-bot: [V: -1] [PoC] How I'd like the mediawiki vhosts to be [puppet] - https://gerrit.wikimedia.org/r/468931 (owner: Giuseppe Lavagetto)
[06:54:38] (CR) Muehlenhoff: [C: 1] scap::scripts: conditionally require mediawiki::users [puppet] - https://gerrit.wikimedia.org/r/468563 (https://phabricator.wikimedia.org/T207487) (owner: Elukey)
[06:56:02] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:57:03] RECOVERY - puppet last run on labmon1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:50] (Abandoned) Elukey: scap::scripts: conditionally require mediawiki::users [puppet] -
https://gerrit.wikimedia.org/r/468563 (https://phabricator.wikimedia.org/T207487) (owner: Elukey)
[06:59:33] RECOVERY - puppet last run on mw1308 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:02:59] (CR) Muehlenhoff: [C: 1] "Looks good, I also doublechecked dashboards and with one exception none of them use any metrics collected via Diamond. The exception is "m" [puppet] - https://gerrit.wikimedia.org/r/466905 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite)
[07:12:53] PROBLEM - IPMI Sensor Status on scb2001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]
[07:19:33] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.1544 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[07:22:52] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[07:24:52] !log reformat ms-be2042 - T199198
[07:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:56] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198
[07:30:51] Operations, ops-codfw: scb2001: Power supply failure - https://phabricator.wikimedia.org/T207629 (MoritzMuehlenhoff)
[07:31:14] ACKNOWLEDGEMENT - IPMI Sensor Status on scb2001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Muehlenhoff T207629
[07:38:58] !log rebooting swift-be servers in eqiad for kernel security update
[07:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:17] (CR) Muehlenhoff: "@Banyek: That PCC output is the expected outcome, for background: We're deprecating Diamond for metrics collection in favour of Prometheus" [puppet] - https://gerrit.wikimedia.org/r/467264 (https://phabricator.wikimedia.org/T183454)
(owner: Muehlenhoff)
[07:52:41] !log Disconnect codfw -> eqiad replication on s1 (db1067)
[07:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:17] !log Disconnect codfw -> eqiad replication on s2 (db1066)
[07:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:43] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[07:58:27] (PS2) Giuseppe Lavagetto: mediawiki: remove base class, superseded by profile::mediawiki::common [puppet] - https://gerrit.wikimedia.org/r/468253
[07:59:44] !log Disconnect codfw -> eqiad replication on s4 (db1068)
[07:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:53] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[08:01:15] (CR) Giuseppe Lavagetto: [C: 2] mediawiki: remove base class, superseded by profile::mediawiki::common [puppet] - https://gerrit.wikimedia.org/r/468253 (owner: Giuseppe Lavagetto)
[08:01:41] !log Disconnect codfw -> eqiad replication on s6 (db1061)
[08:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:10] !log Disconnect codfw -> eqiad replication on s7 (db1062)
[08:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:32] !log Disconnect codfw -> eqiad replication on s8 (db1071)
[08:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:00] !log Disconnect codfw -> eqiad replication on x1 (db1069)
[08:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:02] !log Disconnect codfw -> eqiad replication on es2
(es1015)
[08:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:01] (PS1) Elukey: mcrouter: add the probe_delay_initial_ms parameter [puppet] - https://gerrit.wikimedia.org/r/468935 (https://phabricator.wikimedia.org/T203786)
[08:13:08] !log Disconnect codfw -> eqiad replication on es3 (es1017)
[08:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:31] (CR) Volans: "@akosiaris, @gehel, @_joe_: what are your thoughts on this one? This quarterly goal will create additional cookbooks and will be nice to s" (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/458115 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[08:16:13] Operations, Gadgets, MediaWiki-Cache, MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), and 4 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (elukey) Very interesting discovery today. The probe_delay_initial_ms...
[08:16:38] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/13124/mw1347.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/468935 (https://phabricator.wikimedia.org/T203786) (owner: Elukey)
[08:19:55] !log Disconnect codfw -> eqiad replication on s3 (db1075)
[08:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:49] Operations, Cloud-VPS, Patch-For-Review: Move labs-recursors in WMCS - https://phabricator.wikimedia.org/T207533 (MoritzMuehlenhoff) >>! In T207533#4683780, @Andrew wrote: > My only concern about this is that those recursors are used about every second on every VM, so they're a huge, vital point of f...
[08:21:02] RECOVERY - Filesystem available is greater than filesystem size on ms-be2042 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2042&var-datasource=codfw%2520prometheus%252Fops
[08:22:22] !log Disconnect codfw -> eqiad replication on s5 (db1070)
[08:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:42] (PS1) Marostegui: db-eqiad.php: Depool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/468936 (https://phabricator.wikimedia.org/T184805)
[08:27:44] !log Deploy schema change on db2043 (s3 master) without replication - T204006
[08:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:48] T204006: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006
[08:28:35] !log performing deletes on db1087 to fix wb_terms on labs
[08:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:48] !log powercycling ms-be1018, stuck during reboot
[08:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:13] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/468936 (https://phabricator.wikimedia.org/T184805) (owner: Marostegui)
[08:31:33] (Merged) jenkins-bot: db-eqiad.php: Depool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/468936 (https://phabricator.wikimedia.org/T184805) (owner: Marostegui)
[08:32:59] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1100 - T184805 (duration: 00m 47s)
[08:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:02] T184805: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805
[08:37:11] (CR) jenkins-bot: db-eqiad.php: Depool db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/468936 (https://phabricator.wikimedia.org/T184805) (owner: Marostegui)
[08:38:32] Operations, Cloud-VPS, Patch-For-Review: Move labs-recursors in WMCS -
https://phabricator.wikimedia.org/T207533 (Krenair) >>! In T207533#4684756, @MoritzMuehlenhoff wrote: >>>! In T207533#4683780, @Andrew wrote: >> My only concern about this is that those recursors are used about every second on eve...
[08:38:43] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.3187 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[08:38:53] _joe_ I applied the change manually to mw2210, looks good
[08:39:02] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.3703 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[08:39:24] I'll take a look at the logstash loss
[08:39:52] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[08:40:12] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[08:46:53] (PS1) Giuseppe Lavagetto: scap::scripts: remove unneeded include of mediawiki::users [puppet] - https://gerrit.wikimedia.org/r/468937
[08:46:58] <_joe_> elukey: ^^
[08:49:02] (CR) Elukey: [C: 1] scap::scripts: remove unneeded include of mediawiki::users [puppet] - https://gerrit.wikimedia.org/r/468937 (owner: Giuseppe Lavagetto)
[08:49:06] \o/
[08:49:32] can I merge?
[08:52:24] <_joe_> elukey: do a compiler run just in case?
[08:53:50] _joe_ already done and added to the code review
[08:53:57] (PS4) Alex Monk: mediawiki::web::beta_sites: convert to using mediawiki::web::vhost [puppet] - https://gerrit.wikimedia.org/r/468583 (https://phabricator.wikimedia.org/T1256) (owner: Giuseppe Lavagetto)
[08:53:59] https://gerrit.wikimedia.org/r/468935
[08:54:18] ahh snap sorry for the scap::scripts
[08:54:25] brain panic
[08:54:29] okok going to do it
[08:58:03] !log Stop replication in sync on db1100 and db2052 (codfw master) to reimport wikis - T184805
[08:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:07] T184805: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805
[08:58:16] <_joe_> Krenair: I intend to test that in beta in a few
[08:58:30] (PS1) Filippo Giunchedi: logstash: bump default receive buffer [puppet] - https://gerrit.wikimedia.org/r/468938 (https://phabricator.wikimedia.org/T200960)
[08:59:43] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.5996 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[09:00:03] _joe_ looks good https://puppet-compiler.wmflabs.org/compiler1002/13126/
[09:00:20] (CR) Elukey: [C: 1] "https://puppet-compiler.wmflabs.org/compiler1002/13126/" [puppet] - https://gerrit.wikimedia.org/r/468937 (owner: Giuseppe Lavagetto)
[09:01:08] (CR) Elukey: [C: 2] scap::scripts: remove unneeded include of mediawiki::users [puppet] - https://gerrit.wikimedia.org/r/468937 (owner: Giuseppe Lavagetto)
[09:02:00] (CR) Elukey: [C: 2] deployment-prep: add turnilo to scap repos on the deploy host [puppet] - https://gerrit.wikimedia.org/r/468568 (owner: Elukey)
[09:02:02] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0.01224 https://grafana.wikimedia.org/dashboard/db/logstash
[09:02:08] Operations, Cloud-VPS, cloud-services-team (Kanban): Move labmon (Graphite, StatsD) into a Cloud VPS -
https://phabricator.wikimedia.org/T207543 (aborrero) p:Triage>Lowest We have dedicated hardware for this, in the case of labmon1002, a fairly new server which was put into works some months...
[09:02:10] (CR) Filippo Giunchedi: [C: 2] logstash: bump default receive buffer [puppet] - https://gerrit.wikimedia.org/r/468938 (https://phabricator.wikimedia.org/T200960) (owner: Filippo Giunchedi)
[09:02:18] (PS2) Filippo Giunchedi: logstash: bump default receive buffer [puppet] - https://gerrit.wikimedia.org/r/468938 (https://phabricator.wikimedia.org/T200960)
[09:03:31] gehel: FYI, looks like wdqs is spamming with ERRORs in the form of Invalid format for the WKT value:
[09:03:50] I'm bumping the receive buffers on logstash but likely that'll help only partially
[09:04:17] Operations, Cloud-VPS: Move various support services for Cloud VPS currently in prod into their own instances - https://phabricator.wikimedia.org/T207536 (aborrero) For the record, I don't think we should move openstack components inside openstack just to avoid complex chicken-egg problems. Specifically...
[09:04:23] !log Run mydumper on db1100 for enwikivoyage cebwiki shwiki srwiki mgwiktionary - T184805
[09:04:24] <_joe_> gehel: ^^
[09:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:26] T184805: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805
[09:04:31] <_joe_> and onimisionipe too :)
[09:05:20] * onimisionipe is looking
[09:05:23] (PS2) Elukey: deployment-prep: add turnilo to scap repos on the deploy host [puppet] - https://gerrit.wikimedia.org/r/468568
[09:08:04] (PS3) Elukey: deployment-prep: add turnilo to scap repos on the deploy host [puppet] - https://gerrit.wikimedia.org/r/468568
[09:09:18] _joe_, cool
[09:09:31] was just putting the task no.
in the commit message :)
[09:10:03] <_joe_> Krenair: I'll cherry-pick this change, deploy it to one of the servers, if all looks good, I'll merge it and deploy it
[09:10:50] Puppet, Beta-Cluster-Infrastructure, Patch-For-Review: Puppet broken on deployment-deploy* - https://phabricator.wikimedia.org/T207487 (elukey) Open>Resolved Fixed by Joe with https://gerrit.wikimedia.org/r/468937
[09:11:32] PROBLEM - MariaDB Slave Lag: s5 on db2059 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 681.29 seconds
[09:11:42] PROBLEM - MariaDB Slave Lag: s5 on db2066 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 691.26 seconds
[09:11:43] PROBLEM - MariaDB Slave Lag: s5 on db2038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 695.31 seconds
[09:11:45] that is me
[09:11:51] I downtimed the wrong section, sorry!
[09:11:53] PROBLEM - MariaDB Slave Lag: s5 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 705.25 seconds
[09:11:53] PROBLEM - MariaDB Slave Lag: s5 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 705.36 seconds
[09:12:03] PROBLEM - MariaDB Slave Lag: s5 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 713.67 seconds
[09:12:06] actually not
[09:12:09] I did downtime them
[09:12:12] PROBLEM - MariaDB Slave Lag: s5 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 724.28 seconds
[09:12:13] PROBLEM - MariaDB Slave Lag: s5 on db2075 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 724.29 seconds
[09:12:23] PROBLEM - MariaDB Slave Lag: s5 on db2052 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 734.37 seconds
[09:12:42] (PS2) Elukey: mcrouter: add the probe_delay_initial_ms parameter [puppet] - https://gerrit.wikimedia.org/r/468935 (https://phabricator.wikimedia.org/T203786)
[09:13:12] ah no, my regex was wrong
[09:13:14] well done
[09:16:45] !log Remove replication filters from db2052 (s5 codfw master) - T184805
[09:16:48] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:49] T184805: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805
[09:18:29] Operations, DBA, cloud-services-team, wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (Marostegui) >>! In T184805#4654953, @Marostegui wrote: > This was done successfully and new wikis are now live on eqiad. > What is pending now is: > - Run...
[09:21:11] godog: We should probably throttle these errors more. I will gehel and SMalyshev on this.
[09:21:18] Thanks!
[09:22:06] indeed, thanks onimisionipe !
[09:22:10] Operations, Traffic: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (ema)
[09:22:13] Operations, Traffic, Patch-For-Review: Define and deploy Icinga checks for ATS backends - https://phabricator.wikimedia.org/T204209 (ema)
[09:23:16] Operations, Traffic: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (ema)
[09:23:19] Operations, Traffic, Patch-For-Review: Package and deploy ATS v8.x - https://phabricator.wikimedia.org/T204232 (ema)
[09:24:20] Operations, Traffic: ATS: log inspection at runtime - https://phabricator.wikimedia.org/T204225 (ema)
[09:24:23] Operations, Traffic: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (ema)
[09:25:42] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.5691 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[09:26:43] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[09:26:47] (CR) Giuseppe Lavagetto: [C: 2] "Cherry-picked on beta, a smoke test of some important basic urls works as expected (via apache-fast-test). Merging."
[puppet] - https://gerrit.wikimedia.org/r/468583 (https://phabricator.wikimedia.org/T1256) (owner: Giuseppe Lavagetto)
[09:26:59] (PS5) Giuseppe Lavagetto: mediawiki::web::beta_sites: convert to using mediawiki::web::vhost [puppet] - https://gerrit.wikimedia.org/r/468583 (https://phabricator.wikimedia.org/T1256)
[09:34:01] (PS3) Marostegui: site.pp: Comment fixes due to dewiki no longer being the only s5 wiki [puppet] - https://gerrit.wikimedia.org/r/464797 (https://phabricator.wikimedia.org/T184805) (owner: Jcrespo)
[09:34:56] (CR) Marostegui: [C: 2] site.pp: Comment fixes due to dewiki no longer being the only s5 wiki [puppet] - https://gerrit.wikimedia.org/r/464797 (https://phabricator.wikimedia.org/T184805) (owner: Jcrespo)
[09:39:47] Operations, Cloud-VPS, netops, cloud-services-team (Kanban): dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596 (aborrero) >>! In T174596#4681328, @ayounsi wrote: > This is also the reason we have to have the...
[09:40:40] (CR) Giuseppe Lavagetto: [C: 1] mcrouter: add the probe_delay_initial_ms parameter [puppet] - https://gerrit.wikimedia.org/r/468935 (https://phabricator.wikimedia.org/T203786) (owner: Elukey)
[09:42:24] (PS3) Elukey: mcrouter: add the probe_delay_initial_ms parameter [puppet] - https://gerrit.wikimedia.org/r/468935 (https://phabricator.wikimedia.org/T203786)
[09:43:46] going to disabled puppet on most of the mw servers for --^
[09:43:56] if it is a problem, lemme know
[09:44:29] (CR) Elukey: [C: 2] mcrouter: add the probe_delay_initial_ms parameter [puppet] - https://gerrit.wikimedia.org/r/468935 (https://phabricator.wikimedia.org/T203786) (owner: Elukey)
[09:47:00] Operations, DBA, cloud-services-team, wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (jcrespo) a:jcrespo>Marostegui
[09:50:08] re-enabling puppet on mw2*
[09:58:08] (PS1) Arturo Borrero Gonzalez: cloudvps: eqiad1: exclude SNAT between VMs contacting floating IPs [puppet] - https://gerrit.wikimedia.org/r/468940 (https://phabricator.wikimedia.org/T206261)
[09:59:32] Operations, Discovery, Wikidata, Wikidata-Query-Service: Sanitizing input and increase throttling rate for wdqs errors to prevent spamming logstash - https://phabricator.wikimedia.org/T207643 (Mathew.onipe)
[10:00:36] (PS2) Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikimania too [puppet] - https://gerrit.wikimedia.org/r/468929
[10:03:57] !log icinga downtime for cloudnet1003/4 for T206261
[10:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:00] T206261: Routing RFC1918 private IP addresses to/from WMCS floating IPs - https://phabricator.wikimedia.org/T206261
[10:04:09] (CR) Arturo Borrero Gonzalez: [C: 2] cloudvps: eqiad1: exclude SNAT between VMs contacting floating IPs [puppet] - https://gerrit.wikimedia.org/r/468940
(https://phabricator.wikimedia.org/T206261) (owner: Arturo Borrero Gonzalez)
[10:07:12] (PS1) Elukey: eventlogging: remove upstart/Trusty and Jessie support [puppet] - https://gerrit.wikimedia.org/r/468941
[10:07:49] (CR) Filippo Giunchedi: [C: 1] eventlogging: don't send logs to the syslog logfile [puppet] - https://gerrit.wikimedia.org/r/468915 (owner: Elukey)
[10:08:11] mobrovac: o/ - is it ok for you if I merge --^ ?
[10:08:20] (CR) Giuseppe Lavagetto: [C: 2] mediawiki::web::prod_sites: convert wikimania too [puppet] - https://gerrit.wikimedia.org/r/468929 (owner: Giuseppe Lavagetto)
[10:08:29] (PS3) Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikimania too [puppet] - https://gerrit.wikimedia.org/r/468929
[10:08:41] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/13128/ - no op!" [puppet] - https://gerrit.wikimedia.org/r/468941 (owner: Elukey)
[10:10:21] (PS2) Elukey: eventlogging: remove upstart/Trusty and Jessie support [puppet] - https://gerrit.wikimedia.org/r/468941
[10:11:55] <_joe_> elukey: uh puppet still disabled everywhere?
[10:12:07] <_joe_> I saw you working on other things [10:12:25] <_joe_> wait before enabling please [10:12:26] moritzm: its onimisionipe :) [10:12:55] _joe_ I am slowly running puppet in codfw, eqiad still disabled [10:13:26] in the meantime I am working on other changes yes [10:14:14] (03PS3) 10Elukey: eventlogging: remove upstart/Trusty and Jessie support [puppet] - 10https://gerrit.wikimedia.org/r/468941 [10:14:15] <_joe_> ok, I did merge a change that needs to go to mediawikis [10:14:22] <_joe_> and it clearly interacted with yours [10:14:29] <_joe_> tell me when you're done [10:14:52] PROBLEM - MegaRAID on analytics1029 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) [10:14:54] ACKNOWLEDGEMENT - MegaRAID on analytics1029 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T207644 [10:14:59] 10Operations, 10ops-eqiad: Degraded RAID on analytics1029 - https://phabricator.wikimedia.org/T207644 (10ops-monitoring-bot) [10:15:00] nuooooo [10:15:26] luckily that one is going to be decommed soon :) [10:16:51] onimisionipe: sorry, fixing :-) [10:17:04] (03CR) 10Filippo Giunchedi: [C: 04-1] hiera: diamond::remove on openstack control role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [10:18:11] (03PS2) 10Giuseppe Lavagetto: mediawiki: remove now unused common includes [puppet] - 10https://gerrit.wikimedia.org/r/468930 [10:18:29] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: add scrapes for apache on phabricator instances [puppet] - 10https://gerrit.wikimedia.org/r/468677 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [10:20:40] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on analytics1029 - https://phabricator.wikimedia.org/T207644 (10elukey) p:05Triage>03Normal [10:20:57] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on analytics1029 - https://phabricator.wikimedia.org/T207644 (10elukey) 
This host needs to be decommed soon, so let's not replace the disk. [10:28:05] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, and 2 others: Routing RFC1918 private IP addresses to/from WMCS floating IPs - https://phabricator.wikimedia.org/T206261 (10aborrero) I just discovered that applying the patch (and running puppet in both cloudnets at the same time) resulted in bot... [10:30:04] jan_drewniak: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181022T1030). [10:30:46] _joe_ codfw completed [10:33:25] (03PS4) 10Elukey: eventlogging: remove upstart/Trusty and Jessie support [puppet] - 10https://gerrit.wikimedia.org/r/468941 [10:33:53] (03CR) 10Elukey: [C: 032] eventlogging: remove upstart/Trusty and Jessie support [puppet] - 10https://gerrit.wikimedia.org/r/468941 (owner: 10Elukey) [10:34:09] (03PS3) 10Elukey: eventlogging: don't send logs to the syslog logfile [puppet] - 10https://gerrit.wikimedia.org/r/468915 [10:34:47] (03CR) 10Elukey: [C: 032] eventlogging: don't send logs to the syslog logfile [puppet] - 10https://gerrit.wikimedia.org/r/468915 (owner: 10Elukey) [10:42:46] <_joe_> elukey: ack [10:43:06] <_joe_> elukey: it's now running in eqiad, right [10:45:57] _joe_ I haven't started it in eqiad, wanted your ack first :) [10:46:13] <_joe_> elukey: wait please, I found a wtf [10:46:34] <_joe_> most likely not my fault, but still better to reproduce long-lived WTFs [10:46:54] sure! 
[10:47:12] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.5416 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [10:47:26] <_joe_> uh this again [10:47:27] (03PS2) 10Filippo Giunchedi: Rebuild for jessie-wikimedia [debs/rsyslog] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/468600 (https://phabricator.wikimedia.org/T206633) [10:47:29] (03PS2) 10Filippo Giunchedi: Drop mongodb/relp/czmq integrations, not used at WMF and missing/old from jessie(-backports) [debs/rsyslog] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/468601 (https://phabricator.wikimedia.org/T206633) [10:47:31] (03PS2) 10Filippo Giunchedi: Build-depend on newer librdkafka 0.11 [debs/rsyslog] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/468602 (https://phabricator.wikimedia.org/T206633) [10:47:38] (03PS2) 10Filippo Giunchedi: Enable mmkubernetes (build depends on libcurl and liblognorm) [debs/rsyslog] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/468603 (https://phabricator.wikimedia.org/T206633) [10:47:45] <_joe_> godog: still wdqs? [10:47:59] <_joe_> and, how can I tell what's causing the issue? 
[10:48:34] <_joe_> oh ok, the kibana homepage is pretty clear [10:49:23] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [10:50:01] indeed still wdqs _joe_ [10:52:23] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-fgiunchedi: Setup rsyslog to be able to produce logs to Kafka - https://phabricator.wikimedia.org/T206633 (10fgiunchedi) [10:53:02] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-fgiunchedi: Setup Kafka cluster, producers and consumers for logging pipeline - https://phabricator.wikimedia.org/T206454 (10fgiunchedi) [10:59:48] <_joe_> elukey: ok my spelunking expedition tells me the error is there since 2005, so I need to reproduce it [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy European Mid-day SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181022T1100). [11:00:04] Urbanecm, kart_, and Zoranzoki21: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:12] present [11:00:13] o/ [11:00:16] ;) [11:00:43] I can swat today [11:01:38] zeljkof: around. [11:02:16] (03PS1) 10Giuseppe Lavagetto: wikimania: use "wikipedia" for upload redirects [puppet] - 10https://gerrit.wikimedia.org/r/468956 [11:02:34] (03CR) 10Giuseppe Lavagetto: [C: 032] wikimania: use "wikipedia" for upload redirects [puppet] - 10https://gerrit.wikimedia.org/r/468956 (owner: 10Giuseppe Lavagetto) [11:02:50] Urbanecm, kart_, and Zoranzoki21: anything urgent? or can I deploy in the calendar order? [11:03:07] _joe_ so shall I run puppet in eqiad? 
[11:03:14] zeljkof: Nothing urgent [11:03:21] <_joe_> once this is merged, yes [11:03:36] zeljkof, I think you can use calendar's order [11:04:29] zeljkof: nothing urgent. [11:04:32] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468824 (https://phabricator.wikimedia.org/T207589) (owner: 10Urbanecm) [11:04:42] ok, deploying in calendar order then [11:06:35] (03Merged) 10jenkins-bot: Anniversary logo for cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468824 (https://phabricator.wikimedia.org/T207589) (owner: 10Urbanecm) [11:06:52] !log zfilipin@deploy1001 sync-file aborted: SWAT: [[gerrit:467412|Test if logo specified in wgLogo/wgLogoHD exists (T207053)]] (duration: 00m 02s) [11:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:56] T207053: Test if logo specified in wgLogo/wgLogoHD exists - https://phabricator.wikimedia.org/T207053 [11:07:15] oops, got the wrong line from history, aborted [11:07:47] Urbanecm: 468824 at mwdebug1002 [11:07:55] testing [11:08:20] zeljkof, please push to prod [11:08:24] Urbanecm: ok [11:09:17] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/468532 (https://phabricator.wikimedia.org/T207470) (owner: 10Hashar) [11:09:20] !log zfilipin@deploy1001 Synchronized static/images/project-logos/: SWAT: [[gerrit:468824|Anniversary logo for cswiki (T207589)]] (duration: 00m 47s) [11:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:24] T207589: Anniversary logo for Czech Wikipedia as celebration of state anniversary - https://phabricator.wikimedia.org/T207589 [11:10:12] Urbanecm: deployed and purged T207589#4685275 [11:10:15] thx [11:11:22] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467870 (https://phabricator.wikimedia.org/T207031) (owner: 10KartikMistry) [11:11:26] (03PS1) 10Elukey: eventlogging::analytics: move to profiles 
[puppet] - 10https://gerrit.wikimedia.org/r/468957 [11:12:21] (03CR) 10Zfilipin: Enable cx2outreach campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467870 (https://phabricator.wikimedia.org/T207031) (owner: 10KartikMistry) [11:12:27] (03PS4) 10Zfilipin: Enable cx2outreach campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467870 (https://phabricator.wikimedia.org/T207031) (owner: 10KartikMistry) [11:12:37] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467870 (https://phabricator.wikimedia.org/T207031) (owner: 10KartikMistry) [11:13:58] (03Merged) 10jenkins-bot: Enable cx2outreach campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467870 (https://phabricator.wikimedia.org/T207031) (owner: 10KartikMistry) [11:15:21] (03PS2) 10Elukey: eventlogging::analytics: move to profiles [puppet] - 10https://gerrit.wikimedia.org/r/468957 [11:16:20] kart_: 467870 is at mwdebug1002 [11:16:55] zeljkof: hard to test, but still checking. [11:17:41] Zoranzoki21: please stand by, you're next [11:18:01] zeljkof: sure [11:18:33] 10Operations, 10Traffic, 10Patch-For-Review: Define and deploy Icinga checks for ATS backends - https://phabricator.wikimedia.org/T204209 (10ema) [11:18:47] zeljkof: what's URL for fatal on mwdebug1002? [11:18:59] (just in case, also for future ref) [11:19:15] (03PS3) 10Elukey: eventlogging::analytics: move to profiles [puppet] - 10https://gerrit.wikimedia.org/r/468957 [11:19:19] kart_: https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Browser_tabs [11:19:28] https://logstash.wikimedia.org/app/kibana#/dashboard/mwdebug1002 [11:20:14] zeljkof: Thanks. We are good to go. 
[11:20:28] kart_: ok, deploying [11:20:38] (03CR) 10jenkins-bot: Anniversary logo for cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468824 (https://phabricator.wikimedia.org/T207589) (owner: 10Urbanecm) [11:20:40] (03CR) 10jenkins-bot: Enable cx2outreach campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467870 (https://phabricator.wikimedia.org/T207031) (owner: 10KartikMistry) [11:21:03] zeljkof: o/ - there is a mcrouter (mw proxy to memcached) change in progress that I am slowly rolling out, I started before swat and didn't realize that I was about to stumble on your feet. I am seeing some exceptions in memcached that are expected, apologies in advance [11:21:26] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:467870|Enable cx2outreach campaign (T207031)]] (duration: 00m 47s) [11:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:29] T207031: Create a cx2outreach campaign - https://phabricator.wikimedia.org/T207031 [11:21:43] elukey: ok, thanks for letting me know [11:21:50] kart_: it's deployed, please check [11:22:09] (03CR) 10Mathew.onipe: [C: 031] admin: add ssh key for Antoine Musso [puppet] - 10https://gerrit.wikimedia.org/r/468532 (https://phabricator.wikimedia.org/T207470) (owner: 10Hashar) [11:23:22] (03PS4) 10Elukey: eventlogging::analytics: move to profiles [puppet] - 10https://gerrit.wikimedia.org/r/468957 [11:23:33] zeljkof: thanks [11:24:37] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468075 (https://phabricator.wikimedia.org/T207300) (owner: 10Zoranzoki21) [11:24:53] (03CR) 10Zfilipin: Enable suppressredirect and markbotedit rights to rollbackers on it.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468075 (https://phabricator.wikimedia.org/T207300) (owner: 10Zoranzoki21) [11:24:58] (03PS4) 10Zfilipin: Enable suppressredirect and markbotedit rights to rollbackers on it.wikiversity 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/468075 (https://phabricator.wikimedia.org/T207300) (owner: 10Zoranzoki21) [11:25:27] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468075 (https://phabricator.wikimedia.org/T207300) (owner: 10Zoranzoki21) [11:25:27] (03CR) 10Effie Mouzeli: [C: 032] admin: add ssh key for Antoine Musso [puppet] - 10https://gerrit.wikimedia.org/r/468532 (https://phabricator.wikimedia.org/T207470) (owner: 10Hashar) [11:26:04] (03PS5) 10Elukey: eventlogging::analytics: move to profiles [puppet] - 10https://gerrit.wikimedia.org/r/468957 [11:26:05] (03PS3) 10Effie Mouzeli: admin: add ssh key for Antoine Musso [puppet] - 10https://gerrit.wikimedia.org/r/468532 (https://phabricator.wikimedia.org/T207470) (owner: 10Hashar) [11:26:43] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Additional ssh key for Antoine "hashar" Musso - https://phabricator.wikimedia.org/T207470 (10jijiki) p:05Triage>03Low a:03jijiki [11:26:46] (03Merged) 10jenkins-bot: Enable suppressredirect and markbotedit rights to rollbackers on it.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468075 (https://phabricator.wikimedia.org/T207300) (owner: 10Zoranzoki21) [11:28:12] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:468075|Enable suppressredirect and markbotedit rights to rollbackers on it.wikiversity (T207300)]] (duration: 00m 46s) [11:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:16] T207300: Suppressredirect and Markbotedit user rights to rollbackers on it.Wikiversity - https://phabricator.wikimedia.org/T207300 [11:28:23] Zoranzoki21: 468075 deployed, please test [11:28:32] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.2645 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [11:28:42] zeljkof: You can push that in production [11:28:52] PROBLEM - Packet loss ratio for UDP on 
logstash1008 is CRITICAL: 0.2734 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [11:28:56] it's already deployed, but please check production [11:29:00] (03PS6) 10Elukey: eventlogging::analytics: move to profiles [puppet] - 10https://gerrit.wikimedia.org/r/468957 [11:29:14] zeljkof: everything is ok. [11:29:44] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468079 (https://phabricator.wikimedia.org/T206936) (owner: 10Zoranzoki21) [11:29:50] (03CR) 10Zfilipin: Enable autopatroller, patroller and rollbacker rights on srwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468079 (https://phabricator.wikimedia.org/T206936) (owner: 10Zoranzoki21) [11:29:59] (03PS5) 10Zfilipin: Enable autopatroller, patroller and rollbacker rights on srwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468079 (https://phabricator.wikimedia.org/T206936) (owner: 10Zoranzoki21) [11:30:07] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468079 (https://phabricator.wikimedia.org/T206936) (owner: 10Zoranzoki21) [11:31:54] (03Merged) 10jenkins-bot: Enable autopatroller, patroller and rollbacker rights on srwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468079 (https://phabricator.wikimedia.org/T206936) (owner: 10Zoranzoki21) [11:32:50] (03CR) 10Elukey: "Diff looks wonderful: https://puppet-compiler.wmflabs.org/compiler1002/13135/eventlog1002.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/468957 (owner: 10Elukey) [11:32:53] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [11:33:13] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [11:33:13] Zoranzoki21: 468079 is at mwdebug1002 [11:34:33] zeljkof: testing [11:35:18] zeljkof: LGTM [11:35:44] (03CR) 10jenkins-bot: Enable 
suppressredirect and markbotedit rights to rollbackers on it.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468075 (https://phabricator.wikimedia.org/T207300) (owner: 10Zoranzoki21) [11:35:46] (03CR) 10jenkins-bot: Enable autopatroller, patroller and rollbacker rights on srwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468079 (https://phabricator.wikimedia.org/T206936) (owner: 10Zoranzoki21) [11:36:16] Zoranzoki21: ok, deploying [11:37:17] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:468079|Enable autopatroller, patroller and rollbacker rights on srwikiquote (T206936)]] (duration: 00m 49s) [11:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:21] T206936: Enable autopatroller, patroller and rollbacker rights on srwikiquote - https://phabricator.wikimedia.org/T206936 [11:37:54] Zoranzoki21: 468079 deployed, please test it [11:37:59] zeljkof: Testing in production [11:38:31] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Additional ssh key for Antoine "hashar" Musso - https://phabricator.wikimedia.org/T207470 (10jijiki) 05Open>03Resolved [11:38:47] zeljkof: sr.wikiquote loads very slow.. [11:39:39] zeljkof: done with my changes [11:39:43] Zoranzoki21: hm, I don't see anything in logs... [11:39:46] elukey: great! [11:39:59] zeljkof: Everything is ok now. 
Maybe it was something related to what elukey was working on [11:40:16] Zoranzoki21: https://sr.wikiquote.org/ loads fine for me [11:40:33] zeljkof: I know [11:40:38] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468080 (https://phabricator.wikimedia.org/T206935) (owner: 10Zoranzoki21) [11:40:38] zeljkof: Now is ok [11:40:45] cool [11:40:49] (03CR) 10Zfilipin: Enable rollbacker right on srwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468080 (https://phabricator.wikimedia.org/T206935) (owner: 10Zoranzoki21) [11:40:55] (03PS5) 10Zfilipin: Enable rollbacker right on srwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468080 (https://phabricator.wikimedia.org/T206935) (owner: 10Zoranzoki21) [11:41:02] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468080 (https://phabricator.wikimedia.org/T206935) (owner: 10Zoranzoki21) [11:41:52] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.2546 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [11:42:53] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [11:42:53] (03Merged) 10jenkins-bot: Enable rollbacker right on srwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468080 (https://phabricator.wikimedia.org/T206935) (owner: 10Zoranzoki21) [11:44:16] Zoranzoki21: 468080 is at mwdebug1002 [11:44:50] zeljkof: testing [11:47:20] zeljkof: LGTM [11:48:50] Zoranzoki21: ok, deploying [11:49:50] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:468080|Enable rollbacker right on srwikisource (T206935)]] (duration: 00m 46s) [11:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:54] T206935: Enable rollbacker right on srwikisource - https://phabricator.wikimedia.org/T206935 [11:50:51] Zoranzoki21: it's deployed, please test [11:51:18] 
everything is ok [11:51:22] !log eu swat finished [11:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:33] (03CR) 10jenkins-bot: Enable rollbacker right on srwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468080 (https://phabricator.wikimedia.org/T206935) (owner: 10Zoranzoki21) [11:59:02] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production servers (mwlog*, mmaint* ?) for kharlan - https://phabricator.wikimedia.org/T207330 (10jijiki) a:03jijiki [12:00:09] (03CR) 10Muehlenhoff: [C: 031] admins: add kharlan to 'restricted' group [puppet] - 10https://gerrit.wikimedia.org/r/468327 (https://phabricator.wikimedia.org/T207330) (owner: 10Dzahn) [12:02:07] (03PS1) 10DCausse: Disable spammy wdqs logger [puppet] - 10https://gerrit.wikimedia.org/r/468958 (https://phabricator.wikimedia.org/T207643) [12:08:05] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on analytics1029 - https://phabricator.wikimedia.org/T207644 (10elukey) @Ottomata one nice thing that I forgot - we can keep running this host without stopping the HDFS daemons due to: ``` dfs.datanode.failed.volumes.tolerated PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.2108 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [12:15:33] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [12:16:53] (03CR) 10Gehel: [C: 032] Disable spammy wdqs logger [puppet] - 10https://gerrit.wikimedia.org/r/468958 (https://phabricator.wikimedia.org/T207643) (owner: 10DCausse) [12:18:30] (03PS1) 10Ema: icinga: remove unused check_http commands [puppet] - 10https://gerrit.wikimedia.org/r/468961 [12:29:45] thanks gehel dcausse onimisionipe for jumping on the problem! [12:30:01] yw! [12:30:35] godog: yw! 
[12:34:33] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.4564 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [12:35:43] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [12:43:21] (03PS1) 10Filippo Giunchedi: Enable mmkubernetes (build depends on libcurl and liblognorm) and build rsyslog-kubernetes [debs/rsyslog] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/468965 (https://phabricator.wikimedia.org/T206633) [12:44:32] (03CR) 10Filippo Giunchedi: "Note we might not need this, depending if mmkubernetes gets enabled on Debian in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=911299" [debs/rsyslog] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/468965 (https://phabricator.wikimedia.org/T206633) (owner: 10Filippo Giunchedi) [12:49:30] godog: that patch should take care of the immediate issue, but we still need a more general rate limit [12:49:52] godog: ping us if you see other similar problem in the meantime [12:52:32] gehel: will do! now the change should be deployed already, in case it happens again? [12:53:23] for this specific logger, yes, it should be deployed [12:56:16] ack, thanks! [12:56:39] yeah my "mitigation" on the logstash side about bumping the receive buffers didn't help a whole lot [13:08:51] !log kartik@deploy1001 Started deploy [cxserver/deploy@5f53734]: Update cxserver to 7f996f3 (T207445) [13:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:54] T207445: CX2: Paragraph not added to the translation with MT failure message - https://phabricator.wikimedia.org/T207445 [13:09:05] godog: I'm writing a rate limiter for wdqs logging. Any idea what a sane limit should be for a first try? [13:10:04] I'm thinking 5k events per minute. 
That seems large enough to only be triggered when there is a real issue and low enough that we probably don't overload logstash too much [13:11:53] (03PS2) 10Giuseppe Lavagetto: mediawiki::web::vhost: allow serving content from php7 [puppet] - 10https://gerrit.wikimedia.org/r/467878 (https://phabricator.wikimedia.org/T206338) [13:12:44] !log kartik@deploy1001 Finished deploy [cxserver/deploy@5f53734]: Update cxserver to 7f996f3 (T207445) (duration: 03m 53s) [13:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:50] gehel: 5k/min per host? [13:13:32] yep, per host [13:14:45] honestly, no idea what a sane limit should be. The peak on this issue seemed to be 300k/minute/host as received by logstash [13:16:04] (03CR) 10Ottomata: [C: 031] "thank youuuu!" [puppet] - 10https://gerrit.wikimedia.org/r/468915 (owner: 10Elukey) [13:16:27] 300k/minute is obviously insane! [13:16:28] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1002/13137/ the change is a noop as it stands, I'd like reviews on the logic though." 
[puppet] - 10https://gerrit.wikimedia.org/r/467878 (https://phabricator.wikimedia.org/T206338) (owner: 10Giuseppe Lavagetto) [13:16:40] (03CR) 10Ottomata: [C: 031] "grrrr8" [puppet] - 10https://gerrit.wikimedia.org/r/468941 (owner: 10Elukey) [13:18:00] (03CR) 10Ottomata: [C: 031] eventlogging: remove upstart/Trusty and Jessie support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/468941 (owner: 10Elukey) [13:18:15] (03CR) 10Ottomata: [C: 031] eventlogging::server: rotate logs on size (not only on time) [puppet] - 10https://gerrit.wikimedia.org/r/468718 (owner: 10Elukey) [13:19:16] gehel: indeed, I don't have a good sense of what the limit really is, though bursts are problematic for sure, even inserting some pacing say 30/s would work I think [13:19:42] !log Run myloader for enwikivoyage cebwiki shwiki srwiki mgwiktionary on db2052 (s5 codfw master) - T184805 [13:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:46] T184805: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 [13:20:19] (03PS1) 10Ema: ATS: define icinga check for HTTP responses [puppet] - 10https://gerrit.wikimedia.org/r/468971 (https://phabricator.wikimedia.org/T204209) [13:20:50] gehel: I mentioned per-second because I suspect per-minute would yield similar problems with bursts [13:20:55] (03PS5) 10Herron: confluent::kafka::common: force provider => 'systemd' for services [puppet] - 10https://gerrit.wikimedia.org/r/468498 (https://phabricator.wikimedia.org/T206454) [13:21:41] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure, 10Technical-Debt, 10Tracking: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220 (10Joe) [13:21:44] 10Operations, 10Wikimedia-Apache-configuration, 10Patch-For-Review, 10User-Joe: Re-organize the apache configuration for MediaWiki in puppet - https://phabricator.wikimedia.org/T196968 (10Joe) 05Open>03Resolved [13:21:47] 10Operations, 10Core Platform Team 
Backlog (Watching / External), 10HHVM, 10Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Joe) [13:21:55] (03CR) 10Herron: [C: 032] confluent::kafka::common: force provider => 'systemd' for services [puppet] - 10https://gerrit.wikimedia.org/r/468498 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [13:21:56] godog: so 100 log event / second? [13:22:04] * gehel is inventing numbers [13:22:14] 10Operations, 10Wikimedia-Apache-configuration, 10Patch-For-Review, 10User-Joe: Re-organize the apache configuration for MediaWiki in puppet - https://phabricator.wikimedia.org/T196968 (10Joe) This is now fully done in both deployment-prep (beta) and production. [13:22:19] let's start with something, see how it works and tune it [13:22:42] gehel: yeah sounds like a good first step [13:23:25] (03CR) 10Elukey: [C: 032] eventlogging: remove upstart/Trusty and Jessie support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/468941 (owner: 10Elukey) [13:25:23] PROBLEM - puppet last run on ms-be1040 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdc1],Exec[xfs_label-/dev/sdb3],Exec[xfs_label-/dev/sdb4] [13:28:19] (03CR) 10Ottomata: [C: 031] eventlogging: remove upstart/Trusty and Jessie support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/468941 (owner: 10Elukey) [13:30:00] 10Operations, 10Product-Analytics: Upload shiny-server .deb to our Jessie apt repository - https://phabricator.wikimedia.org/T168967 (10mpopov) >>! In T204688#4685007, @Krenair wrote: > Hi all, just to let you know this now has a deadline of 2018-12-18 per https://wikitech.wikimedia.org/wiki/News/Trusty_deprec... 
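
A note on the rate-limit discussion above (5k events/minute per host versus roughly 100 events/second): the difference is mostly about how large a burst gets through before the limiter bites. The sketch below is a generic token bucket in Python, not the wdqs logging change itself; the class names, the `burst` parameter and the `FakeClock` helper are all hypothetical and exist only to make the comparison reproducible.

```python
import time


class TokenBucket:
    """Generic token bucket: refills at `rate` tokens/second, holds at most `burst`.

    Illustrative only; this is not the limiter that was written for wdqs.
    """

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = float(rate)      # tokens added per second
        self.burst = float(burst)    # maximum tokens the bucket can hold
        self.clock = clock           # injectable clock, handy for testing
        self.tokens = float(burst)   # start with a full bucket
        self.last = clock()

    def allow(self):
        """Return True if one event may pass right now, False if it is dropped."""
        now = self.clock()
        # Refill in proportion to the time elapsed since the last call.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class FakeClock:
    """Hypothetical manual clock so the example does not depend on wall time."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


def admitted(limiter, n_events):
    """Count how many of n_events arriving at the same instant are let through."""
    return sum(1 for _ in range(n_events) if limiter.allow())
```

With a per-minute budget (rate 5000/60 per second, burst 5000) an instantaneous spike of 5000 events passes untouched, while a per-second budget (rate 100, burst 100) passes only the first 100 and then paces the rest. That matches the concern raised in the channel that a per-minute limit would still let bursts hit logstash.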
[13:31:56] (03CR) 10Ema: [C: 032] "pcc looks fine: https://puppet-compiler.wmflabs.org/compiler1002/13138/" [puppet] - 10https://gerrit.wikimedia.org/r/468971 (https://phabricator.wikimedia.org/T204209) (owner: 10Ema) [13:32:05] (03PS2) 10Ema: ATS: define icinga check for HTTP responses [puppet] - 10https://gerrit.wikimedia.org/r/468971 (https://phabricator.wikimedia.org/T204209) [13:34:43] 10Operations, 10Product-Analytics: Upload shiny-server .deb to our Jessie apt repository - https://phabricator.wikimedia.org/T168967 (10MoritzMuehlenhoff) Are these only used in Cloud VPS, not in production? If the former, we don't need to mirror them to our apt.wikimedia.org repository, they could just as wel... [13:39:15] 10Operations, 10Traffic: Varnish won't purge thumbnails of specific file - https://phabricator.wikimedia.org/T207615 (10Gilles) a:03BBlack [13:43:56] 10Operations, 10Fundraising-Backlog, 10SRE-Access-Requests, 10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10MoritzMuehlenhoff) I'm removing SRE-Access-Requests here, as this request applies to frtech. 
[13:44:05] 10Operations, 10Fundraising-Backlog, 10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10MoritzMuehlenhoff) [13:45:04] 10Operations, 10Discovery-Wikidata-Query-Service-Sprint: WDQS logging to logstash should be rate limited - https://phabricator.wikimedia.org/T207656 (10Gehel) [13:45:43] RECOVERY - puppet last run on ms-be1040 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:46:04] 10Operations, 10Discovery-Wikidata-Query-Service-Sprint: WDQS logging to logstash should be rate limited - https://phabricator.wikimedia.org/T207656 (10Gehel) p:05Triage>03High [13:48:54] (03PS1) 10Filippo Giunchedi: debian: add packaging [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/468978 (https://phabricator.wikimedia.org/T206633) [13:49:28] akosiaris: is compiler.puppet.eqiad.wmflabs still doing anything? It's running trusty, >4 years old, and none of us can log in to it :) [13:49:40] (03PS1) 10Gehel: wdqs: rate limit log sent to logstash [puppet] - 10https://gerrit.wikimedia.org/r/468979 [13:54:31] 10Operations, 10Product-Analytics: Upload shiny-server .deb to our Jessie apt repository - https://phabricator.wikimedia.org/T168967 (10mpopov) Currently Shiny Server is available (via its developer, RStudio) as a package only for Ubuntu Trusty. This task is about packaging it up ourselves to make it available... [13:56:14] (03CR) 10Elukey: "Andrew: about this, I think that I sorted it out in grafana simply using the instance port.. Have you checked it?" 
[puppet] - 10https://gerrit.wikimedia.org/r/468056 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [13:56:44] (03CR) 10Elukey: [C: 031] hiera: remove diamond from mediawiki role [puppet] - 10https://gerrit.wikimedia.org/r/466905 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [13:57:11] (03PS1) 10Faidon Liambotis: pdns_server: set pdns.conf mode to 0440 [puppet] - 10https://gerrit.wikimedia.org/r/468981 [13:57:49] 10Operations, 10Product-Analytics: Upload shiny-server .deb to our Jessie apt repository - https://phabricator.wikimedia.org/T168967 (10Paladox) It seems they do package for debian, see https://cran.rstudio.com/bin/linux/debian/ [13:58:01] 10Operations, 10Product-Analytics: Upload shiny-server .deb to our Stretch apt repository - https://phabricator.wikimedia.org/T168967 (10mpopov) [13:59:12] (03CR) 10Andrew Bogott: [C: 032] pdns_server: set pdns.conf mode to 0440 [puppet] - 10https://gerrit.wikimedia.org/r/468981 (owner: 10Faidon Liambotis) [13:59:36] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, and 2 others: Routing RFC1918 private IP addresses to/from WMCS floating IPs - https://phabricator.wikimedia.org/T206261 (10herron) Thanks @aborrero! After a quick ping test using the cloudinfra project I'm still seeing traffic originate from `18... [14:00:32] 10Operations, 10Product-Analytics: Upload shiny-server .deb to our Stretch apt repository - https://phabricator.wikimedia.org/T168967 (10mpopov) >>! In T168967#4685654, @Paladox wrote: > It seems they do package for debian, see https://cran.rstudio.com/bin/linux/debian/ That's the RStudio-hosted mirror of CRA... 
[14:01:25] (03PS1) 10Herron: Revert "Revert "site: enable logging Kafka on Logstash nodes"" [puppet] - 10https://gerrit.wikimedia.org/r/468983 [14:01:30] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to servers for Release Engineering tasks for Lars Wirzenius - https://phabricator.wikimedia.org/T206612 (10jijiki) a:03jijiki [14:02:22] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM from a quick look" [puppet] - 10https://gerrit.wikimedia.org/r/468961 (owner: 10Ema) [14:02:56] (03PS2) 10Herron: site: enable logging Kafka on Logstash nodes [puppet] - 10https://gerrit.wikimedia.org/r/468983 (https://phabricator.wikimedia.org/T206454) [14:07:33] (03PS3) 10Herron: site: enable logging Kafka on Logstash nodes [puppet] - 10https://gerrit.wikimedia.org/r/468983 (https://phabricator.wikimedia.org/T206454) [14:08:29] (03CR) 10Herron: [C: 032] site: enable logging Kafka on Logstash nodes [puppet] - 10https://gerrit.wikimedia.org/r/468983 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [14:08:42] 10Operations, 10Traffic, 10Patch-For-Review: Define and deploy Icinga checks for ATS backends - https://phabricator.wikimedia.org/T204209 (10ema) [14:11:10] akosiaris and godog: Logstash in beta is empty again [14:11:45] Also I deployed the change for sending logs to logstash in prod but can't see anything there. it might be because it's only ERROR level [14:12:54] PROBLEM - Check systemd state on logstash1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[14:13:44] ^ that’s me [14:15:29] (03PS1) 10Mathew.onipe: admin: add aaron(Aaron Schulz) to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/468989 (https://phabricator.wikimedia.org/T207090) [14:16:05] 10Puppet, 10Cloud-VPS (Ubuntu Trusty Deprecation): cloudvps: puppet project trusty deprecation - https://phabricator.wikimedia.org/T204558 (10Krenair) 05Resolved>03Open Sorry, reopening this one because one was missed (on account of being inaccessible when Cumin was being run to find all trusty instances):... [14:17:19] 10Operations, 10ops-codfw: scb2001: Power supply failure - https://phabricator.wikimedia.org/T207629 (10Papaul) p:05Triage>03Normal [14:17:27] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure, 10Technical-Debt, 10Tracking: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220 (10Krenair) [14:17:34] 10Puppet, 10Beta-Cluster-Infrastructure, 10Goal, 10Patch-For-Review: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644 (10Krenair) [14:18:10] 10Operations, 10Patch-For-Review: Netbox: explore NAPALM integration - https://phabricator.wikimedia.org/T205898 (10ayounsi) The upgrade solved the issue, and brought the next one. @Volans investigated it and it seems like we're hitting that one: https://github.com/paramiko/paramiko/issues/23 In short, we have... [14:18:14] PROBLEM - puppet last run on logstash1008 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:20:11] (03PS3) 10Giuseppe Lavagetto: mediawiki::web::vhost: allow serving content from php7 [puppet] - 10https://gerrit.wikimedia.org/r/467878 (https://phabricator.wikimedia.org/T206338) [14:20:13] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::beta_sites: enable serving content from php7-fpm [puppet] - 10https://gerrit.wikimedia.org/r/468990 (https://phabricator.wikimedia.org/T206338) [14:25:15] 10Operations, 10netops, 10Goal: Increase network capacity (2018-19 Q1 Goal) - https://phabricator.wikimedia.org/T199142 (10ayounsi) 05Open>03Invalid Marking it as invalid as we did not reach that goal fully: Done: - eqiad: 1 rows with 3*10G racks - ulsfo: replace routers - eqdfw: replace router Not d... [14:27:23] PROBLEM - puppet last run on logstash1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:28:05] (03PS4) 10Giuseppe Lavagetto: mediawiki::web::vhost: allow serving content from php7 [puppet] - 10https://gerrit.wikimedia.org/r/467878 (https://phabricator.wikimedia.org/T206338) [14:28:07] (03PS2) 10Giuseppe Lavagetto: mediawiki::web::beta_sites: enable serving content from php7-fpm [puppet] - 10https://gerrit.wikimedia.org/r/468990 (https://phabricator.wikimedia.org/T206338) [14:33:30] PROBLEM - puppet last run on logstash1007 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:36:57] (03PS5) 10Giuseppe Lavagetto: mediawiki::web::vhost: allow serving content from php7 [puppet] - 10https://gerrit.wikimedia.org/r/467878 (https://phabricator.wikimedia.org/T206338) [14:36:59] (03PS3) 10Giuseppe Lavagetto: mediawiki::web::beta_sites: enable serving content from php7-fpm [puppet] - 10https://gerrit.wikimedia.org/r/468990 (https://phabricator.wikimedia.org/T206338) [14:39:11] 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596 (10Krenair) >>! In T174596#4685019, @aborrero wrote: > * I don't really understand what means the c... [14:42:24] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for nginx on debug proxies [puppet] - 10https://gerrit.wikimedia.org/r/466852 (https://phabricator.wikimedia.org/T135991) [14:42:46] 10Operations, 10Cloud-Services, 10netops: Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10ayounsi) p:05Triage>03Normal [14:43:18] (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for nginx on debug proxies [puppet] - 10https://gerrit.wikimedia.org/r/466852 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:44:20] RECOVERY - IPMI Sensor Status on scb2001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [14:47:42] (03PS2) 10Bstorm: sonofgridengine: remove non-working hiera file [puppet] - 10https://gerrit.wikimedia.org/r/468686 [14:47:58] 10Operations, 10Cloud-Services, 10netops: Consider renumbering Labs to separate address spaces - https://phabricator.wikimedia.org/T122406 (10ayounsi) Here are the tasks for the record: > - Deprecation of the old space, which is equivalent at this point to deprecation of nova-network. This is already planned... [14:47:59] (03CR) 10Ottomata: "Ah hm, seems fine! 
I think I'd prefer to at least have a label value "namenode", but atm I prefer this than bikeshedding the label name. " [puppet] - 10https://gerrit.wikimedia.org/r/468056 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [14:48:01] (03Abandoned) 10Ottomata: Add more labels to Hadoop daemon JMX prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/468056 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [14:48:36] PROBLEM - Kafka Broker Server on logstash1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties [14:48:46] (03CR) 10Bstorm: [C: 032] sonofgridengine: remove non-working hiera file [puppet] - 10https://gerrit.wikimedia.org/r/468686 (owner: 10Bstorm) [14:48:51] known ^ [14:48:52] hey? [14:48:54] ok [14:48:55] thanks :) [14:48:59] sorry for the spam! [14:48:59] ah ha [14:49:18] !log push firewall changes to pfw3-codfw - T207175 [14:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:22] T207175: add icinga1001 to send_nsca and pfw rules in FRACK - https://phabricator.wikimedia.org/T207175 [14:52:59] !log mforns@deploy1001 Started deploy [analytics/refinery@bbebc20]: deploying refinery together with refinery-source v0.0.79 [14:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:13] (03PS1) 10Bstorm: sonofgridengine: remove duplicate declaration with standard packages [puppet] - 10https://gerrit.wikimedia.org/r/468995 (https://phabricator.wikimedia.org/T200557) [14:59:56] (03CR) 10Bstorm: shinken: Adjustments necessary to upgrade 1.4->2.0 and Trusty->Jessie (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/468792 (https://phabricator.wikimedia.org/T204562) (owner: 10GTirloni) [15:00:14] (03CR) 10Bstorm: [C: 032] sonofgridengine: remove duplicate declaration with standard packages [puppet] - 10https://gerrit.wikimedia.org/r/468995 (https://phabricator.wikimedia.org/T200557) (owner: 
10Bstorm) [15:03:15] !log mforns@deploy1001 Finished deploy [analytics/refinery@bbebc20]: deploying refinery together with refinery-source v0.0.79 (duration: 10m 16s) [15:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:46] (03PS2) 10Mathew.onipe: admin: add aaron(Aaron Schulz) to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/468989 (https://phabricator.wikimedia.org/T207090) [15:05:20] (03PS5) 10GTirloni: shinken: Adjustments necessary to upgrade 1.4->2.0 and Trusty->Jessie [puppet] - 10https://gerrit.wikimedia.org/r/468792 (https://phabricator.wikimedia.org/T204562) [15:05:57] 10Operations, 10ops-codfw: scb2001: Power supply failure - https://phabricator.wikimedia.org/T207629 (10Papaul) @MoritzMuehlenhoff How do you want to do this? If I call Dell today they will send me a PSU tomorrow or I am flying to Chicago tomorrow so the PSU will be in shipping until I am back and the server... [15:08:18] (03PS1) 10Elukey: profile::analytics::refinery::job::camus: enable eventlogging-client-side [puppet] - 10https://gerrit.wikimedia.org/r/468998 (https://phabricator.wikimedia.org/T206542) [15:09:32] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting deployment access to servers for Performance Team task for perf-roots - https://phabricator.wikimedia.org/T207090 (10jijiki) a:03Mathew.onipe [15:12:29] (03CR) 10Ottomata: [C: 031] eventlogging::analytics: move to profiles [puppet] - 10https://gerrit.wikimedia.org/r/468957 (owner: 10Elukey) [15:15:01] ottomata elukey, kafka broker on logstash1004 looks like it starts via kafka-server-starts, prints the config and then exits 1, have you run into this before? 
[15:16:39] nope, very weird :( [15:17:30] (03PS6) 10GTirloni: shinken: Adjustments necessary to upgrade 1.4->2.0 and Trusty->Jessie [puppet] - 10https://gerrit.wikimedia.org/r/468792 (https://phabricator.wikimedia.org/T204562) [15:18:30] godog: mmm this looks weird [15:18:31] Error while executing ACL command: Zookeeper namespace does not exist [15:18:34] org.apache.kafka.common.config.ConfigException: Zookeeper namespace does not exist [15:18:41] I tried kafka acls --list [15:19:01] on logstash1004 ? [15:19:06] yeah [15:19:11] also I just noticed kafka 0.9 was installed not 1.1 [15:19:23] ah snap [15:19:34] sad_trombone.wav [15:20:03] (03CR) 10GTirloni: shinken: Adjustments necessary to upgrade 1.4->2.0 and Trusty->Jessie (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/468792 (https://phabricator.wikimedia.org/T204562) (owner: 10GTirloni) [15:20:08] so kafka 1.1 is shipped by confluent-kafka-2.11 correct? [15:20:18] while 0.9 is shipped by confluent-kafka-2.11.7 (?) [15:20:37] this is something that I am super ignorant about [15:20:57] IIRC the 2.X is the version of scala [15:21:36] ah wait this is jessie [15:21:50] that sounds right [15:22:32] root@install1002:/srv/wikimedia# reprepro ls confluent-kafka-2.11.7 [15:22:35] (scala version) [15:22:35] confluent-kafka-2.11.7 | 0.9.0.1-1 | trusty-wikimedia | amd64 [15:22:36] godog: --^ [15:22:36] yeah jessie, anything specific we should consider for jessie? [15:22:38] confluent-kafka-2.11.7 | 0.9.0.1-1 | jessie-wikimedia | amd64 [15:22:41] root@install1002:/srv/wikimedia# reprepro ls confluent-kafka-2.11 [15:22:43] confluent-kafka-2.11 | 1.1.0-1 | stretch-wikimedia | amd64, i386 [15:22:51] godog: y jessie? just curious?
[15:23:04] we could also import the same .deb for jessie i guess [15:23:06] ottomata: heh because logstash is still jessie [15:24:27] (03CR) 10Elukey: [C: 032] profile::analytics::refinery::job::camus: enable eventlogging-client-side [puppet] - 10https://gerrit.wikimedia.org/r/468998 (https://phabricator.wikimedia.org/T206542) (owner: 10Elukey) [15:24:29] ah yeah I see confluent-kafka is in another component too in stretch-wikimedia [15:24:40] thirdparty/confluent ? [15:24:59] yeah, ok I'll try the 1.1 package first manually [15:25:22] 10Operations, 10User-jijiki: Redefine privileges and access for perf-roots group - https://phabricator.wikimedia.org/T207666 (10jijiki) [15:27:24] (03CR) 10Dzahn: "i think this is stalled waiting on consensus on the ticket on how to proceed with gerrit avatars in a wider sense (whether to use phab or " [puppet] - 10https://gerrit.wikimedia.org/r/456437 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox) [15:27:27] ok 1.1 works as expected! meaning it fails but verbosely [15:28:07] ahhahah [15:28:09] nice [15:28:28] godog: if you guys are using the ACLs there should be some doc about it [15:28:32] not sure if you've read it [15:28:37] (they are stored in zookeeper) [15:28:48] (03CR) 10Cwhite: [C: 031] thumbor: use statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/467988 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [15:29:29] elukey: no haven't read about it yet but thanks for the pointer! 
[15:33:10] RECOVERY - Check systemd state on logstash1004 is OK: OK - running: The system is fully operational [15:33:16] RECOVERY - Kafka Broker Server on logstash1004 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties [15:33:56] ok after having put the keystore passwords in, looks like it works [15:34:18] the paging works too :-P [15:35:47] godog: https://wikitech.wikimedia.org/wiki/Kafka/Administration#Kafka_ACLs [15:35:57] !log push firewall changes to pfw3-eqiad - T207175 [15:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:00] T207175: add icinga1001 to send_nsca and pfw rules in FRACK - https://phabricator.wikimedia.org/T207175 [15:36:56] 10Operations, 10cloud-services-team: Sporadic puppet failures - https://phabricator.wikimedia.org/T201247 (10Andrew) Things seem better this week! Is that my imagination? [15:38:07] (03PS3) 10Andrew Bogott: shinken: Remove broken 'Keyholder status' check [puppet] - 10https://gerrit.wikimedia.org/r/468848 (https://phabricator.wikimedia.org/T183454) (owner: 10GTirloni) [15:38:12] (03PS1) 10Herron: Revert "site: enable logging Kafka on Logstash nodes" [puppet] - 10https://gerrit.wikimedia.org/r/469008 [15:39:23] (03CR) 10Andrew Bogott: [C: 032] shinken: Remove broken 'Keyholder status' check [puppet] - 10https://gerrit.wikimedia.org/r/468848 (https://phabricator.wikimedia.org/T183454) (owner: 10GTirloni) [15:39:32] (03PS7) 10Elukey: eventlogging::analytics: move to profiles [puppet] - 10https://gerrit.wikimedia.org/r/468957 [15:40:33] (03CR) 10Elukey: [C: 032] eventlogging::analytics: move to profiles [puppet] - 10https://gerrit.wikimedia.org/r/468957 (owner: 10Elukey) [15:41:02] (03CR) 10Gehel: "See comments inline." 
(032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/458115 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:42:28] godog: herron btw i think you can do monitoring_enabled: false on your kafka cluster until you got it all set up [15:42:30] and it won't page [15:42:31] i think... [15:42:48] oh nice! good to know [15:43:17] (03CR) 10Herron: [C: 032] "Reverting due to missing confluent-kafka-2.11 1.1 jessie package, java keystore password issue and additional puppet modifications needed " [puppet] - 10https://gerrit.wikimedia.org/r/468983 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [15:43:43] (03PS2) 10Herron: Revert "site: enable logging Kafka on Logstash nodes" [puppet] - 10https://gerrit.wikimedia.org/r/469008 [15:44:35] (03CR) 10Herron: [C: 032] Revert "site: enable logging Kafka on Logstash nodes" [puppet] - 10https://gerrit.wikimedia.org/r/469008 (owner: 10Herron) [15:45:57] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, and 2 others: Routing RFC1918 private IP addresses to/from WMCS floating IPs - https://phabricator.wikimedia.org/T206261 (10aborrero) >>! In T206261#4685659, @herron wrote: > Thanks @aborrero! > > After a quick ping test using the cloudinfra proj... [15:46:24] (03CR) 10Gehel: [C: 04-1] "minor comments inline." (033 comments) [labs/private] - 10https://gerrit.wikimedia.org/r/468631 (https://phabricator.wikimedia.org/T206639) (owner: 10Mathew.onipe) [15:46:31] 10Operations, 10ops-codfw: scb2001: Power supply failure - https://phabricator.wikimedia.org/T207629 (10MoritzMuehlenhoff) It's fine to wait with the component swap until you're back. Thanks. 
[15:47:30] (03PS11) 10Gehel: scap::target: added additional_services_names param [puppet] - 10https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) (owner: 10Mathew.onipe) [15:50:10] RECOVERY - puppet last run on logstash1007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:51:01] (03CR) 10Gehel: [C: 032] scap::target: added additional_services_names param [puppet] - 10https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) (owner: 10Mathew.onipe) [15:51:38] mobrovac: ^ fyi, I merged that scap refactoring, it should be fine, but ping me if you see something strange [15:53:59] RECOVERY - puppet last run on logstash1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:57:17] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (10Andrew) I've created a new VM, t206636-2.wikidata-query.eqiad.wmflabs. This is... [15:57:37] 10Operations, 10Analytics, 10Traffic, 10User-Elukey: Sort out analytics service dependency issues for cp* cache hosts - https://phabricator.wikimedia.org/T128374 (10Milimetric) The fix for this would be high risk and low gain. So keeping around to just have context in case the problem does manifest. [15:59:44] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint, 10Patch-For-Review: Optimize networking configuration for WDQS - https://phabricator.wikimedia.org/T206105 (10Gehel) Some minimal packet drop is still seen (< 100 packet / 24h), so the situation is very much be... 
[16:00:57] 10Operations, 10Traffic, 10netops, 10Goal: Increase network capacity (2018-19 Q2 Goal) - https://phabricator.wikimedia.org/T207668 (10ayounsi) p:05Triage>03Normal [16:03:36] 10Operations, 10monitoring, 10Patch-For-Review: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (10ayounsi) [16:03:41] 10Operations, 10fundraising-tech-ops, 10netops: add icinga1001 to send_nsca and pfw rules in FRACK - https://phabricator.wikimedia.org/T207175 (10ayounsi) 05Open>03Resolved [16:04:41] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: eqiad1: fix dmz_cidr adressing for inter-VM connections [puppet] - 10https://gerrit.wikimedia.org/r/469019 (https://phabricator.wikimedia.org/T206261) [16:04:46] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-fgiunchedi: Setup Kafka cluster, producers and consumers for logging pipeline - https://phabricator.wikimedia.org/T206454 (10herron) There were three issues observed during todays aborted deploy of logging kafka (https://gerrit.wikimedia.org/r/... [16:05:07] (03PS4) 10Mforns: Fine tune eventlogging_to_druid_job spark and druid parameters [puppet] - 10https://gerrit.wikimedia.org/r/468588 (https://phabricator.wikimedia.org/T206342) [16:06:09] (03CR) 10Mforns: "I think this is ready for merging if good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/468588 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns) [16:07:38] (03PS1) 10Andrew Bogott: horizon: move 'suggestbot' to the new neutron region [puppet] - 10https://gerrit.wikimedia.org/r/469020 (https://phabricator.wikimedia.org/T204745) [16:08:23] (03CR) 10Andrew Bogott: [C: 032] horizon: move 'suggestbot' to the new neutron region [puppet] - 10https://gerrit.wikimedia.org/r/469020 (https://phabricator.wikimedia.org/T204745) (owner: 10Andrew Bogott) [16:08:26] 10Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10ayounsi) [16:08:35] (03PS1) 10CRusnov: icinga: add crusnov to icinga stuff [puppet] - 10https://gerrit.wikimedia.org/r/469021 (https://phabricator.wikimedia.org/T207009) [16:10:27] (03Abandoned) 10CRusnov: icinga: add crusnov to icinga stuff [puppet] - 10https://gerrit.wikimedia.org/r/468612 (https://phabricator.wikimedia.org/T207009) (owner: 10CRusnov) [16:11:29] 10Operations, 10Cloud-Services, 10netops: Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10aborrero) I don't see any problem with this from the top of my head. I would ask @chasemp to see if he can see any issue with this new setting. Please note that d... [16:14:44] chaomodus hi, if you have the same Change-Id gerrit can be smart and update the existing change :) [16:15:01] (if you remove the Change-Id a new one is generated so a new change is created) [16:15:40] RECOVERY - puppet last run on logstash1008 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [16:16:34] ooooh [16:16:37] that explains it [16:16:38] thanks! 
[16:16:52] i outsmarted the smartness [16:17:00] lol [16:20:12] (03PS2) 10Arturo Borrero Gonzalez: cloudvps: eqiad1: fix dmz_cidr adressing for inter-VM connections [puppet] - 10https://gerrit.wikimedia.org/r/469019 (https://phabricator.wikimedia.org/T206261) [16:23:45] welcome chaomodus BTW :-) [16:24:22] !log T206261 2h icinga downtime cloudnet1003/4 for another patch [16:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:27] T206261: Routing RFC1918 private IP addresses to/from WMCS floating IPs - https://phabricator.wikimedia.org/T206261 [16:24:53] (03CR) 10Arturo Borrero Gonzalez: [C: 032] cloudvps: eqiad1: fix dmz_cidr adressing for inter-VM connections [puppet] - 10https://gerrit.wikimedia.org/r/469019 (https://phabricator.wikimedia.org/T206261) (owner: 10Arturo Borrero Gonzalez) [16:25:48] hah thanks :) [16:27:50] PROBLEM - HHVM jobrunner on mw1336 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:50] RECOVERY - HHVM jobrunner on mw1336 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [16:34:47] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, and 2 others: Routing RFC1918 private IP addresses to/from WMCS floating IPs - https://phabricator.wikimedia.org/T206261 (10aborrero) Please @herron try now. This is my test: ``` aborrero@cloudinfra-puppetmaster-01:~$ ping -c1 185.15.56.18 PING 1... [16:34:49] Amir1: gah forgot to reply, could you reopen the related task or open a new one? I'll forget about it otherwise [16:34:52] 10Operations, 10ops-ulsfo, 10decommission: decommission backup4001 - https://phabricator.wikimedia.org/T161904 (10RobH) [16:36:49] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting deployment access to servers for Performance Team task for perf-roots - https://phabricator.wikimedia.org/T207090 (10ArielGlenn) What's the plan for the rest of the perf-roots and getting them this access?
[16:42:48] 10Operations, 10Puppet, 10Continuous-Integration-Config, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Get rid of "import realm.pp" in manifests/site.pp - https://phabricator.wikimedia.org/T154915 (10dduvall) @hashar, this is marked as "Done" in RelEng Kanban. Any update? [16:43:21] 10Operations, 10DBA, 10JADE, 10Epic, and 2 others: [Epic] Extension:JADE scalability concerns - https://phabricator.wikimedia.org/T196547 (10awight) [16:43:26] 10Operations, 10DBA, 10JADE, 10MW-1.32-notes (WMF-deploy-2018-09-25 (1.32.0-wmf.23)), and 4 others: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (10awight) 05Open>03Resolved [16:46:32] 10Operations, 10Cloud-Services, 10netops: Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10chasemp) No technical blockers to this VLAN have public IPs that I know of. Agreed that the switchover would be difficult to make transparent to users. It's poss... [16:46:44] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) @daniel @Krinkle @Catrope @Marostegui We're ready for another round of TechCom and DBA review, at... [16:47:38] (03PS1) 10Urbanecm: Revert "Anniversary logo for cswiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469027 (https://phabricator.wikimedia.org/T207589) [16:47:59] (03CR) 10Urbanecm: [C: 04-1] "Do not merge without my approval." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469027 (https://phabricator.wikimedia.org/T207589) (owner: 10Urbanecm) [16:56:50] PROBLEM - puppet last run on logstash1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:58:20] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting deployment access to servers for Performance Team task for perf-roots - https://phabricator.wikimedia.org/T207090 (10jijiki) >>! In T207090#4686186, @ArielGlenn wrote: > What's the plan for the rest of the perf-roots and getting them this ac... [16:59:52] (03CR) 10Jforrester: "Security have given permission to go to Beta Cluster." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446841 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [17:00:04] gehel and onimisionipe: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181022T1700). [17:00:35] onimisionipe: ping me if you need help on that one! [17:01:11] jouncebot: now [17:01:11] For the next 0 hour(s) and 28 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181022T1700) [17:01:35] (03CR) 10Dzahn: [C: 031] "approved in SRE meeting" [puppet] - 10https://gerrit.wikimedia.org/r/468327 (https://phabricator.wikimedia.org/T207330) (owner: 10Dzahn) [17:01:58] jouncebot: next [17:01:58] In 0 hour(s) and 58 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181022T1800) [17:04:00] gehel: alright! 
[17:05:04] !log andrew@deploy1001 Started deploy [horizon/deploy@431a55d]: Rolling out fix for T207510 [17:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:08] T207510: 'Detach Interface' option visible on Horizon - https://phabricator.wikimedia.org/T207510 [17:05:09] PROBLEM - IPMI Sensor Status on scb2001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [17:07:42] !log disable puppet fleet-wide for puppetmaster1001 uplink move [17:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:00] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 57.68 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:08:44] (03CR) 10Volans: [C: 031] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/469021 (https://phabricator.wikimedia.org/T207009) (owner: 10CRusnov) [17:08:44] !log andrew@deploy1001 Finished deploy [horizon/deploy@431a55d]: Rolling out fix for T207510 (duration: 03m 40s) [17:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:58] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10daniel) moving to TechCom inbox for review [17:10:30] !log moving puppetmaster1001 uplink to asw2-b [17:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:08] !log re-enable puppet fleet-wide for puppetmaster1001 uplink move [17:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:32] !log elukey@deploy1001 Started deploy [analytics/refinery@1de5f44]: Deploy new version of Camus and pageview whitelist [17:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[17:14:49] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is OK: (C)60 le (W)70 le 83.97 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:15:57] (03CR) 10Dzahn: [C: 031] "approved in SRE meeting" [puppet] - 10https://gerrit.wikimedia.org/r/468989 (https://phabricator.wikimedia.org/T207090) (owner: 10Mathew.onipe) [17:16:23] !log analytics1068 down for mother board swap [17:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:58] (03PS1) 10Elukey: profile::analytics::refinery::job::camus: upgrade camus jar [puppet] - 10https://gerrit.wikimedia.org/r/469037 [17:19:32] ottomata: --^ (if you have time) [17:19:38] !log elukey@deploy1001 Finished deploy [analytics/refinery@1de5f44]: Deploy new version of Camus and pageview whitelist (duration: 07m 05s) [17:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:43] (03CR) 10Ottomata: [C: 031] profile::analytics::refinery::job::camus: upgrade camus jar [puppet] - 10https://gerrit.wikimedia.org/r/469037 (owner: 10Elukey) [17:19:54] thanks! 
[17:20:19] PROBLEM - Host analytics1068.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:20:34] !log disable cr2:xe-4/0/0 (to asw-a) for optics replacement - T203719 [17:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:37] T203719: Interface errors on cr2-eqiad:xe-4/0/0 - https://phabricator.wikimedia.org/T203719 [17:20:59] Chris is working on an1068, motherboard replacement [17:21:04] coo [17:21:56] (03CR) 10Elukey: [C: 032] profile::analytics::refinery::job::camus: upgrade camus jar [puppet] - 10https://gerrit.wikimedia.org/r/469037 (owner: 10Elukey) [17:22:18] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@975a67b]: WDQS deployment - GUI update and binaries upgrade [17:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:05] (03PS2) 10CRusnov: icinga: add crusnov to icinga stuff [puppet] - 10https://gerrit.wikimedia.org/r/469021 (https://phabricator.wikimedia.org/T207009) [17:23:13] (03CR) 10Alexandros Kosiaris: [C: 032] Proton: Configure the fonts [puppet] - 10https://gerrit.wikimedia.org/r/468545 (https://phabricator.wikimedia.org/T199264) (owner: 10Mobrovac) [17:23:21] (03PS3) 10Alexandros Kosiaris: Proton: Configure the fonts [puppet] - 10https://gerrit.wikimedia.org/r/468545 (https://phabricator.wikimedia.org/T199264) (owner: 10Mobrovac) [17:23:44] !log enable cr2:xe-4/0/0 (to asw-a) for optics replacement - T203719 [17:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:47] chaomodus one other thing i would add is you can edit through the inline editor :) [17:26:50] 10Operations, 10ops-eqiad, 10netops: Interface errors on cr2-eqiad:xe-4/0/0 - https://phabricator.wikimedia.org/T203719 (10Cmjohnson) I replaced the optics on asw-a-eqiad:xe-2/1/2 {#3455} and cleaned the fiber. 
[17:28:57] (03CR) 10Ottomata: [C: 032] Fine tune eventlogging_to_druid_job spark and druid parameters [puppet] - 10https://gerrit.wikimedia.org/r/468588 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns) [17:29:03] (03PS5) 10Ottomata: Fine tune eventlogging_to_druid_job spark and druid parameters [puppet] - 10https://gerrit.wikimedia.org/r/468588 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns) [17:29:05] (03CR) 10Ottomata: [V: 032 C: 032] Fine tune eventlogging_to_druid_job spark and druid parameters [puppet] - 10https://gerrit.wikimedia.org/r/468588 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns) [17:31:58] 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596 (10ayounsi) Talked to Arturo on IRC, replying to my own questions. I thought dmz_cidr were only the... [17:32:39] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:34:05] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@975a67b]: WDQS deployment - GUI update and binaries upgrade (duration: 11m 47s) [17:34:05] onimisionipe, gehel: Is the deployment window over? Can I deploy some config changes? [17:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:11] Ha, timing. [17:34:15] 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596 (10faidon) >>! In T174596#4685786, @Krenair wrote: >>>! In T174596#4685019, @aborrero wrote: >> * I... [17:34:40] :P [17:35:14] James_F: Not yet. Gimmie few minutes. I'll let you know when its done! [17:35:26] Of course. 
[17:36:23] James_F: actually, wdqs is sufficiently isolated from mediawiki that risk is very minimal, go on if it is a mw-config change [17:36:32] Cool, will do. [17:37:05] whoo [17:37:18] (03CR) 10Jforrester: [C: 032] "Pre-SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446841 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [17:38:00] James_F: in theory that will need a full scap [17:38:15] but it will only need that for the production deployment [17:38:36] (03Merged) 10jenkins-bot: Install but don't enable the WikibaseMediaInfo extension, part I [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446841 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [17:38:45] !log mobrovac@deploy1001 Started deploy [proton/deploy@b3e254a]: Update Puppeteer to v1.9.0 - T207416 [17:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:48] (03CR) 10Jforrester: [C: 031] Adding TemplateWizard to Beta Features whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468483 (https://phabricator.wikimedia.org/T205290) (owner: 10Kaldari) [17:38:48] T207416: Upgrade Puppeteer to 1.9.0 - https://phabricator.wikimedia.org/T207416 [17:39:20] addshore: Yeah, we'll get the full scap tonight. [17:39:31] James_F: as part of? [17:39:48] James_F: You are good now! [17:39:49] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:39:55] addshore: Oh, right, the nightly sync isn't running right now. [17:39:58] indeed [17:40:08] addshore: We'll get it tomorrow morning with the train, then. 
:-D [17:40:10] but we will get one tomorrow anyway for the branch [17:40:11] yup [17:40:19] !log mobrovac@deploy1001 Finished deploy [proton/deploy@b3e254a]: Update Puppeteer to v1.9.0 - T207416 (duration: 01m 34s) [17:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:27] and we wont be on prod for a little time yet ;) [17:40:40] Well, depends how well this goes. :-D [17:40:46] James_F: is mediainfo in the make wmf branch script already? [17:41:29] (03CR) 10jenkins-bot: Install but don't enable the WikibaseMediaInfo extension, part I [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446841 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [17:41:31] addshore: Since July. [17:41:41] * James_F bows. [17:42:05] James_F: woo [17:42:06] (03CR) 10Jforrester: [C: 032] "Pre-SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446842 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [17:43:49] (03CR) 10Addshore: [C: 031] Install but don't enable the WikibaseMediaInfo extension, part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446842 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [17:43:53] (03CR) 10Addshore: [C: 031] Install but don't enable the WikibaseMediaInfo extension, part III [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446843 (https://phabricator.wikimedia.org/T137444) (owner: 10Jforrester) [17:43:59] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:44:00] (03CR) 10Addshore: [C: 031] Install but don't enable WikibaseMediaInfo on Beta Cluster Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466953 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [17:44:31] James_F: wgMediaInfoEnable is in MediaInfo itself? 
[17:44:52] (03CR) 10Addshore: [C: 031] Enable WikibaseMediaInfo on Beta Cluster Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466954 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [17:45:08] (03CR) 10Addshore: [C: 04-1] "Not yet :D" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446844 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [17:45:43] addshore: Yes, I added it to make deployment easier. [17:45:44] (03PS2) 10Cwhite: prometheus: add scrapes for apache on phabricator instances [puppet] - 10https://gerrit.wikimedia.org/r/468677 (https://phabricator.wikimedia.org/T183454) [17:45:56] (03CR) 10Alexandros Kosiaris: [C: 031] Enable mmkubernetes (build depends on libcurl and liblognorm) [debs/rsyslog] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/468603 (https://phabricator.wikimedia.org/T206633) (owner: 10Filippo Giunchedi) [17:46:13] addshore: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/WikibaseMediaInfo/+/master/extension.json#24 [17:46:14] !log jforrester@deploy1001 Synchronized wmf-config/extension-list: Add WikibaseMediaInfo i18n to cache cf. 
T180981 (duration: 00m 46s) [17:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:17] T180981: Deploy WikibaseMediaInfo extension to beta - https://phabricator.wikimedia.org/T180981 [17:46:47] 10Operations, 10Electron-PDFs, 10Proton, 10Epic, and 4 others: [EPIC] New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10mobrovac) [17:47:08] James_F: lovely [17:47:08] (03CR) 10Alexandros Kosiaris: [C: 031] "Would be indeed great if it was also solved upstream" [debs/rsyslog] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/468965 (https://phabricator.wikimedia.org/T206633) (owner: 10Filippo Giunchedi) [17:48:01] (03PS3) 10Jforrester: Install but don't enable the WikibaseMediaInfo extension, part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446842 (https://phabricator.wikimedia.org/T180981) [17:48:13] (03CR) 10Jforrester: [C: 032] "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446842 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [17:49:51] (03PS1) 10Varnent: Enable FileExporter for Gov Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469042 (https://phabricator.wikimedia.org/T207502) [17:50:17] (03Merged) 10jenkins-bot: Install but don't enable the WikibaseMediaInfo extension, part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446842 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [17:50:59] (03CR) 10jerkins-bot: [V: 04-1] Enable FileExporter for Gov Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469042 (https://phabricator.wikimedia.org/T207502) (owner: 10Varnent) [17:53:42] 10Puppet, 10Beta-Cluster-Infrastructure, 10Discovery-Search, 10Beta-Cluster-reproducible, 10Patch-For-Review: Elasticsearch puppet config changes broke puppet in various instances - https://phabricator.wikimedia.org/T205672 (10Krenair) [17:54:18] 10Operations, 10Wikimedia-Apache-configuration, 10Patch-For-Review, 
10User-Joe: Re-organize the apache configuration for MediaWiki in puppet - https://phabricator.wikimedia.org/T196968 (10cscott) w00t! Now we can do https://gerrit.wikimedia.org/r/368248 (T117845) ? [17:54:32] (03CR) 10C. Scott Ananian: "T196968 was just resolved today, so hopefully we can get back to this?" [puppet] - 10https://gerrit.wikimedia.org/r/368248 (https://phabricator.wikimedia.org/T117845) (owner: 10Fomafix) [17:56:28] PROBLEM - High lag on wdqs1003 is CRITICAL: 3602 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [17:56:32] (03PS4) 10Jforrester: Install but don't enable the WikibaseMediaInfo extension, part III [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446843 (https://phabricator.wikimedia.org/T137444) [17:56:59] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Allow enablement of the WikibaseMediaInfo, still off everywhere cf. T180981 (duration: 00m 48s) [17:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:03] T180981: Deploy WikibaseMediaInfo extension to beta - https://phabricator.wikimedia.org/T180981 [17:57:15] (03CR) 10Jforrester: [C: 032] Install but don't enable the WikibaseMediaInfo extension, part III [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446843 (https://phabricator.wikimedia.org/T137444) (owner: 10Jforrester) [17:58:08] RECOVERY - Host analytics1068.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.25 ms [17:58:22] (03Merged) 10jenkins-bot: Install but don't enable the WikibaseMediaInfo extension, part III [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446843 (https://phabricator.wikimedia.org/T137444) (owner: 10Jforrester) [17:58:56] the one after this will be the fun one :) [17:59:07] Ha. 
[17:59:13] 10Puppet, 10Beta-Cluster-Infrastructure, 10Discovery-Search, 10Beta-Cluster-reproducible, 10Patch-For-Review: Elasticsearch puppet config changes broke puppet in various instances - https://phabricator.wikimedia.org/T205672 (10dcausse) problems seen on deployment-logstash2 so far: - it has the cluster `b... [18:00:05] Deploy window Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181022T1800) [18:00:05] James_F, Jdlrobson, RoanKattouw, mooeypoo, and Niharika: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:26] (Heya, I'm already deploying now, obviously.) [18:00:29] :D [18:00:49] !log jforrester@deploy1001 Synchronized wmf-config/Wikibase.php: For WikibaseMediaInfo wikis, load basic Wikibase repo code cf. T180981 (duration: 00m 46s) [18:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:02] James_F: its snack time now, but I'm still ready and watching for the next patch [18:01:10] \o [18:01:24] mine is an UBN and 2 patches that need to be merged together [18:01:43] (03PS3) 10Jforrester: Install but don't enable WikibaseMediaInfo on Beta Cluster Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466953 (https://phabricator.wikimedia.org/T180981) [18:01:53] (03CR) 10Jforrester: [C: 032] Install but don't enable WikibaseMediaInfo on Beta Cluster Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466953 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [18:02:12] 10Puppet, 10Beta-Cluster-Infrastructure, 10Discovery-Search, 10Beta-Cluster-reproducible, 10Patch-For-Review: Elasticsearch puppet config changes broke puppet in various instances - https://phabricator.wikimedia.org/T205672 (10Krenair) See also {T205863} [18:02:13] (03PS3) 10Jforrester: Enable WikibaseMediaInfo on Beta Cluster 
Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466954 (https://phabricator.wikimedia.org/T180981) [18:02:34] jdlrobson: Hey, no worries, will get to them soon, hopefully. [18:02:52] woah [18:03:00] I just tried to git pull --rebase origin production [18:03:00] sweet [18:03:04] this happened: fatal: bad object 0000000000000000000000000000000000000000 [18:03:18] (03Merged) 10jenkins-bot: Install but don't enable WikibaseMediaInfo on Beta Cluster Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466953 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [18:04:06] Krenair: where is that from? [18:04:09] https://phabricator.wikimedia.org/P7709 [18:04:10] locally [18:04:25] heh [18:04:37] James_F: this next one we are going to pause at for a bit once it lands right? [18:05:02] addshore: Yeah, I'm not enabling until this afternoon if that works for you. [18:05:11] sounds good [18:05:12] (03CR) 10jenkins-bot: Install but don't enable the WikibaseMediaInfo extension, part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446842 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [18:05:14] (03CR) 10jenkins-bot: Install but don't enable the WikibaseMediaInfo extension, part III [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446843 (https://phabricator.wikimedia.org/T137444) (owner: 10Jforrester) [18:05:16] (03CR) 10jenkins-bot: Install but don't enable WikibaseMediaInfo on Beta Cluster Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466953 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [18:05:32] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: [Beta Cluster] Load but don't enable MediaInfo on Beta Commons cf. T180981 (duration: 00m 45s) [18:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:35] James_F: but the last one you did enabled wikibase right? 
[18:05:36] T180981: Deploy WikibaseMediaInfo extension to beta - https://phabricator.wikimedia.org/T180981 [18:05:54] addshore: Enabled Wikibase "repo" without any special config on Beta Cluster Commons, yes. [18:06:04] lovely [18:06:08] OK, the world hasn't ended, so I'll do the real SWATs. [18:06:14] * addshore will wait for jenkins and ci to deploy it to beta [18:07:03] Yeah, no need to rush it. [18:07:58] RECOVERY - Host analytics1068 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [18:08:37] (03CR) 10Jforrester: "I plan to deploy this in a few hours' time, once Adam is happy with how Wikibase is working." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466954 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [18:09:27] * addshore waits for https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/225531/console :) [18:11:02] (03PS7) 10Fomafix: Add additional aliases for sr-cyrl and sr-latn next to sr-ec and sr-el [puppet] - 10https://gerrit.wikimedia.org/r/368248 (https://phabricator.wikimedia.org/T117845) [18:16:01] (03CR) 10Dzahn: [C: 031] icinga: add crusnov to icinga stuff [puppet] - 10https://gerrit.wikimedia.org/r/469021 (https://phabricator.wikimedia.org/T207009) (owner: 10CRusnov) [18:16:52] (03PS1) 10Bstorm: sonofgridengine: add missing template for toolforge base profile [puppet] - 10https://gerrit.wikimedia.org/r/469044 (https://phabricator.wikimedia.org/T200557) [18:17:22] * addshore can not swat today, /me hasn't seen anyone that is yet? [18:17:26] (03PS2) 10Bstorm: sonofgridengine: add missing template for toolforge base profile [puppet] - 10https://gerrit.wikimedia.org/r/469044 (https://phabricator.wikimedia.org/T200557) [18:18:01] jdlrobson: Live on mwdebug1002. [18:18:08] addshore: ? I'm doing it. [18:18:09] on it [18:18:22] RoanKattouw: You around or should I test without you? 
[18:18:32] I'm here [18:18:34] (03PS3) 10CRusnov: icinga: add crusnov to icinga stuff [puppet] - 10https://gerrit.wikimedia.org/r/469021 (https://phabricator.wikimedia.org/T207009) [18:18:45] Cool. Sadly MW hasn't merged yet. [18:18:49] (03CR) 10Bstorm: [C: 032] sonofgridengine: add missing template for toolforge base profile [puppet] - 10https://gerrit.wikimedia.org/r/469044 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [18:19:29] (03CR) 10CRusnov: [C: 032] icinga: add crusnov to icinga stuff [puppet] - 10https://gerrit.wikimedia.org/r/469021 (https://phabricator.wikimedia.org/T207009) (owner: 10CRusnov) [18:19:43] James_F: I have one more patch to add to the queue if there's time. I'm also happy to deploy that myself once you're done if you prefer. [18:20:33] (03PS4) 10CRusnov: icinga: add crusnov to icinga stuff [puppet] - 10https://gerrit.wikimedia.org/r/469021 (https://phabricator.wikimedia.org/T207009) [18:21:25] James_F: aaaah, i hadn't seen any of the patches get a +2, but I guess they arent config one [18:21:46] James_F: are they both live? [18:21:48] addshore: Yeah, all backports except Niharika's. [18:21:50] as im seeing some issues [18:21:57] or just one of them? [18:21:58] jdlrobson: They should be both live on mwdebug1002. [18:22:33] jdlrobson: `master` has all submodules up-to-date. [18:22:44] (03PS1) 10Andrew Bogott: Horizon: Move 'integration' and 'shinken' projects to eqiad1. [puppet] - 10https://gerrit.wikimedia.org/r/469048 [18:22:46] jdlrobson: I've re-pulled. [18:24:57] (03CR) 10Jforrester: [C: 032] Adding TemplateWizard to Beta Features whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468483 (https://phabricator.wikimedia.org/T205290) (owner: 10Kaldari) [18:25:08] Niharika: I'll do yours whilst waiting for code to merge, if you're ready. 
[18:25:26] (03PS2) 10Jforrester: Adding TemplateWizard to Beta Features whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468483 (https://phabricator.wikimedia.org/T205290) (owner: 10Kaldari) [18:25:32] (03CR) 10Jforrester: [C: 032] "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468483 (https://phabricator.wikimedia.org/T205290) (owner: 10Kaldari) [18:25:36] James_F: Sure. [18:27:34] (03Merged) 10jenkins-bot: Adding TemplateWizard to Beta Features whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468483 (https://phabricator.wikimedia.org/T205290) (owner: 10Kaldari) [18:27:43] James_F: still testing. just want to be 100% sure [18:28:02] jdlrobson: No worries. :-) [18:28:28] Niharika: Live on mwdebug1002. [18:29:03] RoanKattouw: Live on mwdebug1002. [18:30:11] James_F: Good to go. [18:30:15] Kk. [18:30:34] James_F: I have one more patch to add to the queue if there's time. I'm also happy to deploy that myself once you're done if you prefer. -- in case you missed this earlier. [18:30:55] Niharika: Go for it. [18:31:10] But add to https://wikitech.wikimedia.org/wiki/Deployments#Week_of_October_22nd please. [18:31:27] Sure. Adding now. [18:31:37] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT Add TemplateWizard to the BF allow list T205290 (duration: 00m 48s) [18:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:40] T205290: Convert TemplateWizard into a beta feature - https://phabricator.wikimedia.org/T205290 [18:33:00] James_F: ok sync [18:33:23] James_F: Done. It's this one - https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/468730/ [18:33:38] jdlrobson: Doing. [18:33:42] James_F: Works [18:35:00] RoanKattouw: Ta, one second. [18:35:04] Niharika: You'll be next. 
[18:35:27] !log jforrester@deploy1001 Synchronized php-1.32.0-wmf.26/extensions/WikimediaEvents/modules/all/ext.wikimediaEvents.readingDepth.js: SWAT Fix reading depth logging part 1 T207423 (duration: 00m 46s) [18:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:30] T207423: Many errors on ReadingDepth.enable (?) schema - https://phabricator.wikimedia.org/T207423 [18:36:49] !log jforrester@deploy1001 Synchronized php-1.32.0-wmf.26/skins/MinervaNeue/resources/skins.minerva.scripts/pageIssuesLogger.js: SWAT Fix reading depth logging part 2 T207423 (duration: 00m 46s) [18:36:51] jdlrobson: Should now be everywhere. [18:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:09] James_F: k thanks! [18:37:26] 10Operations, 10Electron-PDFs, 10Proton, 10Epic, and 4 others: [EPIC] New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10mobrovac) [18:37:50] (03CR) 10jenkins-bot: Adding TemplateWizard to Beta Features whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468483 (https://phabricator.wikimedia.org/T205290) (owner: 10Kaldari) [18:37:58] (03PS2) 10Jforrester: Deploy TemplateWizard everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468730 (owner: 10Niharika29) [18:38:06] (03CR) 10Jforrester: [C: 032] Deploy TemplateWizard everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468730 (owner: 10Niharika29) [18:38:22] !log jforrester@deploy1001 Synchronized php-1.32.0-wmf.26/resources/src/mediawiki.rcfilters/styles/mw.rcfilters.ui.ChangesListWrapperWidget.highlightCircles.seenunseen.less: SWAT RCFIlters: Fix highlight circles for unseen changes T207472 (duration: 00m 46s) [18:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:25] T207472: Watchlist is no longer showing seen indicators and point types have become small - https://phabricator.wikimedia.org/T207472 [18:38:32] RoanKattouw: 
Should now be everywhere. [18:39:09] Looks good, thanks [18:39:40] (03Merged) 10jenkins-bot: Deploy TemplateWizard everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468730 (owner: 10Niharika29) [18:39:56] (03CR) 10Jforrester: [C: 04-2] Install but don't enable the WikibaseMediaInfo extension, part IV [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446844 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [18:40:29] Niharika: Live on mwdebug1002. [18:41:10] James_F: Testing. [18:42:15] ACKNOWLEDGEMENT - Device not healthy -SMART- on cloudvirt1019 is CRITICAL: cluster=misc device={cciss,6,cciss,7,cciss,8,cciss,9} instance=cloudvirt1019:9100 job=node site=eqiad Cas Rusnov https://phabricator.wikimedia.org/T196507 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudvirt1019&var-datasource=eqiad%2520prometheus%252Fops [18:43:15] (03PS1) 10Jforrester: [Beta Cluster] Re-disable WBMI on Beta Commons for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469055 (https://phabricator.wikimedia.org/T180981) [18:43:19] 10Operations, 10Patch-For-Review: Onboarding Cas Rusnov - https://phabricator.wikimedia.org/T207009 (10crusnov) [18:43:31] (03CR) 10Jforrester: [C: 032] [Beta Cluster] Re-disable WBMI on Beta Commons for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469055 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [18:43:55] [= [18:45:10] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T207398 (10crusnov) [18:45:13] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10crusnov) [18:45:21] (03Merged) 10jenkins-bot: [Beta Cluster] Re-disable WBMI on Beta Commons for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469055 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [18:48:51] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: 
[Beta] Temporarily disable WBMI from Beta Commons whilst Wikibse is fixed T180981 (duration: 00m 46s) [18:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:55] T180981: Deploy WikibaseMediaInfo extension to beta - https://phabricator.wikimedia.org/T180981 [18:50:58] !log jforrester@deploy1001 scap failed: average error rate on 4/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details) [18:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:26] (03PS1) 10Jforrester: Revert "[Beta Cluster] Re-disable WBMI on Beta Commons for now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469059 (https://phabricator.wikimedia.org/T180981) [18:51:50] (03CR) 10Jforrester: [C: 04-2] Revert "[Beta Cluster] Re-disable WBMI on Beta Commons for now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469059 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [18:53:00] (03PS4) 10Jforrester: Enable WikibaseMediaInfo on Beta Cluster Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466954 (https://phabricator.wikimedia.org/T180981) [18:53:27] (03CR) 10jenkins-bot: Deploy TemplateWizard everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468730 (owner: 10Niharika29) [18:53:29] (03CR) 10jenkins-bot: [Beta Cluster] Re-disable WBMI on Beta Commons for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469055 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [18:54:42] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT Deploy TemplateWizard everywhere T202545, re-try (duration: 00m 45s) [18:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:45] T202545: Deploy TemplateWizard - https://phabricator.wikimedia.org/T202545 [18:55:27] 10Operations, 10Cloud-Services, 10netops: Renumber 
cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10ayounsi) > Out of curiosity, I would expect some shortage of public IPv4 addressing, is not the case? We need to be careful with our public IPs indeed, but this is... [19:00:05] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:00:19] 10Operations, 10fundraising-tech-ops, 10netops: add icinga1001 to send_nsca and pfw rules in FRACK - https://phabricator.wikimedia.org/T207175 (10Dzahn) @cwdent has the send_nsca part also been done? [19:12:17] 10Operations, 10Cloud-Services, 10Kubernetes: docker-engine pulled into our repositories only keeps the latest version - https://phabricator.wikimedia.org/T153416 (10akosiaris) 05Open>03Resolved a:03akosiaris Per T158583, our reprepro now supports multiple components. docker-engine is now moved to `thi... [19:19:50] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10ayounsi) We currently have the following relevant vlans: - private: 10/8 IPs, not reachable from cloud instances and the... [19:20:38] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10ayounsi) a:05ayounsi>03Ottomata [19:22:06] (03PS1) 10Alexandros Kosiaris: docker::engine: Remove unused class [puppet] - 10https://gerrit.wikimedia.org/r/469066 [19:24:03] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10faidon) How many servers are we talking about both right now, as well as in the mid-term e.g. 
in the next year or two? H... [19:24:38] (03CR) 10jerkins-bot: [V: 04-1] docker::engine: Remove unused class [puppet] - 10https://gerrit.wikimedia.org/r/469066 (owner: 10Alexandros Kosiaris) [19:31:50] 10Operations, 10Kubernetes: Evaluate (and potentially implement) upgrade of docker-engine to docker-engine 17+ - https://phabricator.wikimedia.org/T207693 (10akosiaris) [19:35:26] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10Ottomata) Right now, 8: 3 ganeti instances and 5 bare metal worker nodes. We wouldn't be adding more nodes in within thi... [19:36:35] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10Ottomata) Where are the labsdb hosts going to live if they are being moved out of the labs-support VLAN? Likely these cl... [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181022T2000). [20:00:11] I have deploy for ores [20:02:10] (03PS2) 10Alexandros Kosiaris: docker::engine: Remove unused class [puppet] - 10https://gerrit.wikimedia.org/r/469066 [20:02:43] 10Operations, 10Kubernetes: Evaluate (and potentially implement) upgrade of docker-engine to docker-ce 17+ for production (kubernetes) - https://phabricator.wikimedia.org/T207693 (10akosiaris) [20:02:51] 10Operations, 10Kubernetes: Evaluate (and potentially implement) upgrade of docker-engine to docker-ce 17+ for production (kubernetes) - https://phabricator.wikimedia.org/T207693 (10akosiaris) p:05Triage>03Normal [20:06:48] (03PS1) 10Bstorm: sonofgridengine: Fix a typo, clean up a type we will never use and more. 
[puppet] - 10https://gerrit.wikimedia.org/r/469076 (https://phabricator.wikimedia.org/T200557) [20:06:57] !log ladsgroup@deploy1001 Started deploy [ores/deploy@e89e880]: Use redis task tracker (T152012) [20:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:01] T152012: Silence or address E_WOULDBLOCK warning - https://phabricator.wikimedia.org/T152012 [20:07:48] (03PS3) 10Bstorm: Add rate limiting to toollabs::mailrelay with warn action [puppet] - 10https://gerrit.wikimedia.org/r/379239 (https://phabricator.wikimedia.org/T175964) (owner: 10Herron) [20:13:30] (03CR) 10Bstorm: [C: 032] sonofgridengine: Fix a typo, clean up a type we will never use and more. [puppet] - 10https://gerrit.wikimedia.org/r/469076 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [20:16:15] PROBLEM - HHVM rendering on mw2216 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:17:15] RECOVERY - HHVM rendering on mw2216 is OK: HTTP OK: HTTP/1.1 200 OK - 79837 bytes in 0.312 second response time [20:22:51] (03PS1) 10Ladsgroup: Revert "Revert back wikidata for change_tag backend" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469115 [20:24:35] (03PS1) 10Dzahn: icinga: nsca_user,nsca_group with variables to unbreak on stretch [puppet] - 10https://gerrit.wikimedia.org/r/469117 (https://phabricator.wikimedia.org/T202782) [20:24:55] (03PS2) 10Ladsgroup: Revert "Revert back wikidata for change_tag backend" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469115 [20:25:18] (03PS2) 10Dzahn: icinga: avoid hardcoded nsca_user,nsca_group for NSCA service [puppet] - 10https://gerrit.wikimedia.org/r/469117 (https://phabricator.wikimedia.org/T202782) [20:26:24] 10Operations, 10Cloud-Services, 10netops: Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10faidon) It's sad to hear that's a major disruption :( Would it make sense to do this now when it's early in the migration and 
relatively few projects have migrated... [20:28:59] !log ladsgroup@deploy1001 Finished deploy [ores/deploy@e89e880]: Use redis task tracker (T152012) (duration: 22m 02s) [20:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:02] T152012: Silence or address E_WOULDBLOCK warning - https://phabricator.wikimedia.org/T152012 [20:33:48] (03PS3) 10Dzahn: icinga: avoid hardcoded nsca_user,nsca_group for NSCA service [puppet] - 10https://gerrit.wikimedia.org/r/469117 (https://phabricator.wikimedia.org/T202782) [20:36:08] (03PS4) 10Dzahn: icinga: avoid hardcoded nsca_user,nsca_group for NSCA service [puppet] - 10https://gerrit.wikimedia.org/r/469117 (https://phabricator.wikimedia.org/T202782) [20:38:23] (03PS5) 10Dzahn: icinga: avoid hardcoded nsca_user,nsca_group for NSCA service [puppet] - 10https://gerrit.wikimedia.org/r/469117 (https://phabricator.wikimedia.org/T202782) [20:41:09] (03CR) 10Dzahn: [C: 031] "https://puppet-compiler.wmflabs.org/compiler1001/13145/" [puppet] - 10https://gerrit.wikimedia.org/r/469117 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [20:41:35] (03PS1) 10Bstorm: sonofgridengine: puppetdb known-hosts conflicts with grid custom version [puppet] - 10https://gerrit.wikimedia.org/r/469118 (https://phabricator.wikimedia.org/T200557) [20:43:36] (03Abandoned) 10Faidon Liambotis: Remove status.wikimedia.org monitoring check [puppet] - 10https://gerrit.wikimedia.org/r/446358 (https://phabricator.wikimedia.org/T199816) (owner: 10Faidon Liambotis) [20:43:43] (03Abandoned) 10Faidon Liambotis: Remove status.wikimedia.org A/AAAA [dns] - 10https://gerrit.wikimedia.org/r/446359 (https://phabricator.wikimedia.org/T199816) (owner: 10Faidon Liambotis) [20:44:13] (03CR) 10Cwhite: [C: 031] icinga: avoid hardcoded nsca_user,nsca_group for NSCA service [puppet] - 10https://gerrit.wikimedia.org/r/469117 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [20:47:28] (03PS11) 10Thcipriani: Scap: scap_source 
correct gid [puppet] - 10https://gerrit.wikimedia.org/r/361796 [20:51:09] (03CR) 10Bstorm: [C: 032] sonofgridengine: puppetdb known-hosts conflicts with grid custom version [puppet] - 10https://gerrit.wikimedia.org/r/469118 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [20:51:59] (03CR) 10Dzahn: [C: 032] icinga: avoid hardcoded nsca_user,nsca_group for NSCA service [puppet] - 10https://gerrit.wikimedia.org/r/469117 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [20:52:07] (03PS6) 10Dzahn: icinga: avoid hardcoded nsca_user,nsca_group for NSCA service [puppet] - 10https://gerrit.wikimedia.org/r/469117 (https://phabricator.wikimedia.org/T202782) [20:52:37] !log ayounsi@deploy1001 Started deploy [librenms/librenms@737683a]: Upgreade LibreNMS to 1.44 - T207481 [20:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:48] !log ayounsi@deploy1001 Finished deploy [librenms/librenms@737683a]: Upgreade LibreNMS to 1.44 - T207481 (duration: 00m 10s) [20:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:35] PROBLEM - Hadoop NodeManager on analytics1068 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [20:55:04] (03PS9) 10Cwhite: hiera: diamond::remove on openstack control role [puppet] - 10https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454) [20:56:35] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[20:56:49] (03CR) 10Cwhite: "> This collector seems to be in active use on the "Phabricator"" [puppet] - 10https://gerrit.wikimedia.org/r/466988 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [20:58:10] (03PS2) 10Cwhite: hiera: remove diamond from labweb role [puppet] - 10https://gerrit.wikimedia.org/r/466908 (https://phabricator.wikimedia.org/T183454) [20:58:53] (03CR) 10jerkins-bot: [V: 04-1] hiera: remove diamond from labweb role [puppet] - 10https://gerrit.wikimedia.org/r/466908 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [21:00:01] (03PS3) 10Cwhite: hiera: remove diamond from labweb role [puppet] - 10https://gerrit.wikimedia.org/r/466908 (https://phabricator.wikimedia.org/T183454) [21:00:04] bawolff and Reedy: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181022T2100). [21:00:05] RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 284 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [21:01:34] (03CR) 10GTirloni: [C: 032] hiera: diamond::remove on openstack control role [puppet] - 10https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [21:02:14] (03PS1) 10Jdlrobson: Disable page issues A/B test on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469121 [21:02:33] (03CR) 10GTirloni: [C: 032] "Gerrit is not happy. I don't know why it refuses to merge while locally I can merge into production just fine." 
[puppet] - 10https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [21:04:12] (03CR) 10Bstorm: "> Patch Set 9:" [puppet] - 10https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [21:06:30] (03PS1) 10Herron: confluent::kafka::common support thirdparty/confluent on jessie [puppet] - 10https://gerrit.wikimedia.org/r/469122 (https://phabricator.wikimedia.org/T206454) [21:06:32] (03PS1) 10Herron: aptrepo: add thirdparty/confluent component for jessie [puppet] - 10https://gerrit.wikimedia.org/r/469123 (https://phabricator.wikimedia.org/T206454) [21:06:34] (03PS1) 10Herron: logstash: set logging kafka package version to 1.1.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/469124 (https://phabricator.wikimedia.org/T206454) [21:08:17] !log rebooting cloudvirt1023 [21:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:03] !log ayounsi@deploy1001 Started deploy [librenms/librenms@0fd8da6]: Revert LibreNMS upgrade - T207481 [21:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:11] !log ayounsi@deploy1001 Finished deploy [librenms/librenms@0fd8da6]: Revert LibreNMS upgrade - T207481 (duration: 00m 08s) [21:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:25] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational [21:16:18] (03CR) 10Dzahn: "yea, it's the "Submitted together (2)" part where one depends on the other. 
you can either merge them in the same order as uploaded or bre" [puppet] - 10https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [21:19:35] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:20:23] (03CR) 10Pmiazga: [C: 04-1] Disable page issues A/B test on beta cluster (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469121 (owner: 10Jdlrobson) [21:20:36] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:21:46] PROBLEM - DPKG on cloudvirt1022 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:24:18] (03PS2) 10Andrew Bogott: Horizon: Move 'integration' and 'shinken' projects to eqiad1. [puppet] - 10https://gerrit.wikimedia.org/r/469048 [21:24:20] (03PS1) 10Andrew Bogott: nova: remove 'ferm' from compute nodes [puppet] - 10https://gerrit.wikimedia.org/r/469126 [21:28:04] (03CR) 10Andrew Bogott: [C: 032] Horizon: Move 'integration' and 'shinken' projects to eqiad1. 
[puppet] - 10https://gerrit.wikimedia.org/r/469048 (owner: 10Andrew Bogott) [21:28:27] RECOVERY - DPKG on cloudvirt1022 is OK: All packages OK [21:31:05] (03CR) 10Andrew Bogott: [C: 032] nova: remove 'ferm' from compute nodes [puppet] - 10https://gerrit.wikimedia.org/r/469126 (owner: 10Andrew Bogott) [21:32:56] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration, 10Patch-For-Review: Add punjabi.wikimedia.org to DNS and Apache - https://phabricator.wikimedia.org/T207583 (10Krenair) It's the Sat, Oct 20, 16:58 UTC comment in the parent task [21:34:18] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration, 10Patch-For-Review: Add punjabi.wikimedia.org to DNS and Apache - https://phabricator.wikimedia.org/T207583 (10Dzahn) [21:34:37] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration, 10Patch-For-Review: Add punjabi.wikimedia.org to DNS and Apache - https://phabricator.wikimedia.org/T207583 (10Dzahn) [21:34:54] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration, 10Patch-For-Review: Add punjabi.wikimedia.org to DNS and Apache - https://phabricator.wikimedia.org/T207583 (10Dzahn) p:05Triage>03Normal [21:35:39] (03CR) 10Dzahn: [C: 032] Add punjabiwikimedia [dns] - 10https://gerrit.wikimedia.org/r/468812 (https://phabricator.wikimedia.org/T207583) (owner: 10Urbanecm) [21:37:49] (03CR) 10Dzahn: [C: 032] "edited ticket to add the discussion on the name and why no ISO lang code was used. also per "@satdeep_gill approved use of verbatim punja" [dns] - 10https://gerrit.wikimedia.org/r/468812 (https://phabricator.wikimedia.org/T207583) (owner: 10Urbanecm) [21:38:36] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration, 10Patch-For-Review: Add punjabi.wikimedia.org to DNS and Apache - https://phabricator.wikimedia.org/T207583 (10Dzahn) thanks! 
edited ticket description, added comment on Gerrit, deployed, added to DNS [21:39:40] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration, 10Patch-For-Review: Add punjabi.wikimedia.org to DNS and Apache - https://phabricator.wikimedia.org/T207583 (10Dzahn) a:03Dzahn [21:40:35] (03PS4) 10Dzahn: Add punjabi.wikimedia.org to Apache [puppet] - 10https://gerrit.wikimedia.org/r/468814 (https://phabricator.wikimedia.org/T207583) (owner: 10Urbanecm) [21:41:43] (03CR) 10Dzahn: [C: 032] "@satdeep_gill approved use of verbatim punjabi.wikimedia.org and https://phabricator.wikimedia.org/T204477#4682763" [puppet] - 10https://gerrit.wikimedia.org/r/468814 (https://phabricator.wikimedia.org/T207583) (owner: 10Urbanecm) [21:44:26] (03PS2) 10Pmiazga: beta: Disable page issues A/B test on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469121 (owner: 10Jdlrobson) [21:44:32] !log adding new prod ServerAlias punjabi.wikimedia.org to Apache cluster (T207583) [21:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:36] T207583: Add punjabi.wikimedia.org to DNS and Apache - https://phabricator.wikimedia.org/T207583 [21:45:00] hey, I need to swat the beta cluster change, is anyone working on anything right now? [21:45:14] jouncebot: now [21:45:14] For the next 1 hour(s) and 14 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181022T2100) [21:45:30] raynor: probably not, I don't think security is deploying anything today [21:45:33] well, i just touched the apache config [21:45:36] but not really [21:45:59] I didn't know we have that "now" command. it's pretty neath, thx. 
[21:46:05] puppet will run and add a new ServerAlias is all [21:46:36] I'm going to merge beta config change, I just need to pull the config on deploy and sync initializeSettings-labs file, it's no-op [21:47:01] btw, legoktm mutante - can you explain me when to use `-` in the InitializeSettings-labs.php file? [21:47:26] more context? [21:47:26] Im not [21:47:39] deploying anything that is [21:47:50] I know that `-` works with arrays, but honestly, every config is array (first level is the wiki definition, ex: `default`, `enwiki`, `wikipedia`, etc) [21:48:02] legoktm, maybe different way, let's use some task as an example [21:48:07] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/469121/ [21:48:23] we need to override the `wgMinervaABSamplingRate` from InitializeSettings.php [21:48:42] the production config has settings for default, lvwiki, jawiki, and couple more [21:48:51] # - Prefix a setting key with '-' to override all values from [21:48:51] # production InitialiseSettings.php. [21:48:59] at the top of InitialiseSettings-labs.php [21:49:07] oh, I forgot about that hack [21:49:40] Krenair - yes, exactly, so if I do that then it will remove the 'lvwiki, jawiki` and other for wgMinervaABSamplingRate, right? [21:49:58] looks like it [21:50:02] and if I do just `wgMinervaABSamplingRate` -> then it will keep the lvwiki,jawiki and other [21:50:03] look at how it's implemented [21:50:51] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/wmf-config/InitialiseSettings-labs.php#61 [21:51:09] normally stuff in there gets merged with the prod var of the same name [21:51:33] ok, so I was right. 
Thx Krenair [21:51:37] if you stick a minus at the beginning, the prod version gets overwritten completely [21:51:52] I just wanted to double check that, we have so much magic that it's better to check with someone who knows that magic :) [21:52:34] (03CR) 10Pmiazga: [C: 032] beta: Disable page issues A/B test on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469121 (owner: 10Jdlrobson) [21:52:46] I'm sure there's some hidden gotchas which I've yet to encounter [21:52:57] for which there is a whole extra layer of magic to resolve [21:53:08] (there's always a whole extra layer of magic) [21:53:39] (03Merged) 10jenkins-bot: beta: Disable page issues A/B test on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469121 (owner: 10Jdlrobson) [21:54:27] Lol. N+1 layers of magic. Sounds right [21:56:15] PROBLEM - High lag on wdqs1003 is CRITICAL: 3652 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [22:01:43] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Move labmon (Graphite, StatsD) into a Cloud VPS - https://phabricator.wikimedia.org/T207543 (10faidon) +1 to this task. The rationale it the same as the parent tasks and is not new: anything that crosses the VPS instance -> production barrier is uns... [22:01:51] will take a closer look later today.. < +1! [22:01:59] "Lol. N+1 layers of magic. 
Sounds right" < +1 [22:02:04] copy/paste fail ;-) [22:02:42] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom promethium/WMF3571 - https://phabricator.wikimedia.org/T191362 (10Krenair) (also, the `labs_metal` hiera var in hieradata/common.yaml could be emptied and the metaldns/metal_resolver stuff jettisoned from puppet) [22:04:00] (03CR) 10jenkins-bot: beta: Disable page issues A/B test on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469121 (owner: 10Jdlrobson) [22:04:39] (03CR) 10Alexandros Kosiaris: [C: 031] "PCC says it's noop in production https://puppet-compiler.wmflabs.org/compiler1002/13143/" [puppet] - 10https://gerrit.wikimedia.org/r/469066 (owner: 10Alexandros Kosiaris) [22:08:22] 10Operations, 10Cloud-VPS, 10Patch-For-Review: Move labs-recursors in WMCS - https://phabricator.wikimedia.org/T207533 (10faidon) >>! In T207533#4683780, @Andrew wrote: > My only concern about this is that those recursors are used about every second on every VM, so they're a huge, vital point of failure and... [22:08:40] Krenair, oner more question, if I'm syncing the InitializeSettings-labs.php on prod, what do I put in the scap sync message, standard: `'SWAT: [[gerrit:[GERRIT-NUMBER]|[COMMIT-MESSAGE] ([PHABRICATOR-TASK])]]'` or sth else? [22:09:24] I though about adding `BETA: [...]` but maybe there is some convention that I cannot find [22:09:27] raynor, sounds decent. 
could call out that it's labs-only [22:12:09] !log pmiazga@deploy1001 Synchronized wmf-config//InitialiseSettings-labs.php: SWAT: [[gerrit:469121|beta: Disable page issues A/B test on beta cluster only (T200792)]] (duration: 00m 46s) [22:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:13] T200792: Run A/B test on page issues (Farsi, Japanese, Russian, English) - https://phabricator.wikimedia.org/T200792 [22:12:45] PROBLEM - very high load average likely xfs on ms-be2017 is CRITICAL: CRITICAL - load average: 186.01, 111.80, 55.69 [22:13:36] PROBLEM - MD RAID on ms-be2017 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 [22:13:37] ACKNOWLEDGEMENT - MD RAID on ms-be2017 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T207713 [22:13:42] 10Operations, 10ops-codfw: Degraded RAID on ms-be2017 - https://phabricator.wikimedia.org/T207713 (10ops-monitoring-bot) [22:15:56] PROBLEM - Host ms-be2017 is DOWN: PING CRITICAL - Packet loss = 100% [22:26:38] (03PS1) 10Bstorm: gridengine: webgrid exec nodes should use the jobkill script [puppet] - 10https://gerrit.wikimedia.org/r/469129 (https://phabricator.wikimedia.org/T153281) [22:28:15] (03CR) 10Bstorm: [C: 032] gridengine: webgrid exec nodes should use the jobkill script [puppet] - 10https://gerrit.wikimedia.org/r/469129 (https://phabricator.wikimedia.org/T153281) (owner: 10Bstorm) [22:45:16] (03PS1) 10Dzahn: icinga/nsca: use systemd::service, unit file by systemd-sysv-generator [puppet] - 10https://gerrit.wikimedia.org/r/469130 (https://phabricator.wikimedia.org/T202782) [22:46:12] (03CR) 10jerkins-bot: [V: 04-1] icinga/nsca: use systemd::service, unit file by systemd-sysv-generator [puppet] - 10https://gerrit.wikimedia.org/r/469130 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [22:48:03] (03PS2) 10Dzahn: icinga/nsca: use 
systemd::service, unit file by systemd-sysv-generator [puppet] - 10https://gerrit.wikimedia.org/r/469130 (https://phabricator.wikimedia.org/T202782) [22:52:14] (03CR) 10Dzahn: [C: 031] "https://puppet-compiler.wmflabs.org/compiler1002/13148/" [puppet] - 10https://gerrit.wikimedia.org/r/469130 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181022T2300). [23:00:04] James_F and Amir1: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:10] o/ [23:00:27] Oh, oops. Ignore mine. [23:02:40] I guess I can do SWAY [23:02:44] *SWAT [23:04:36] RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 191 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [23:16:37] (03CR) 10Dzahn: icinga: add puppet types for parameters (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/468468 (owner: 10Dzahn) [23:16:45] (03PS2) 10Dzahn: icinga: add puppet types for parameters [puppet] - 10https://gerrit.wikimedia.org/r/468468 [23:17:38] (03CR) 10jerkins-bot: [V: 04-1] icinga: add puppet types for parameters [puppet] - 10https://gerrit.wikimedia.org/r/468468 (owner: 10Dzahn) [23:20:29] (03PS3) 10Dzahn: icinga: add puppet types for parameters [puppet] - 10https://gerrit.wikimedia.org/r/468468 [23:21:14] (03CR) 10jerkins-bot: [V: 04-1] icinga: add puppet types for parameters [puppet] - 10https://gerrit.wikimedia.org/r/468468 (owner: 10Dzahn) [23:21:37] (03PS4) 10Dzahn: icinga: add puppet types for parameters [puppet] - 10https://gerrit.wikimedia.org/r/468468 [23:23:49] (03PS3) 10Dzahn: Switch 
aptrepo::rsync to auto_ferm [puppet] - 10https://gerrit.wikimedia.org/r/467982 (owner: 10Muehlenhoff) [23:25:40] (03CR) 10Dzahn: [C: 032] Switch aptrepo::rsync to auto_ferm [puppet] - 10https://gerrit.wikimedia.org/r/467982 (owner: 10Muehlenhoff) [23:29:02] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@d4692ea]: Redeploy Updater for T207673 [23:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:05] T207673: Editing a gloss creates a second triple for the same language in the query service - https://phabricator.wikimedia.org/T207673 [23:30:56] PROBLEM - puppet last run on install2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:32:33] I can't install anything on ores-web-03.eqiad.wmflabs, puppet agent fails for the same reason saying it can't find even obvious packages. I guess ^ is related [23:33:02] Amir1 one of the debian mirrors is failing? [23:33:05] i think [23:33:33] probably [23:33:45] Err:11 http://cdn-fastly.deb.debian.org/debian stretch Release [23:33:46] 404 Not Found [IP: 151.101.200.204 80] [23:34:47] no, install2002 is unrelated to that [23:34:52] that's known by the Debian folks fwiw [23:35:02] install2002 is what i merged [23:35:07] i am already debugging it and then got a call [23:35:07] it's failing [23:35:12] mutante: ok [23:35:14] thanks! [23:35:54] the 404 error above is unrelated indeed [23:39:13] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@d4692ea]: Redeploy Updater for T207673 (duration: 10m 12s) [23:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:17] T207673: Editing a gloss creates a second triple for the same language in the query service - https://phabricator.wikimedia.org/T207673 [23:39:45] ACKNOWLEDGEMENT - puppet last run on install2002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues daniel_zahn switch to rsync auto ferm - puppet [23:47:43] Come on jenkins [23:48:34] akosiaris: mutante I'm deploying two patches and I'm monitoring everything, but if anything starts to go down in the next five hours (replication/read-only/performance), feel free to revert both of them [23:48:52] (03PS1) 10Dzahn: rsync::module:: add missing parameter auto_ferm_ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/469140 [23:49:29] (03CR) 10jerkins-bot: [V: 04-1] rsync::module:: add missing parameter auto_ferm_ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/469140 (owner: 10Dzahn) [23:50:44] (03PS2) 10Dzahn: rsync::module:: add missing parameter auto_ferm_ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/469140 [23:52:12] greg-g: hey, around? [23:57:18] (03CR) 10Dzahn: "the description of the parameter was added, but not the actual parameter (for Ipv6)" [puppet] - 10https://gerrit.wikimedia.org/r/465378 (owner: 10Muehlenhoff) [23:57:25] Amir1: kinda, what's up? [23:59:03] !log ladsgroup@deploy1001 Synchronized php-1.32.0-wmf.26/includes/changetags/ChangeTags.php: SWAT: [[gerrit:469114|Fix bad join on ChangeTag subquery (T207313)]] (duration: 00m 47s) [23:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:06] T207313: Some administrative and log actions on Wikidata take longer than 60 seconds and time out - https://phabricator.wikimedia.org/T207313 [23:59:10] greg-g: I'm deploying two changes that improve the change_tag performance stuff (fixing a high priority bug) and I will be monitoring everything [23:59:15] (03CR) 10Dzahn: [C: 032] rsync::module:: add missing parameter auto_ferm_ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/469140 (owner: 10Dzahn) [23:59:30] but if things start to go bad in several hours, keep that in mind and revert them [23:59:36] ah, I see, I can't guarantee where I'll be in the next 30 minutes
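
Editor's note on the `-` prefix discussed between 21:47 and 21:51: the behaviour described there (labs values are normally merged with the production setting of the same name, while a leading `-` makes the labs value replace the production one completely) can be sketched roughly as below. This is only an illustration in Python, not the actual PHP in wmf-config. The function name `merge_labs_settings` is hypothetical, and the sampling-rate numbers are placeholders rather than the real production values.

```python
def merge_labs_settings(prod, labs):
    """Combine production and labs settings as described in the chat.

    Hypothetical helper for illustration; the real logic lives in the
    PHP configuration and may differ in its details.
    """
    # Copy the per-wiki dicts so the production input is left untouched.
    merged = {name: dict(per_wiki) for name, per_wiki in prod.items()}
    for key, per_wiki in labs.items():
        if key.startswith("-"):
            # '-' prefix: discard every production value for this setting.
            merged[key[1:]] = dict(per_wiki)
        else:
            # No prefix: labs values are layered over the production ones.
            merged.setdefault(key, {}).update(per_wiki)
    return merged


# Placeholder values, not the real production configuration.
prod = {"wgMinervaABSamplingRate": {"default": 0, "lvwiki": 0.2, "jawiki": 0.2}}

# Without the prefix, the lvwiki and jawiki entries survive the merge.
kept = merge_labs_settings(prod, {"wgMinervaABSamplingRate": {"default": 0}})

# With the prefix, only the labs entries remain.
replaced = merge_labs_settings(prod, {"-wgMinervaABSamplingRate": {"default": 0}})
```

In this sketch `kept` still carries the per-wiki entries, while `replaced` holds only the `default` entry, which matches the outcome raynor and Krenair agreed on for the beta cluster change.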