[00:00:05] twentyafterfour: I, the Bot under the Fountain, allow thee, The Deployer, to do Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181018T0000).
[00:00:16] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[00:01:16] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy
[00:32:47] !log restarting apache on phab1001 to apply b3bfff1138d1212b318392b5e18ac0bfd6f78108
[00:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:45:16] PROBLEM - Backup of s8 in eqiad on db1115 is CRITICAL: Backup for s8 at eqiad taken more than 8 days ago: Most recent backup 2018-10-10 01:36:37
[01:47:11] Operations, Traffic, Performance-Team (Radar): Determine cause of upload.wikimedia.org requests routed to text-lb (404 Not Found) - https://phabricator.wikimedia.org/T207340 (BBlack) Yeah I think @Bawolff's explanation seems plausible. If they's a DNS hijacking "transparent" proxy which returns the...
[02:25:20] (PS6) Mathew.onipe: profile::maps::osm_master: change osmupdater and osmimporter auth method to peer [puppet] - https://gerrit.wikimedia.org/r/466574 (https://phabricator.wikimedia.org/T206639)
[02:41:34] (PS7) Mathew.onipe: profile::maps::osm_master: change osmupdater and osmimporter auth method to peer [puppet] - https://gerrit.wikimedia.org/r/466574 (https://phabricator.wikimedia.org/T206639)
[03:07:07] RECOVERY - Check systemd state on kafkamon1001 is OK: OK - running: The system is fully operational
[03:07:44] Operations, Fundraising-Backlog, SRE-Access-Requests, Wikimedia-Fundraising, fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (cwdent) [frack::puppet::private] 35f24cf add jkim user
[03:08:02] Operations, Fundraising-Backlog, SRE-Access-Requests, Wikimedia-Fundraising, fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (cwdent) [frack::puppet] b856bf6 add jkim user
[03:10:36] PROBLEM - Check systemd state on kafkamon1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:32:57] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 878.28 seconds
[03:33:06] Operations, Fundraising-Backlog, SRE-Access-Requests, Wikimedia-Fundraising, fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (cwdent) @jkim_wikimedia ok, you are all set up with a shell account on frdev1001 and mysql access. In order to log...
[03:51:57] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 280.61 seconds
[04:49:15] (PS2) MGChecker: Allow creation of TemplateStyles in Module namspace [mediawiki-config] - https://gerrit.wikimedia.org/r/467123 (https://phabricator.wikimedia.org/T200914)
[05:00:19] (CR) Legoktm: "Per T200914#4514540, since this change is suppose to be meant to be done in general, it really should go into the Scribunto extension, sim" [mediawiki-config] - https://gerrit.wikimedia.org/r/467123 (https://phabricator.wikimedia.org/T200914) (owner: MGChecker)
[05:05:03] !log start office-DC link renumbering - T205985
[05:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:05:07] T205985: Renumber office-DC interconnect link - https://phabricator.wikimedia.org/T205985
[05:25:37] RECOVERY - MariaDB Slave SQL: s8 on db1124 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[05:26:18] mmmm banyek|away did you touch that ^
[05:26:46] No.
[05:26:48] PROBLEM - puppet last run on labvirt1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/bash_autologout.sh]
[05:26:56] Maybe it was jaime even if he is not on irc
[05:27:12] Probably
[05:34:14] !log Restarting a failed s8 backup from dbstore1001 to db1116:3318
[05:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:37:47] (PS2) Giuseppe Lavagetto: role::deployment_server: use profile::mediawiki::common [puppet] - https://gerrit.wikimedia.org/r/467987
[05:44:51] ACKNOWLEDGEMENT - Backup of s8 in eqiad on db1115 is CRITICAL: Backup for s8 at eqiad taken more than 8 days ago: Most recent backup 2018-10-10 01:36:37 Marostegui s8 backup being generated manually - The acknowledgement expires at: 2018-10-19 13:44:31.
[05:44:56] PROBLEM - MariaDB Slave SQL: s8 on db1124 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table wikidatawiki.page_restrictions: Cant find record in page_restrictions, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1087-bin.003074, end_log_pos 228324962
[05:50:05] Operations, netops, Patch-For-Review: Renumber office-DC interconnect link - https://phabricator.wikimedia.org/T205985 (ayounsi) Open>Resolved the re-numbering went as expected, BGP sessions are back up. The failover tests were not done, as the exact links needs to be properly identified on t...
[05:52:18] RECOVERY - puppet last run on labvirt1007 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[06:21:50] (CR) Marostegui: [C: -1] "Check comments inline." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/467959 (https://phabricator.wikimedia.org/T206992) (owner: Banyek)
[06:26:27] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:26:46] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:28:27] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled]
[06:28:46] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:28:57] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:31:35] (PS1) Elukey: role::an_cluster::hadoop::ui: add hive settings to hiera [puppet] - https://gerrit.wikimedia.org/r/468211
[06:32:53] (PS1) Elukey: Revert "Revert "role::prometheus::ops: collect memcached stats from thumbor/swift"" [puppet] - https://gerrit.wikimedia.org/r/468218
[06:32:56] (CR) Elukey: [C: 2] role::an_cluster::hadoop::ui: add hive settings to hiera [puppet] - https://gerrit.wikimedia.org/r/468211 (owner: Elukey)
[06:33:06] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:33:26] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:35:26] (PS2) Elukey: Revert "Revert "role::prometheus::ops: collect memcached stats from thumbor/swift"" [puppet] - https://gerrit.wikimedia.org/r/468218
[06:36:26] RECOVERY - puppet last run on analytics-tool1001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[06:37:14] (CR) Elukey: [C: 2] Revert "Revert "role::prometheus::ops: collect memcached stats from thumbor/swift"" [puppet] - https://gerrit.wikimedia.org/r/468218 (owner: Elukey)
[06:43:16] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:43:27] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:46:17] RECOVERY - MariaDB Slave SQL: s8 on db1124 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:47:04] (PS1) Marostegui: install_server: Allow install the new pc hosts [puppet] - https://gerrit.wikimedia.org/r/468221 (https://phabricator.wikimedia.org/T207258)
[06:53:52] (PS1) Muehlenhoff: Only use service_auto_restart for ircecho on Debian [puppet] - https://gerrit.wikimedia.org/r/468222
[06:55:41] (CR) Muehlenhoff: [C: -2] "That won't work, service_auto_restart relies on systemd which is not available on trusty. Instead I've pushed a patch to fix ircecho as ht" [puppet] - https://gerrit.wikimedia.org/r/468179 (owner: Andrew Bogott)
[06:56:49] (CR) Muehlenhoff: [C: 2] Only use service_auto_restart for ircecho on Debian [puppet] - https://gerrit.wikimedia.org/r/468222 (owner: Muehlenhoff)
[06:58:57] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:13:53] (PS1) Urbanecm: Upload uz specific wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/468225 (https://phabricator.wikimedia.org/T205226)
[07:13:56] (PS1) Urbanecm: Use new wordmarks in uzwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/468226 (https://phabricator.wikimedia.org/T205226)
[07:15:56] (CR) jerkins-bot: [V: -1] Use new wordmarks in uzwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/468226 (https://phabricator.wikimedia.org/T205226) (owner: Urbanecm)
[07:18:34] (CR) Banyek: [C: 1] "pc1007" [puppet] - https://gerrit.wikimedia.org/r/468221 (https://phabricator.wikimedia.org/T207258) (owner: Marostegui)
[07:19:17] (PS2) Marostegui: install_server: Allow install the new pc hosts [puppet] - https://gerrit.wikimedia.org/r/468221 (https://phabricator.wikimedia.org/T207258)
[07:20:45] (CR) Marostegui: [C: 2] install_server: Allow install the new pc hosts [puppet] - https://gerrit.wikimedia.org/r/468221 (https://phabricator.wikimedia.org/T207258) (owner: Marostegui)
[07:23:17] PROBLEM - DPKG on notebook1003 is CRITICAL: Return code of 255 is out of bounds
[07:23:37] PROBLEM - dhclient process on notebook1003 is CRITICAL: Return code of 255 is out of bounds
[07:23:56] PROBLEM - Check systemd state on notebook1003 is CRITICAL: Return code of 255 is out of bounds
[07:23:57] PROBLEM - MD RAID on notebook1003 is CRITICAL: Return code of 255 is out of bounds
[07:24:16] PROBLEM - configured eth on notebook1003 is CRITICAL: Return code of 255 is out of bounds
[07:24:37] PROBLEM - puppet last run on notebook1003 is CRITICAL: Return code of 255 is out of bounds
[07:25:26] PROBLEM - Long running screen/tmux on notebook1003 is CRITICAL: Return code of 255 is out of bounds
[07:26:17] hello notebook
[07:26:52] oom-killer came in
[07:26:59] python3 process killed, oom party
[07:27:24] some spark job went bad
[07:27:26] RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[07:27:37] RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up
[07:27:56] RECOVERY - DPKG on notebook1003 is OK: All packages OK
[07:28:16] RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient
[07:28:27] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational
[07:29:46] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 13 minutes ago with 0 failures
[07:36:00] (CR) Giuseppe Lavagetto: [C: 1] "compiler diffs here:" [puppet] - https://gerrit.wikimedia.org/r/467987 (owner: Giuseppe Lavagetto)
[07:38:31] (PS1) Addshore: Wikidata: Reduce dispatcher count to 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/468245
[07:39:11] * addshore is going to reduce the number of running wikidata dispatchers
[07:39:24] (CR) Elukey: [C: 1] "Not an expert in this part of the puppet code but the change seems sound!" [puppet] - https://gerrit.wikimedia.org/r/467987 (owner: Giuseppe Lavagetto)
[07:39:59] jouncebot: now
[07:39:59] No deployments scheduled for the next 2 hour(s) and 20 minute(s)
[07:40:13] (CR) Addshore: [C: 2] Wikidata: Reduce dispatcher count to 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/468245 (owner: Addshore)
[07:40:57] PROBLEM - statsd UDP receive errors are elevated on graphite1001 is CRITICAL: 2.137 ge 2 https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&refresh=1m&panelId=16&fullscreen
[07:41:30] (Merged) jenkins-bot: Wikidata: Reduce dispatcher count to 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/468245 (owner: Addshore)
[07:43:21] !log addshore@deploy1001 Synchronized wmf-config/Wikibase.php: Wikidata dispatch: reduce concurrent dispatchers to 2 (duration: 00m 59s)
[07:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:03] (PS4) Banyek: mariadb: enable replication check on Parsercache hosts [puppet] - https://gerrit.wikimedia.org/r/467959 (https://phabricator.wikimedia.org/T206992)
[07:45:26] PROBLEM - statsd UDP receive errors are elevated on graphite1001 is CRITICAL: 2.04 ge 2 https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&refresh=1m&panelId=16&fullscreen
[07:46:08] taking a look ^
[07:47:00] (CR) Banyek: "https://puppet-compiler.wmflabs.org/compiler1002/13023/" [puppet] - https://gerrit.wikimedia.org/r/467959 (https://phabricator.wikimedia.org/T206992) (owner: Banyek)
[07:47:11] Operations, Traffic: Document eqsin power connections in Netbox - https://phabricator.wikimedia.org/T207138 (ayounsi) All the power connections have been imported, note that some data can't be imported such as cable length and cable ID. See https://netbox.wikimedia.org/dcim/power-connections/?site=eqsin
[07:48:02] (CR) Marostegui: [C: -1] "As I mentioned before, also include other roles on the puppet compiler run" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/467959 (https://phabricator.wikimedia.org/T206992) (owner: Banyek)
[07:49:47] PROBLEM - HHVM jobrunner on mw1303 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[07:49:53] only ores left sending udp traffic to graphite1001 and https://gerrit.wikimedia.org/r/#/c/468182/ will fix that
[07:50:56] RECOVERY - HHVM jobrunner on mw1303 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time
[07:51:29] <_joe_> godog: well done!
[07:51:53] (CR) Banyek: mariadb: enable replication check on Parsercache hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/467959 (https://phabricator.wikimedia.org/T206992) (owner: Banyek)
[07:52:24] (CR) jenkins-bot: Wikidata: Reduce dispatcher count to 2 [mediawiki-config] - https://gerrit.wikimedia.org/r/468245 (owner: Addshore)
[07:52:37] _joe_: indeed, some software on T88997 either didn't need restarting or we're not using it anymore, which is great
[07:52:37] T88997: Improve graphite failover - https://phabricator.wikimedia.org/T88997
[07:55:34] (Abandoned) Giuseppe Lavagetto: Revert "mediawiki::web::prod_sites: convert wikipedia.org" [puppet] - https://gerrit.wikimedia.org/r/467918 (owner: Giuseppe Lavagetto)
[07:56:15] (PS1) Elukey: profile::analytics::refinery::job::sqoop_mw: move to timers [puppet] - https://gerrit.wikimedia.org/r/468251 (https://phabricator.wikimedia.org/T172532)
[07:56:20] (CR) Banyek: "https://puppet-compiler.wmflabs.org/compiler1002/13024/" [puppet] - https://gerrit.wikimedia.org/r/467959 (https://phabricator.wikimedia.org/T206992) (owner: Banyek)
[07:56:21] !log rebooting swift backend servers in codfw for spectre v3/v4/L1TF security updates
[07:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:40] Operations, Graphite, Patch-For-Review, Performance-Team (Radar), Zuul: Improve graphite failover - https://phabricator.wikimedia.org/T88997 (fgiunchedi)
[07:59:55] (PS2) Elukey: profile::analytics::refinery::job::sqoop_mw: move to timers [puppet] - https://gerrit.wikimedia.org/r/468251 (https://phabricator.wikimedia.org/T172532)
[07:59:59] (PS3) Giuseppe Lavagetto: role::deployment_server: use profile::mediawiki::common [puppet] - https://gerrit.wikimedia.org/r/467987
[08:00:01] (PS1) Giuseppe Lavagetto: profile::openstack::base::wikitech::web: require profile::mediawiki::common [puppet] - https://gerrit.wikimedia.org/r/468252
[08:00:03] (PS1) Giuseppe Lavagetto: mediawiki: remove base class, superseded by profile::mediawiki::common [puppet] - https://gerrit.wikimedia.org/r/468253
[08:02:16] RECOVERY - statsd UDP receive errors are elevated on graphite1001 is OK: (C)2 ge (W)1 ge 0.6611 https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&refresh=1m&panelId=16&fullscreen
[08:03:23] (CR) Muehlenhoff: [C: 1] role::deployment_server: use profile::mediawiki::common [puppet] - https://gerrit.wikimedia.org/r/467987 (owner: Giuseppe Lavagetto)
[08:03:57] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/13027/" [puppet] - https://gerrit.wikimedia.org/r/468251 (https://phabricator.wikimedia.org/T172532) (owner: Elukey)
[08:04:34] (CR) Giuseppe Lavagetto: [C: 1] "https://puppet-compiler.wmflabs.org/compiler1002/13026/" [puppet] - https://gerrit.wikimedia.org/r/468252 (owner: Giuseppe Lavagetto)
[08:05:08] (CR) Banyek: "https://puppet-compiler.wmflabs.org/compiler1002/13028/" [puppet] - https://gerrit.wikimedia.org/r/467959 (https://phabricator.wikimedia.org/T206992) (owner: Banyek)
[08:05:15] (CR) Giuseppe Lavagetto: [C: 2] role::deployment_server: use profile::mediawiki::common [puppet] - https://gerrit.wikimedia.org/r/467987 (owner: Giuseppe Lavagetto)
[08:05:39] (PS1) Gehel: maps: restrict OSM sync check to maps1001 [puppet] - https://gerrit.wikimedia.org/r/468257 (https://phabricator.wikimedia.org/T205462)
[08:08:47] (PS5) Banyek: mariadb: enable replication check on Parsercache hosts [puppet] - https://gerrit.wikimedia.org/r/467959 (https://phabricator.wikimedia.org/T206992)
[08:09:43] (CR) Giuseppe Lavagetto: [C: 2] profile::openstack::base::wikitech::web: require profile::mediawiki::common [puppet] - https://gerrit.wikimedia.org/r/468252 (owner: Giuseppe Lavagetto)
[08:10:07] PROBLEM - statsd UDP receive errors are elevated on graphite1001 is CRITICAL: 5.473 ge 2 https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&refresh=1m&panelId=16&fullscreen
[08:10:55] Operations, Traffic: Deprecate pybal SSH health checks - https://phabricator.wikimedia.org/T111899 (MoritzMuehlenhoff) Are very ready to deprecate this now? We have disk health checks in place via Icinga for a while now which would warn us about faulty disks.
[08:11:31] <_joe_> moritzm: oh fair point
[08:11:35] <_joe_> I think we are, yes
[08:12:38] _joe_, would this be a good point to attempt to ditch the beta apache config and have it use the same puppet resources as prod?
[08:12:57] <_joe_> Krenair: almost there
[08:13:02] ok
[08:13:08] Operations, Wikimedia-Mailing-lists: all mailing lists should have descriptions - https://phabricator.wikimedia.org/T179568 (Psychoslave) Done for WLL. Thank you for pointing the need. :)
[08:13:14] <_joe_> not really ditching the config, but we can use mediawiki::web::vhost
[08:13:21] yeah
[08:14:06] <_joe_> it's more important that it's homogeneous than it is that the configs are as similar as possible
[08:14:14] would we move the relevant mediawiki::web::vhost stuff to sites.pp and have it take into account $domain_suffix ? or just duplicate the resource for now?
[08:14:35] <_joe_> I'd just duplicate it
[08:14:38] ok
[08:14:41] <_joe_> not everything is the same
[08:15:59] <_joe_> so my plan is roughly: convert beta to use mediawiki::web::vhost; add the ability to select php7 based on a cookie; enable php7 in beta
[08:16:22] <_joe_> I should be able to do this either today/tomorrow or early next week
[08:16:46] I mean
[08:17:01] I'm asking because I was thinking of supplying some patches for it
[08:17:06] but
[08:17:19] if you'd prefer to do it I'll leave it
[08:18:16] PROBLEM - HHVM jobrunner on mw1335 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[08:19:02] Operations, Graphite, Patch-For-Review, Performance-Team (Radar), Zuul: Improve graphite failover - https://phabricator.wikimedia.org/T88997 (fgiunchedi) >>! In T88997#3051083, @hashar wrote: > I thought `statsd.eqiad.wmnet` pointed to a service IP that would be moved from host to host but DN...
[08:19:04] (CR) Urbanecm: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/468226 (https://phabricator.wikimedia.org/T205226) (owner: Urbanecm)
[08:19:17] RECOVERY - HHVM jobrunner on mw1335 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.005 second response time
[08:20:16] PROBLEM - statsd UDP receive errors are elevated on graphite1001 is CRITICAL: 2.087 ge 2 https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&refresh=1m&panelId=16&fullscreen
[08:21:31] (CR) jerkins-bot: [V: -1] Use new wordmarks in uzwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/468226 (https://phabricator.wikimedia.org/T205226) (owner: Urbanecm)
[08:21:58] Operations, Traffic: Document eqsin power connections in Netbox - https://phabricator.wikimedia.org/T207138 (ayounsi) About the other links, the only less straightforward interfaces are the server's uplinks as their name can't be derived from the device/port table. I'll keep the task open until at least...
[08:22:27] PROBLEM - statsd UDP receive errors are elevated on graphite1001 is CRITICAL: 2.112 ge 2 https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&refresh=1m&panelId=16&fullscreen [08:22:28] (03PS2) 10Urbanecm: Use new wordmarks in uzwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468226 (https://phabricator.wikimedia.org/T205226) [08:22:29] <_joe_> Krenair: depends on the timing; I want to install php7 in beta asap :) [08:22:40] (03CR) 10DCausse: elasticsearch: pseudo cookbook for JVM upgrade (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [08:23:05] _joe_, okay, I'll leave it to you :) [08:25:22] (03PS2) 10Urbanecm: Upload uz specific wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468225 (https://phabricator.wikimedia.org/T205226) [08:25:32] (03PS3) 10Urbanecm: Use new wordmarks in uzwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468226 (https://phabricator.wikimedia.org/T205226) [08:25:54] (03PS3) 10Elukey: profile::analytics::refinery::job::sqoop_mw: move to timers [puppet] - 10https://gerrit.wikimedia.org/r/468251 (https://phabricator.wikimedia.org/T172532) [08:26:12] !log Uploaded certcentral 0.1-2 to apt.wikimedia.org (stretch) [08:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:06] RECOVERY - statsd UDP receive errors are elevated on graphite1001 is OK: (C)2 ge (W)1 ge 0.6537 https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&refresh=1m&panelId=16&fullscreen [08:29:05] (03PS4) 10Elukey: profile::analytics::refinery::job::sqoop_mw: move to timers [puppet] - 10https://gerrit.wikimedia.org/r/468251 (https://phabricator.wikimedia.org/T172532) [08:31:25] still unclear why udp errors would spike up on graphite1001 with less traffic, I'll keep an eye on it [08:34:39] (03CR) 10Banyek: "https://puppet-compiler.wmflabs.org/compiler1002/13033/" [puppet] - 
10https://gerrit.wikimedia.org/r/467959 (https://phabricator.wikimedia.org/T206992) (owner: 10Banyek) [08:35:05] (03CR) 10Banyek: "Plan for deploying the change:" [puppet] - 10https://gerrit.wikimedia.org/r/467959 (https://phabricator.wikimedia.org/T206992) (owner: 10Banyek) [08:35:46] (03CR) 10Banyek: "- disable puppet across pc hosts" [puppet] - 10https://gerrit.wikimedia.org/r/467959 (https://phabricator.wikimedia.org/T206992) (owner: 10Banyek) [08:36:05] (03CR) 10Marostegui: [C: 031] mariadb: enable replication check on Parsercache hosts [puppet] - 10https://gerrit.wikimedia.org/r/467959 (https://phabricator.wikimedia.org/T206992) (owner: 10Banyek) [08:37:52] 10Operations, 10Wikimedia-Logstash: Rationalize default logrotate "rotated" file extensions - https://phabricator.wikimedia.org/T207296 (10fgiunchedi) I'm +1 on `dateext` going forward, likely not worth going back and change all existing `logrotate` configs [08:38:24] (03CR) 10Mathew.onipe: [C: 031] maps: restrict OSM sync check to maps1001 [puppet] - 10https://gerrit.wikimedia.org/r/468257 (https://phabricator.wikimedia.org/T205462) (owner: 10Gehel) [08:40:12] !log adding replication monitoring checks to parsercache hosts (T206992) [08:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:16] T206992: Create replication icinga check for the Parsercache hosts - https://phabricator.wikimedia.org/T206992 [08:40:47] (03PS2) 10Muehlenhoff: Restrict ferm service package_builder_rsync to domain networks [puppet] - 10https://gerrit.wikimedia.org/r/467976 [08:41:03] !log disabling puppet on parser caches (T206992) [08:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:35] (03PS2) 10Gehel: maps: restrict OSM sync check to maps1001 [puppet] - 10https://gerrit.wikimedia.org/r/468257 (https://phabricator.wikimedia.org/T205462) [08:44:28] (03CR) 10Gehel: [C: 032] maps: restrict OSM sync check to maps1001 [puppet] - 
10https://gerrit.wikimedia.org/r/468257 (https://phabricator.wikimedia.org/T205462) (owner: 10Gehel) [08:45:14] (03PS9) 10Gehel: relforge: setup 2 instances to validate multi-instance configuration [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) [08:45:21] (03CR) 10Banyek: [C: 032] mariadb: enable replication check on Parsercache hosts [puppet] - 10https://gerrit.wikimedia.org/r/467959 (https://phabricator.wikimedia.org/T206992) (owner: 10Banyek) [08:45:35] (03PS6) 10Banyek: mariadb: enable replication check on Parsercache hosts [puppet] - 10https://gerrit.wikimedia.org/r/467959 (https://phabricator.wikimedia.org/T206992) [08:45:44] (03CR) 10Banyek: [V: 032 C: 032] mariadb: enable replication check on Parsercache hosts [puppet] - 10https://gerrit.wikimedia.org/r/467959 (https://phabricator.wikimedia.org/T206992) (owner: 10Banyek) [08:47:19] 10Operations, 10DNS, 10GitHub-Mirrors, 10Traffic, 10Release-Engineering-Team (Kanban): Github: add verified domain - https://phabricator.wikimedia.org/T207364 (10hashar) [08:47:56] 10Operations, 10DNS, 10GitHub-Mirrors, 10Traffic, 10Release-Engineering-Team (Kanban): Github: add verified domain - https://phabricator.wikimedia.org/T207364 (10hashar) [08:54:32] (03PS1) 10Hashar: TXT entries for Github domain verification [dns] - 10https://gerrit.wikimedia.org/r/468279 (https://phabricator.wikimedia.org/T207364) [08:54:42] (03PS10) 10Gehel: relforge: setup 2 instances to validate multi-instance configuration [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) [08:56:22] RECOVERY - Maps - OSM synchronization lag - eqiad on einsteinium is OK: (C)1.728e+05 ge (W)9e+04 ge 3.218e+04 https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1 [08:56:36] !log enabling replication monitor check on pc2004 (T206992) [08:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:40] T206992: 
Create replication icinga check for the Parsercache hosts - https://phabricator.wikimedia.org/T206992 [09:00:37] (03PS11) 10Gehel: relforge: setup 2 instances to validate multi-instance configuration [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) [09:00:39] (03PS2) 10Effie Mouzeli: WIP: Added new role::redis::misc for general purposes redis servers [puppet] - 10https://gerrit.wikimedia.org/r/467734 (https://phabricator.wikimedia.org/T206450) [09:01:03] !log enabling replication monitor check on pc1004 (T206992) [09:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:08] 10Operations, 10Graphite, 10Patch-For-Review, 10Performance-Team (Radar), 10Zuul: Improve graphite failover - https://phabricator.wikimedia.org/T88997 (10hashar) >>! In T88997#4676648, @fgiunchedi wrote: > Since zuul doesn't seem to use/need global statsd aggregation (i.e. multiple hosts send statsd data... [09:08:07] !log powercycling ms-be2019, stuck during reboot [09:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:58] jouncebot: next [09:10:58] In 0 hour(s) and 49 minute(s): Enable Wikidata.org Lexeme Senses (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181018T1000) [09:14:31] PROBLEM - statsd UDP receive errors are elevated on graphite1001 is CRITICAL: 2.356 ge 2 https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&refresh=1m&panelId=16&fullscreen [09:18:29] !log bounce statsd-proxy on graphite1001 [09:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:32] !log enabling replication monitor check on pc1005 pc1006 pc2005 pc2006 (T206992) [09:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:35] T206992: Create replication icinga check for the Parsercache hosts - https://phabricator.wikimedia.org/T206992 [09:23:18] RECOVERY - statsd UDP receive errors are elevated on 
graphite1001 is OK: (C)2 ge (W)1 ge 0.4471 https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&refresh=1m&panelId=16&fullscreen [09:24:43] (03CR) 10Arturo Borrero Gonzalez: cloudvps: eqiad1: add cloudinstances2b virtual router FQDNs (033 comments) [dns] - 10https://gerrit.wikimedia.org/r/460320 (https://phabricator.wikimedia.org/T202886) (owner: 10Arturo Borrero Gonzalez) [09:26:53] 10Operations, 10Gadgets, 10MediaWiki-Cache, 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) Recap of what we did so far: * increased mcrouter's TCP (persistent) connections to... [09:28:15] (03PS1) 10Volans: setup.py: force urllib3 version [software/spicerack] - 10https://gerrit.wikimedia.org/r/468281 [09:28:17] (03PS1) 10Volans: tests: fix lint ignore [software/spicerack] - 10https://gerrit.wikimedia.org/r/468282 [09:29:37] (03CR) 10jerkins-bot: [V: 04-1] setup.py: force urllib3 version [software/spicerack] - 10https://gerrit.wikimedia.org/r/468281 (owner: 10Volans) [09:30:15] (03CR) 10Volans: "The only failure is the handleError, fixed in the next CR in the series" [software/spicerack] - 10https://gerrit.wikimedia.org/r/468281 (owner: 10Volans) [09:30:46] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] wmcs: add prometheus-memcached-exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/431595 (https://phabricator.wikimedia.org/T147326) (owner: 10Filippo Giunchedi) [09:33:53] (03CR) 10Volans: [V: 032 C: 032] Upgrade Netbox to upstream v2.4.6 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/467965 (https://phabricator.wikimedia.org/T205896) (owner: 10Volans) [09:44:21] (03PS1) 10Gehel: exim4: remove deprecation warnings [puppet] - 10https://gerrit.wikimedia.org/r/468285 [09:45:09] (03PS5) 10Filippo Giunchedi: wmcs: add prometheus-memcached-exporter [puppet] - 10https://gerrit.wikimedia.org/r/431595 
(https://phabricator.wikimedia.org/T147326) [09:45:21] (CR) Filippo Giunchedi: wmcs: add prometheus-memcached-exporter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/431595 (https://phabricator.wikimedia.org/T147326) (owner: Filippo Giunchedi) [09:47:29] [Heads up] Netbox upgrade in 5 minutes, any blocker? [09:49:04] (CR) Banyek: "I ran the productun catalog against" [puppet] - https://gerrit.wikimedia.org/r/467317 (https://phabricator.wikimedia.org/T207013) (owner: Jcrespo) [09:49:07] RECOVERY - puppet last run on certcentral2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:49:20] yey :) [09:49:36] (CR) Banyek: "https://puppet-compiler.wmflabs.org/compiler1002/13036/" [puppet] - https://gerrit.wikimedia.org/r/467317 (https://phabricator.wikimedia.org/T207013) (owner: Jcrespo) [09:49:40] (PS1) Gehel: sysctl::conffile: remove deprecation warnings [puppet] - https://gerrit.wikimedia.org/r/468286 [09:49:41] Krenair: ^^ after updating to 0.1-2 puppet run as expected in certcentral2001 [09:51:51] (CR) Gehel: "puppet compiler agrees this is a NOOP: https://puppet-compiler.wmflabs.org/compiler1002/13037/" [puppet] - https://gerrit.wikimedia.org/r/468285 (owner: Gehel) [09:52:21] (PS1) Addshore: Turn on Senses support for wikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/468287 (https://phabricator.wikimedia.org/T203888) [09:52:35] !log activate bgp group Customer6 on cr4-ulsfo [09:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:48] (PS1) Addshore: Remove wgLexemeEnableSenses from IS-labs (BETA ONLY) [mediawiki-config] - https://gerrit.wikimedia.org/r/468288 [09:54:23] (CR) Gehel: [C: 1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/468281 (owner: Volans) [09:54:27] (CR) Muehlenhoff: [C: 1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/468253 (owner: Giuseppe Lavagetto)
[09:55:03] (CR) Gehel: [C: 1] "LGTM (let's hope it does not break again too soon!)" [software/spicerack] - https://gerrit.wikimedia.org/r/468282 (owner: Volans) [09:56:55] (PS1) Addshore: Combine if blocks in Wikibase-production NOOP [mediawiki-config] - https://gerrit.wikimedia.org/r/468290 [09:57:14] (CR) Arturo Borrero Gonzalez: [C: 1] "LGTM :-)" [puppet] - https://gerrit.wikimedia.org/r/431595 (https://phabricator.wikimedia.org/T147326) (owner: Filippo Giunchedi) [09:57:43] !log volans@deploy1001 Started deploy [netbox/deploy@438f1c0]: Upgrade to upstream v2.4.6 - T205896 [09:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:47] T205896: Netbox: upgrade to the latest version (>= 2.4) - https://phabricator.wikimedia.org/T205896 [09:59:05] (PS1) Gehel: admin::user: fix deprecation warnings [puppet] - https://gerrit.wikimedia.org/r/468291 [09:59:22] (CR) Gehel: "Puppet compiler agrees this is a noop: https://puppet-compiler.wmflabs.org/compiler1002/13038/" [puppet] - https://gerrit.wikimedia.org/r/468286 (owner: Gehel) [09:59:30] jouncebot: Next [09:59:30] In 0 hour(s) and 0 minute(s): Enable Wikidata.org Lexeme Senses (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181018T1000) [09:59:45] (CR) Addshore: [C: 2] Turn on Senses support for wikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/468287 (https://phabricator.wikimedia.org/T203888) (owner: Addshore) [10:00:04] addshore: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Enable Wikidata.org Lexeme Senses. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181018T1000). [10:00:10] haha [10:00:15] good #bothumor [10:00:30] tarrow: ^^ hee [10:00:51] haha!
[10:00:51] !log volans@deploy1001 Finished deploy [netbox/deploy@438f1c0]: Upgrade to upstream v2.4.6 - T205896 (duration: 03m 07s) [10:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:09] (Merged) jenkins-bot: Turn on Senses support for wikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/468287 (https://phabricator.wikimedia.org/T203888) (owner: Addshore) [10:01:13] [re:netbox: still missing something, we'll send a patch and redeploy] [10:01:31] vgutierrez, do you know why systemd randomly recovered on certcentral1001 last night? [10:02:05] icinga-wm said at 00:39:36 UTC+1 [10:02:28] Krenair: hmmm nope, actually is still down due to the invalid config [10:02:46] PROBLEM - puppet last run on ms-be2014 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 7 minutes ago with 6 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdc1],Exec[mkfs-/dev/sdg1],Exec[xfs_label-/dev/sdm3],Exec[xfs_label-/dev/sdn3] [10:03:17] PROBLEM - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 312 bytes in 0.009 second response time [10:03:28] volans: ^^ expected? [10:03:45] vgutierrez: kinda [10:03:47] I'm fixing it [10:06:20] (PS1) Volans: Fix submodule paths [software/netbox-deploy] - https://gerrit.wikimedia.org/r/468293 [10:07:11] (CR) Volans: [V: 2 C: 2] Fix submodule paths [software/netbox-deploy] - https://gerrit.wikimedia.org/r/468293 (owner: Volans) [10:07:48] Operations, Cloud-Services, netops: Consider renumbering Labs to separate address spaces - https://phabricator.wikimedia.org/T122406 (aborrero) Agreed.
[10:07:53] !log volans@deploy1001 Started deploy [netbox/deploy@1cd4d43]: Upgrade to upstream v2.4.6 (2) - T205896 [10:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:56] T205896: Netbox: upgrade to the latest version (>= 2.4) - https://phabricator.wikimedia.org/T205896 [10:09:54] !log volans@deploy1001 Finished deploy [netbox/deploy@1cd4d43]: Upgrade to upstream v2.4.6 (2) - T205896 (duration: 02m 01s) [10:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:08] damn netbox upgrade script... debugging :( [10:10:25] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable senses on wikidatawiki T203888 (duration: 00m 53s) [10:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:28] T203888: Turn on Sense support on Wikidata - https://phabricator.wikimedia.org/T203888 [10:10:49] jouncebot, next [10:10:49] In 0 hour(s) and 49 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181018T1100) [10:11:56] !log volans@deploy1001 Started deploy [netbox/deploy@1cd4d43]: Upgrade to upstream v2.4.6 (3) - T205896 [10:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:25] !log volans@deploy1001 Finished deploy [netbox/deploy@1cd4d43]: Upgrade to upstream v2.4.6 (3) - T205896 (duration: 00m 29s) [10:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:37] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:15:17] (CR) jenkins-bot: Turn on Senses support for wikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/468287 (https://phabricator.wikimedia.org/T203888) (owner: Addshore) [10:15:36] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:15:52] !log purging wikidata lexemes [10:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:59] sorry for the netbox spam [10:16:05] netmon* too [10:19:15] (PS1) Volans: netbox: update submodule path [puppet] - https://gerrit.wikimedia.org/r/468295 (https://phabricator.wikimedia.org/T205896) [10:20:10] (CR) Volans: [C: 2] netbox: update submodule path [puppet] - https://gerrit.wikimedia.org/r/468295 (https://phabricator.wikimedia.org/T205896) (owner: Volans) [10:22:17] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational [10:22:27] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational [10:22:32] (PS18) Alex Monk: Certcentral-authdns integration [puppet] - https://gerrit.wikimedia.org/r/459809 (https://phabricator.wikimedia.org/T194962) [10:23:17] RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 346 bytes in 0.009 second response time [10:23:46] (CR) Addshore: [C: 2] Remove wgLexemeEnableSenses from IS-labs (BETA ONLY) [mediawiki-config] - https://gerrit.wikimedia.org/r/468288 (owner: Addshore) [10:23:58] (CR) Addshore: [C: 2] Combine if blocks in Wikibase-production NOOP [mediawiki-config] - https://gerrit.wikimedia.org/r/468290 (owner: Addshore) [10:25:03] (Merged) jenkins-bot: Remove wgLexemeEnableSenses from IS-labs (BETA ONLY) [mediawiki-config] - https://gerrit.wikimedia.org/r/468288 (owner: Addshore) [10:25:06] (Merged) jenkins-bot: Combine if blocks in Wikibase-production NOOP [mediawiki-config] -
https://gerrit.wikimedia.org/r/468290 (owner: Addshore) [10:27:11] (PS8) Alex Monk: Add make_account CLI script [software/certcentral] - https://gerrit.wikimedia.org/r/457933 (https://phabricator.wikimedia.org/T207372) [10:28:33] !log volans@deploy1001 Started deploy [netbox/deploy@1cd4d43]: Upgrade to upstream v2.4.6 (4) - T205896 [10:28:36] (PS3) Effie Mouzeli: Added new role::redis::misc for general purposes redis servers [puppet] - https://gerrit.wikimedia.org/r/467734 (https://phabricator.wikimedia.org/T206450) [10:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:37] T205896: Netbox: upgrade to the latest version (>= 2.4) - https://phabricator.wikimedia.org/T205896 [10:28:38] !log volans@deploy1001 Finished deploy [netbox/deploy@1cd4d43]: Upgrade to upstream v2.4.6 (4) - T205896 (duration: 00m 05s) [10:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:08] (PS11) Alex Monk: Remove maximum version for acme library [software/certcentral] - https://gerrit.wikimedia.org/r/459866 (https://phabricator.wikimedia.org/T207373) [10:29:46] !log mobrovac@deploy1001 Started deploy [restbase/deploy@88c8f26]: Parallelise onthisday call - T203588 [10:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:49] T203588: Feed checks timeout on RESTBase deploy - https://phabricator.wikimedia.org/T203588 [10:30:02] !log volans@deploy1001 Started deploy [netbox/deploy@1cd4d43]: Upgrade to upstream v2.4.6 (5) - T205896 [10:30:04] (CR) Vgutierrez: [C: -1] Add make_account CLI script (1 comment) [software/certcentral] - https://gerrit.wikimedia.org/r/457933 (https://phabricator.wikimedia.org/T207372) (owner: Alex Monk) [10:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:24] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: BETA ONLY Remove wgLexemeEnableSenses from IS-labs
(duration: 00m 53s) [10:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:28] (PS3) Alex Monk: [WIP] Check for outdated/expired certs in the main loop [software/certcentral] - https://gerrit.wikimedia.org/r/460397 (https://phabricator.wikimedia.org/T207374) [10:31:35] (CR) jerkins-bot: [V: -1] [WIP] Check for outdated/expired certs in the main loop [software/certcentral] - https://gerrit.wikimedia.org/r/460397 (https://phabricator.wikimedia.org/T207374) (owner: Alex Monk) [10:31:39] !log volans@deploy1001 Finished deploy [netbox/deploy@1cd4d43]: Upgrade to upstream v2.4.6 (5) - T205896 (duration: 01m 37s) [10:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:45] !log volans@deploy1001 Started deploy [netbox/deploy@1cd4d43]: Upgrade to upstream v2.4.6 (5) - T205896 [10:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:08] ok found the issue, scap was not configured to deploy one at the time, I'll fix that, but we should be good in a moment [10:32:14] !log volans@deploy1001 Finished deploy [netbox/deploy@1cd4d43]: Upgrade to upstream v2.4.6 (5) - T205896 (duration: 00m 29s) [10:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:52] !log addshore@deploy1001 Synchronized wmf-config/Wikibase-production.php: Combine if blocks in Wikibase-production NOOP (duration: 00m 53s) [10:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:11] (PS1) Volans: netbox: fix static path for submodule [puppet] - https://gerrit.wikimedia.org/r/468298 [10:36:03] (CR) Volans: [C: 2] netbox: fix static path for submodule [puppet] - https://gerrit.wikimedia.org/r/468298 (owner: Volans) [10:38:47] PROBLEM - MariaDB Slave Lag: s2 on db1095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.50 seconds [10:38:53] (CR) jenkins-bot: Remove wgLexemeEnableSenses from
IS-labs (BETA ONLY) [mediawiki-config] - https://gerrit.wikimedia.org/r/468288 (owner: Addshore) [10:38:55] (CR) jenkins-bot: Combine if blocks in Wikibase-production NOOP [mediawiki-config] - https://gerrit.wikimedia.org/r/468290 (owner: Addshore) [10:39:23] db1095 is a backup source I believe? [10:39:37] no it is not [10:39:42] (CR) Mobrovac: scap::target: added additional_services_names param (2 comments) [puppet] - https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) (owner: Mathew.onipe) [10:40:39] ah yes [10:40:40] it is [10:41:00] Amir1: your script is creating some lag on that host I think [10:41:09] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@88c8f26]: Parallelise onthisday call - T203588 (duration: 11m 24s) [10:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:12] T203588: Feed checks timeout on RESTBase deploy - https://phabricator.wikimedia.org/T203588 [10:41:22] let me check [10:41:27] is it hitting s2 now? [10:41:28] or recently? [10:41:29] it's very likely [10:41:35] Yeah, I see the deletes there [10:41:40] !log mobrovac@deploy1001 Started deploy [restbase/deploy@88c8f26]: Parallelise onthisday call, take #2 [10:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:48] the updates, sorry [10:42:01] [FYI] Netbox upgrade completed, let me know if you encounter any issue [10:42:04] and sorry for a bit of spam [10:42:10] marostegui: should I stop the thing? [10:42:27] Amir1: only the backup source is lagging, I think we are fine for now [10:42:32] I am going to downtime it for 1h [10:42:49] Thanks. I think we would hit it again too [10:43:20] when do you expect it to be finished?
[10:43:21] Large wikis, probably we will have the same thing with other s2 wikis, or s7 [10:43:27] by tomorrow [10:46:23] (PS9) Alex Monk: Add make_account CLI script [software/certcentral] - https://gerrit.wikimedia.org/r/457933 (https://phabricator.wikimedia.org/T207372) [10:48:36] (PS1) Volans: Deploy one host at a time [software/netbox-deploy] - https://gerrit.wikimedia.org/r/468299 [10:49:00] (CR) Volans: [V: 2 C: 2] Deploy one host at a time [software/netbox-deploy] - https://gerrit.wikimedia.org/r/468299 (owner: Volans) [10:49:05] (PS4) Effie Mouzeli: Added new role::redis::misc for general purposes redis servers [puppet] - https://gerrit.wikimedia.org/r/467734 (https://phabricator.wikimedia.org/T206450) [10:49:06] RECOVERY - MariaDB Slave Lag: s2 on db1095 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [10:49:11] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@88c8f26]: Parallelise onthisday call, take #2 (duration: 07m 32s) [10:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:19] !log mobrovac@deploy1001 Started deploy [restbase/deploy@88c8f26]: Parallelise onthisday call, take #3 [10:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:27] PROBLEM - DPKG on labnet1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:49:40] Operations, Operations-Software-Development, Goal: Expand Netbox usage - Q2 2018-19 Goal - https://phabricator.wikimedia.org/T205868 (Volans) [10:49:44] Operations, Patch-For-Review: Netbox: upgrade to the latest version (>= 2.4) - https://phabricator.wikimedia.org/T205896 (Volans) Open>Resolved a:Volans Netbox has been upgraded to upstream 2.4.6. Report any issue you might found.
[10:49:48] PROBLEM - Disk space on labnet1001 is CRITICAL: DISK CRITICAL - free space: /boot 10 MB (3% inode=99%) [10:49:58] Operations, Operations-Software-Development, Goal: Expand Netbox usage - Q2 2018-19 Goal - https://phabricator.wikimedia.org/T205868 (Volans) [10:50:03] ^ that's me fixing, it [10:50:16] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [10:50:24] (CR) Volans: [V: 2 C: 2] setup.py: force urllib3 version [software/spicerack] - https://gerrit.wikimedia.org/r/468281 (owner: Volans) [10:50:42] (CR) Volans: [C: 2] tests: fix lint ignore [software/spicerack] - https://gerrit.wikimedia.org/r/468282 (owner: Volans) [10:50:58] RECOVERY - Disk space on labnet1001 is OK: DISK OK [10:51:17] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [10:51:31] (PS5) Elukey: profile::analytics::refinery::job::sqoop_mw: move to timers [puppet] - https://gerrit.wikimedia.org/r/468251 (https://phabricator.wikimedia.org/T172532) [10:51:41] Amir1: I think it finished on db1095 already [10:51:47] !log addshore@deploy1001 Synchronized php-1.32.0-wmf.24/extensions/WikibaseLexeme: Wikidata: Make statement group IDs on Senses unique (duration: 00m 59s) [10:51:47] (PS1) Addshore: RejectParserCacheValue Wikidata lexemes before sense deployment [mediawiki-config] - https://gerrit.wikimedia.org/r/468301 (https://phabricator.wikimedia.org/T203888) [10:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:01] (CR) jerkins-bot: [V: -1] tests: fix lint ignore [software/spicerack] - https://gerrit.wikimedia.org/r/468282 (owner: Volans) [10:52:12] it's very likely, I think it would hit it again (as there are other large wikis there) but it can wait [10:52:13] what now jenkins...
[10:52:13] (CR) Addshore: [C: 2] RejectParserCacheValue Wikidata lexemes before sense deployment [mediawiki-config] - https://gerrit.wikimedia.org/r/468301 (https://phabricator.wikimedia.org/T203888) (owner: Addshore) [10:52:36] (CR) Volans: [C: 2] "recheck" [software/spicerack] - https://gerrit.wikimedia.org/r/468282 (owner: Volans) [10:52:38] Amir1: yeah, it is hitting s3 now on that same host [10:52:48] RECOVERY - DPKG on labnet1001 is OK: All packages OK [10:53:10] s3 wikis should not cause mu issue, they are pretty small [10:53:16] (Merged) jenkins-bot: RejectParserCacheValue Wikidata lexemes before sense deployment [mediawiki-config] - https://gerrit.wikimedia.org/r/468301 (https://phabricator.wikimedia.org/T203888) (owner: Addshore) [10:53:32] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@88c8f26]: Parallelise onthisday call, take #3 (duration: 04m 13s) [10:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:37] it is running for enwikisource [10:53:46] !log mobrovac@deploy1001 Started deploy [restbase/deploy@88c8f26]: Parallelise onthisday call, take #4 [10:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:10] !log addshore@deploy1001 sync-file aborted: RejectParserCacheValue Wikidata lexemes before sense deploymentT203888 (duration: 00m 00s) [10:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:35] (Merged) jenkins-bot: tests: fix lint ignore [software/spicerack] - https://gerrit.wikimedia.org/r/468282 (owner: Volans) [10:54:39] missed the space between the commit message and the ticket... [10:54:45] (CR) Volans: [C: 1] "LGTM, thanks for taking care of this!"
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/468291 (owner: Gehel) [10:55:09] !log addshore@deploy1001 Synchronized wmf-config/CommonSettings.php: RejectParserCacheValue Wikidata lexemes before sense deployment T203888 (duration: 00m 54s) [10:55:11] (PS2) Volans: Fix typo in README.rst [software/spicerack] - https://gerrit.wikimedia.org/r/467922 (owner: DCausse) [10:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:19] T203888: Turn on Sense support on Wikidata - https://phabricator.wikimedia.org/T203888 [10:56:14] (CR) jenkins-bot: RejectParserCacheValue Wikidata lexemes before sense deployment [mediawiki-config] - https://gerrit.wikimedia.org/r/468301 (https://phabricator.wikimedia.org/T203888) (owner: Addshore) [10:57:38] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@88c8f26]: Parallelise onthisday call, take #4 (duration: 03m 52s) [10:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:45] !log addshore@mwmaint1002:~$ mwscript purgeList.php --wiki wikidatawiki --namespace 146 [10:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:11] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/13043/an-coord1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/468251 (https://phabricator.wikimedia.org/T172532) (owner: Elukey) [10:59:26] !log wikidata senses deploy slot done [10:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181018T1100). [11:00:05] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed.
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:08] here [11:00:16] (PS6) Elukey: profile::analytics::refinery::job::sqoop_mw: move to timers [puppet] - https://gerrit.wikimedia.org/r/468251 (https://phabricator.wikimedia.org/T172532) [11:00:40] * addshore can not swat [11:00:50] (CR) Volans: [C: 2] Fix typo in README.rst [software/spicerack] - https://gerrit.wikimedia.org/r/467922 (owner: DCausse) [11:01:17] zeljkof, are you here to SWAT? [11:02:01] (CR) jerkins-bot: [V: -1] Fix typo in README.rst [software/spicerack] - https://gerrit.wikimedia.org/r/467922 (owner: DCausse) [11:02:34] (PS7) Elukey: profile::analytics::refinery::job::sqoop_mw: move to timers [puppet] - https://gerrit.wikimedia.org/r/468251 (https://phabricator.wikimedia.org/T172532) [11:02:40] I can SWAT today [11:05:20] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/13045/" [puppet] - https://gerrit.wikimedia.org/r/468251 (https://phabricator.wikimedia.org/T172532) (owner: Elukey) [11:05:34] great! [11:05:41] (PS1) Addshore: Wikidata.org: enable sense data type [mediawiki-config] - https://gerrit.wikimedia.org/r/468304 (https://phabricator.wikimedia.org/T203888) [11:05:42] please ping me when I'll be needed [11:05:48] zeljkof: can I squeeze one more out? [11:05:52] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/468225 (https://phabricator.wikimedia.org/T205226) (owner: Urbanecm) [11:05:58] sorry, left over from the previous slot, we missed 1 thing [11:06:12] if not I can wait until after :) [11:06:14] addshore: is it urgent? do you want to go now, or at the end? [11:06:23] I'll go at the end, please ping me :) [11:06:41] addshore: will do!
just add the commit to the calendar so I don't forget :) [11:07:08] (Merged) jenkins-bot: Upload uz specific wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/468225 (https://phabricator.wikimedia.org/T205226) (owner: Urbanecm) [11:08:18] (CR) Elukey: [C: -2] "Sorry I didn't see this! I'll take care of this after https://gerrit.wikimedia.org/r/#/c/468251/ (I am in the middle of a refactoring to m" [puppet] - https://gerrit.wikimedia.org/r/467320 (https://phabricator.wikimedia.org/T205940) (owner: Fdans) [11:08:27] Urbanecm: 468225 at mwdebug1002 [11:08:52] please deploy [11:09:04] ok [11:09:16] PROBLEM - DPKG on labsdb1009 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:10:28] !log zfilipin@deploy1001 Synchronized static/images/mobile/copyright/: SWAT: [[gerrit:468225|Upload uz specific wordmark (T205226)]] (duration: 00m 54s) [11:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:34] T205226: Change English-language logos in Uzbek Wikipedia - https://phabricator.wikimedia.org/T205226 [11:10:40] Urbanecm: deployed [11:10:44] ack [11:10:49] (CR) Elukey: [C: 1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/467974 (owner: Muehlenhoff) [11:11:12] (PS4) Zfilipin: Use new wordmarks in uzwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/468226 (https://phabricator.wikimedia.org/T205226) (owner: Urbanecm) [11:11:21] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/468226 (https://phabricator.wikimedia.org/T205226) (owner: Urbanecm) [11:11:23] (CR) Elukey: [C: 1] "Thanks!"
[puppet] - https://gerrit.wikimedia.org/r/467977 (owner: Muehlenhoff) [11:11:58] (CR) jenkins-bot: Upload uz specific wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/468225 (https://phabricator.wikimedia.org/T205226) (owner: Urbanecm) [11:13:05] (Merged) jenkins-bot: Use new wordmarks in uzwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/468226 (https://phabricator.wikimedia.org/T205226) (owner: Urbanecm) [11:13:43] Urbanecm: 468226 at mwdebug1002 [11:13:50] reviewing [11:14:20] (PS6) Elukey: wmcs: add prometheus-memcached-exporter [puppet] - https://gerrit.wikimedia.org/r/431595 (https://phabricator.wikimedia.org/T147326) (owner: Filippo Giunchedi) [11:15:10] (CR) Volans: [C: 2] "recheck" [software/spicerack] - https://gerrit.wikimedia.org/r/467922 (owner: DCausse) [11:15:18] Operations, cloud-services-team: Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (MoritzMuehlenhoff) [11:15:34] are you sure zeljkof ? [11:15:53] Urbanecm: let me double check :/ [11:16:10] (CR) Elukey: "Should we merge this? Metrics should now be automagically available to https://grafana.wikimedia.org/dashboard/db/memcache afterwards (and" [puppet] - https://gerrit.wikimedia.org/r/431595 (https://phabricator.wikimedia.org/T147326) (owner: Filippo Giunchedi) [11:16:26] thx [11:16:46] (Merged) jenkins-bot: Fix typo in README.rst [software/spicerack] - https://gerrit.wikimedia.org/r/467922 (owner: DCausse) [11:16:54] Urbanecm: fairly sure now, "Use new wordmarks in uzwiki" is at deploy1001, and I ran `scap pull` at mwdebug1002 [11:17:17] RECOVERY - Backup of s8 in eqiad on db1115 is OK: Backup for s8 at eqiad taken less than 8 days ago and larger than 10 GB: Last one 2018-10-18 09:57:37 from db1116.eqiad.wmnet:3318 (101 GB) [11:19:52] zeljkof, hmm... wait a moment, looking into it [11:21:24] I have it [11:21:30] ukwiki != uzwiki :( [11:21:43] zeljkof, ^^ [11:21:43] ah, the patch is wrong?
[11:21:47] yes :D [11:21:49] I didn't notice it [11:21:53] will upload a fix, wait a sec [11:23:30] (PS1) Urbanecm: [typo] Fix a typo in copyright logos definition, ukwiki => uzwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/468307 (https://phabricator.wikimedia.org/T205226) [11:23:36] zeljkof, please review and merge [11:24:25] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/468307 (https://phabricator.wikimedia.org/T205226) (owner: Urbanecm) [11:25:40] zeljkof, should I add it to the calendar? [11:25:40] (Merged) jenkins-bot: [typo] Fix a typo in copyright logos definition, ukwiki => uzwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/468307 (https://phabricator.wikimedia.org/T205226) (owner: Urbanecm) [11:25:47] Urbanecm: please do [11:25:51] will do [11:26:46] done [11:26:47] Urbanecm: 468307 at mwdebug1002 [11:27:24] now it works, please deploy [11:27:36] (CR) jenkins-bot: Use new wordmarks in uzwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/468226 (https://phabricator.wikimedia.org/T205226) (owner: Urbanecm) [11:27:38] (CR) jenkins-bot: [typo] Fix a typo in copyright logos definition, ukwiki => uzwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/468307 (https://phabricator.wikimedia.org/T205226) (owner: Urbanecm) [11:27:57] ok [11:28:45] (CR) Gehel: "Puppet compiler agrees this is a noop: https://puppet-compiler.wmflabs.org/compiler1002/13039/" [puppet] - https://gerrit.wikimedia.org/r/468291 (owner: Gehel) [11:29:02] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:468226|Use new wordmarks in uzwiki (T205226)]] (duration: 00m 53s) [11:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:05] T205226: Change English-language logos in Uzbek Wikipedia - https://phabricator.wikimedia.org/T205226 [11:29:28] Urbanecm: deployed [11:30:28] Operations, SRE-Access-Requests:
Requesting access to production servers for kharlan - https://phabricator.wikimedia.org/T207330 (Krenair) Sounds like the restricted group would be the one for this [11:31:27] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/467509 (https://phabricator.wikimedia.org/T207089) (owner: Urbanecm) [11:31:47] zeljkof, did you say anything in last few seconds-minutes? I had internet connection problem here [11:32:13] Urbanecm: probably that 468307 is deployed :) [11:32:28] (PS4) Zfilipin: Use testwikidatawiki instead of testwikidata in IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/467509 (https://phabricator.wikimedia.org/T207089) (owner: Urbanecm) [11:32:32] ok, thx [11:32:33] (CR) Zfilipin: Use testwikidatawiki instead of testwikidata in IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/467509 (https://phabricator.wikimedia.org/T207089) (owner: Urbanecm) [11:32:42] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/467509 (https://phabricator.wikimedia.org/T207089) (owner: Urbanecm) [11:33:13] !log mobrovac@deploy1001 Started deploy [restbase/deploy@1041a02]: Disable onthisday check - T203588 [11:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:16] T203588: Feed checks timeout on RESTBase deploy - https://phabricator.wikimedia.org/T203588 [11:33:44] (Merged) jenkins-bot: Use testwikidatawiki instead of testwikidata in IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/467509 (https://phabricator.wikimedia.org/T207089) (owner: Urbanecm) [11:34:25] Urbanecm: 467509 at mwdebug1002 [11:37:54] it is untestable, as testwikidatawiki and testwikidata is the same [11:38:06] (so it does the same, it's just standardization) [11:38:14] zeljkof, ^ [11:38:19] (sorry for delay, missed the message) [11:38:28] Urbanecm: ok to deploy?
[11:38:31] yes [11:39:05] deploying [11:39:21] ack [11:39:46] (PS1) Effie Mouzeli: Added dummy pass for role redis::misc::master [labs/private] - https://gerrit.wikimedia.org/r/468310 (https://phabricator.wikimedia.org/T206450) [11:39:54] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:467509|Use testwikidatawiki instead of testwikidata in IS.php (T207089)]] (duration: 00m 53s) [11:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:58] T207089: Use testwikidatawiki instead of testwikidata in IS.php - https://phabricator.wikimedia.org/T207089 [11:40:03] Urbanecm: deployed [11:40:07] thanks [11:41:05] (PS3) Zfilipin: Fix typo in IS.php: use ltwiki instead of ltwikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/467453 (https://phabricator.wikimedia.org/T207081) (owner: Urbanecm) [11:41:46] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/467453 (https://phabricator.wikimedia.org/T207081) (owner: Urbanecm) [11:43:00] (Merged) jenkins-bot: Fix typo in IS.php: use ltwiki instead of ltwikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/467453 (https://phabricator.wikimedia.org/T207081) (owner: Urbanecm) [11:44:16] (CR) Volans: [C: 1] "LGTM" [labs/private] - https://gerrit.wikimedia.org/r/468310 (https://phabricator.wikimedia.org/T206450) (owner: Effie Mouzeli) [11:44:20] Urbanecm: 467453 at mwdebug1002 [11:44:49] ack [11:48:06] zeljkof, please deploy [11:48:13] ok [11:48:51] addshore: on the last patch, you're next, stand by ;) [11:49:23] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:467453|Fix typo in IS.php: use ltwiki instead of ltwikipedia (T207081)]] (duration: 00m 54s) [11:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:26] T207081: Fix typo in IS.php: use ltwiki instead of ltwikipedia -
https://phabricator.wikimedia.org/T207081 [11:49:37] Urbanecm: deployed [11:49:41] ack [11:49:46] 10Operations, 10Office-IT: Request for email address seniori@wikimedia.org - https://phabricator.wikimedia.org/T160400 (10Urbanecm) [11:49:47] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467412 (https://phabricator.wikimedia.org/T207053) (owner: 10Urbanecm) [11:50:01] 10Operations, 10Commons, 10Multimedia, 10media-storage: Metro Mad Linea 7.png file half-disappeared - it can't be used - https://phabricator.wikimedia.org/T153540 (10Urbanecm) [11:50:13] zeljkof: cool [11:50:39] (03CR) 10jenkins-bot: Use testwikidatawiki instead of testwikidata in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467509 (https://phabricator.wikimedia.org/T207089) (owner: 10Urbanecm) [11:50:43] (03CR) 10jenkins-bot: Fix typo in IS.php: use ltwiki instead of ltwikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467453 (https://phabricator.wikimedia.org/T207081) (owner: 10Urbanecm) [11:51:00] 10Operations, 10Wikimedia-Mailing-lists: Non-working archive for wikimediacz-l list - https://phabricator.wikimedia.org/T205380 (10Urbanecm) [11:51:11] 10Operations, 10Release-Engineering-Team, 10SRE-Access-Requests: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830 (10Urbanecm) [11:51:19] (03CR) 10jenkins-bot: Test if logo specified in wgLogo/wgLogoHD exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467412 (https://phabricator.wikimedia.org/T207053) (owner: 10Urbanecm) [11:51:32] zeljkof, please push this directly to prod [11:51:45] there is nothing to test [11:52:18] Urbanecm: 467412 can not be tested? 
[11:52:29] exactly [11:52:35] ok [11:53:14] (03CR) 10Effie Mouzeli: [V: 032] Added dummy pass for role redis::misc::master [labs/private] - 10https://gerrit.wikimedia.org/r/468310 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [11:53:23] (03CR) 10Effie Mouzeli: [V: 032 C: 032] Added dummy pass for role redis::misc::master [labs/private] - 10https://gerrit.wikimedia.org/r/468310 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [11:54:00] !log zfilipin@deploy1001 Synchronized tests/InitialiseSettingsTest.php: SWAT: [[gerrit:467412|Test if logo specified in wgLogo/wgLogoHD exists (T207053)]] (duration: 00m 53s) [11:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:05] T207053: Test if logo specified in wgLogo/wgLogoHD exists - https://phabricator.wikimedia.org/T207053 [11:54:12] Urbanecm: deployed! [11:54:17] thanks! [11:54:36] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@1041a02]: Disable onthisday check - T203588 (duration: 21m 23s) [11:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:39] T203588: Feed checks timeout on RESTBase deploy - https://phabricator.wikimedia.org/T203588 [11:54:54] addshore: swat is yours! [11:54:58] 10Operations, 10SRE-Access-Requests: Requesting access to production servers for kharlan - https://phabricator.wikimedia.org/T207330 (10Dzahn) Since the request is specifically for viewing logs i recommend using the group "mw-log-readers" which was made specifically for this purpose. It gives access to mwlog*... 
[11:55:21] thanks [11:55:37] (03CR) 10Addshore: [C: 032] Wikidata.org: enable sense data type [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468304 (https://phabricator.wikimedia.org/T203888) (owner: 10Addshore) [11:56:10] (03PS2) 10Addshore: Wikidata.org: enable sense data type [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468304 (https://phabricator.wikimedia.org/T203888) [11:56:28] (03CR) 10Addshore: [C: 032] Wikidata.org: enable sense data type [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468304 (https://phabricator.wikimedia.org/T203888) (owner: 10Addshore) [11:56:41] 10Operations, 10SRE-Access-Requests: Requesting access to production servers (mwlog*, mmaint* ?) for kharlan - https://phabricator.wikimedia.org/T207330 (10Dzahn) [11:57:40] (03Merged) 10jenkins-bot: Wikidata.org: enable sense data type [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468304 (https://phabricator.wikimedia.org/T203888) (owner: 10Addshore) [11:59:11] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: Wikidata.org: enable sense data type T203888 (duration: 00m 54s) [11:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:16] T203888: Turn on Sense support on Wikidata - https://phabricator.wikimedia.org/T203888 [11:59:17] !log SWAT done [11:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:23] zeljkof: all done [11:59:38] addshore: cool! 
[11:59:48] just in time, like we're Germans ;) [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181018T1200) [12:00:20] (03CR) 10Gehel: scap::target: added additional_services_names param (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) (owner: 10Mathew.onipe) [12:01:46] (03PS2) 10Gehel: admin::user: fix deprecation warnings [puppet] - 10https://gerrit.wikimedia.org/r/468291 [12:03:13] (03CR) 10Mobrovac: scap::target: added additional_services_names param (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) (owner: 10Mathew.onipe) [12:03:14] (03CR) 10Gehel: admin::user: fix deprecation warnings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/468291 (owner: 10Gehel) [12:04:28] (03CR) 10Gehel: scap::target: added additional_services_names param (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) (owner: 10Mathew.onipe) [12:05:07] (03CR) 10jenkins-bot: Wikidata.org: enable sense data type [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468304 (https://phabricator.wikimedia.org/T203888) (owner: 10Addshore) [12:05:36] (03PS3) 10Gehel: admin::user: fix deprecation warnings [puppet] - 10https://gerrit.wikimedia.org/r/468291 [12:06:16] (03CR) 10Gehel: [C: 032] admin::user: fix deprecation warnings [puppet] - 10https://gerrit.wikimedia.org/r/468291 (owner: 10Gehel) [12:11:09] 10Operations, 10SRE-Access-Requests: Requesting access to production servers (mwlog*, mmaint* ?) for kharlan - https://phabricator.wikimedia.org/T207330 (10Dzahn) I saw we also have "maintenance-log-readers". It allows for access to mwmaint* hosts and reading logs (includes running journcalctl, dmesg, anythin... 
[12:11:46] PROBLEM - statsd UDP receive errors are elevated on graphite1001 is CRITICAL: 4.452 ge 2 https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&refresh=1m&panelId=16&fullscreen [12:12:22] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/468285 (owner: 10Gehel) [12:13:01] (03CR) 10Alexandros Kosiaris: [C: 031] Restrict ferm service package_builder_rsync to domain networks [puppet] - 10https://gerrit.wikimedia.org/r/467976 (owner: 10Muehlenhoff) [12:14:12] (03CR) 10Volans: [C: 031] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/468286 (owner: 10Gehel) [12:14:32] 10Operations, 10SRE-Access-Requests: Requesting access to production servers (mwlog*, mmaint* ?) for kharlan - https://phabricator.wikimedia.org/T207330 (10Krenair) How are you going to run mwrepl without the ability to sudo as www-data? [12:15:10] (03CR) 10Gehel: sysctl::conffile: remove deprecation warnings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/468286 (owner: 10Gehel) [12:15:29] (03PS2) 10Gehel: sysctl::conffile: remove deprecation warnings [puppet] - 10https://gerrit.wikimedia.org/r/468286 [12:16:04] (03CR) 10Elukey: "I didn't try to have a single template for both commands since they are (afaics) different and it might become cumbersome to keep everythi" [puppet] - 10https://gerrit.wikimedia.org/r/468251 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [12:16:36] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Echoing Krinkle. 
statsd and graphite are completely different protocols, they can not be used interchangeably" [puppet] - 10https://gerrit.wikimedia.org/r/468182 (https://phabricator.wikimedia.org/T88997) (owner: 10Awight) [12:17:01] (03PS2) 10Volans: mediawiki: kill also HHVM on stop_cronjobs [software/spicerack] - 10https://gerrit.wikimedia.org/r/467313 (https://phabricator.wikimedia.org/T207014) [12:17:04] 10Operations, 10Wikimedia-Mailing-lists: Make aklapper a co-admin of the list-admins@ mailing list - https://phabricator.wikimedia.org/T207239 (10Dzahn) 05Open>03Resolved a:03Dzahn Done. I logged in at the admin interface using the master password from pwstore and added aklapper@ as an additional admin.... [12:18:01] (03CR) 10Gehel: [C: 032] sysctl::conffile: remove deprecation warnings [puppet] - 10https://gerrit.wikimedia.org/r/468286 (owner: 10Gehel) [12:18:31] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: kill also HHVM on stop_cronjobs [software/spicerack] - 10https://gerrit.wikimedia.org/r/467313 (https://phabricator.wikimedia.org/T207014) (owner: 10Volans) [12:18:48] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/467313 (https://phabricator.wikimedia.org/T207014) (owner: 10Volans) [12:19:27] RECOVERY - statsd UDP receive errors are elevated on graphite1001 is OK: (C)2 ge (W)1 ge 0.4719 https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&refresh=1m&panelId=16&fullscreen [12:19:46] (03CR) 10Dzahn: "It works as intended. intention is "run once per hour at a random time within that hour". Not "run every random number of minutes". 
Are yo" [puppet] - 10https://gerrit.wikimedia.org/r/468002 (owner: 10Paladox) [12:20:13] (03PS2) 10Gehel: exim4: remove deprecation warnings [puppet] - 10https://gerrit.wikimedia.org/r/468285 [12:21:14] (03CR) 10Gehel: [C: 032] exim4: remove deprecation warnings [puppet] - 10https://gerrit.wikimedia.org/r/468285 (owner: 10Gehel) [12:21:30] (03CR) 10Dzahn: [C: 04-2] "per comments above, just to make it clear this should not be merged yet (and separately needs to be amended)" [dns] - 10https://gerrit.wikimedia.org/r/467425 (https://phabricator.wikimedia.org/T207052) (owner: 10Urbanecm) [12:22:21] (03Abandoned) 10MGChecker: Allow creation of TemplateStyles in Module namspace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467123 (https://phabricator.wikimedia.org/T200914) (owner: 10MGChecker) [12:22:33] (03PS3) 10Muehlenhoff: Restrict ferm service package_builder_rsync to domain networks [puppet] - 10https://gerrit.wikimedia.org/r/467976 [12:22:41] (03CR) 10Dzahn: [C: 04-1] "still confused whether it's "vi" or "vn"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467528 (https://phabricator.wikimedia.org/T207052) (owner: 10Urbanecm) [12:24:46] (03CR) 10Muehlenhoff: [C: 032] Restrict ferm service package_builder_rsync to domain networks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/467976 (owner: 10Muehlenhoff) [12:30:37] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:31:18] 10Operations, 10SRE-Access-Requests: Requesting access to production servers (mwlog*, mmaint* ?) for kharlan - https://phabricator.wikimedia.org/T207330 (10Dzahn) Ok. The need for "sudo as www-data" wasn't obvious to me from the request or the linked wikitech page. I see it when looking at actual mwrepl source... [12:32:18] mc1035 again, seems mostly apis.. 
[12:32:28] should recover soon [12:32:37] 10Operations, 10SRE-Access-Requests: Requesting access to production servers (mwlog*, mmaint* ?) for kharlan - https://phabricator.wikimedia.org/T207330 (10Krenair) I think the idea is to prevent/dissuade people running mediawiki code under higher privileged accounts e.g. deployers/ops that could have access t... [12:35:07] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:42:01] (03PS12) 10Gehel: relforge: setup 2 instances to validate multi-instance configuration [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) [12:46:17] RECOVERY - puppet last run on ms-be2014 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [12:48:15] 10Operations, 10SRE-Access-Requests: Requesting access to production servers (mwlog*, mmaint* ?) for kharlan - https://phabricator.wikimedia.org/T207330 (10Krenair) btw, I'm still not convinced about the rules regarding sudo review in access requests. Some groups give users permissions just based on the files... [12:49:31] 10Operations, 10Graphite, 10Patch-For-Review, 10Performance-Team (Radar), 10Zuul: Improve graphite failover - https://phabricator.wikimedia.org/T88997 (10fgiunchedi) >>! In T88997#4676750, @hashar wrote: >>>! In T88997#4676648, @fgiunchedi wrote: >> Since zuul doesn't seem to use/need global statsd aggre... 
[12:51:06] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 600.51 seconds [12:55:14] !log bounce statsd-proxy on graphite1001 [12:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:00] !log installing libssh security updates [12:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:30] 10Operations, 10monitoring, 10Graphite, 10MW-1.27-release (WMF-deploy-2016-04-12_(1.27.0-wmf.21)), 10MW-1.27-release-notes: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141 (10fgiunchedi) We've been observing periodic elevated (>500/s) udp inerrors / buffer errors... [13:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181018T1300) [13:02:08] (03PS1) 10Muehlenhoff: Add library hint for libssh [puppet] - 10https://gerrit.wikimedia.org/r/468312 [13:04:29] (03CR) 10Filippo Giunchedi: [C: 04-1] Switch prometheus-ops rsync module to auto_ferm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/467990 (owner: 10Muehlenhoff) [13:05:16] (03CR) 10Muehlenhoff: [C: 032] Add library hint for libssh [puppet] - 10https://gerrit.wikimedia.org/r/468312 (owner: 10Muehlenhoff) [13:07:56] (03CR) 10Volans: [C: 032] mediawiki: kill also HHVM on stop_cronjobs [software/spicerack] - 10https://gerrit.wikimedia.org/r/467313 (https://phabricator.wikimedia.org/T207014) (owner: 10Volans) [13:09:30] (03Merged) 10jenkins-bot: mediawiki: kill also HHVM on stop_cronjobs [software/spicerack] - 10https://gerrit.wikimedia.org/r/467313 (https://phabricator.wikimedia.org/T207014) (owner: 10Volans) [13:10:04] (03PS13) 10Gehel: relforge: setup 2 instances to validate multi-instance configuration [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) [13:16:42] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - 
https://phabricator.wikimedia.org/T207189 (10MoritzMuehlenhoff) [13:16:46] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10MoritzMuehlenhoff) [13:20:35] (03PS3) 10BBlack: interface::rps: always be NUMA aware [puppet] - 10https://gerrit.wikimedia.org/r/467469 [13:20:37] (03PS1) 10BBlack: interface::rps: strict single CPU core per queue [puppet] - 10https://gerrit.wikimedia.org/r/468313 [13:22:06] ACKNOWLEDGEMENT - DPKG on labsdb1009 is CRITICAL: DPKG CRITICAL dpkg reports broken packages Banyek checking on that [13:25:04] ^ It's weird, because when I check those on host, nothing seems broken, but if I set the check for recheck it still red in nagios [13:26:15] banyek: see log [13:26:41] (03PS19) 10Alex Monk: Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 (https://phabricator.wikimedia.org/T194962) [13:26:43] (03PS1) 10Alex Monk: certcentral: Add first domain for testing in prod [puppet] - 10https://gerrit.wikimedia.org/r/468315 (https://phabricator.wikimedia.org/T199711) [13:27:06] banyek: let me check, I installed new kernels there for https://phabricator.wikimedia.org/T207377 [13:28:26] mutante (when you're up), https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/467011/ assumes the availability of libmonitoring-plugin-perl on all distros, which it isn't on Trusty. 
[13:28:40] (03CR) 10Filippo Giunchedi: [C: 032] "PCC" [puppet] - 10https://gerrit.wikimedia.org/r/431595 (https://phabricator.wikimedia.org/T147326) (owner: 10Filippo Giunchedi) [13:28:48] (03CR) 10Filippo Giunchedi: [C: 032] "PCC https://puppet-compiler.wmflabs.org/compiler1002/13051/cloudservices1003.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/431595 (https://phabricator.wikimedia.org/T147326) (owner: 10Filippo Giunchedi) [13:28:56] (03PS7) 10Filippo Giunchedi: wmcs: add prometheus-memcached-exporter [puppet] - 10https://gerrit.wikimedia.org/r/431595 (https://phabricator.wikimedia.org/T147326) [13:29:56] (03CR) 10Ottomata: "OOo thanks :/" [puppet] - 10https://gerrit.wikimedia.org/r/468211 (owner: 10Elukey) [13:30:16] mutante: T207387 [13:30:16] T207387: Puppet failures on trusty due to libmonitoring-plugin-perl - https://phabricator.wikimedia.org/T207387 [13:30:46] on labsdb1005 there was the same error, but it went away in a few minutes [13:30:57] (03CR) 10Ottomata: [C: 031] profile::analytics::refinery::job::sqoop_mw: move to timers [puppet] - 10https://gerrit.wikimedia.org/r/468251 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [13:31:19] (03Abandoned) 10Andrew Bogott: base: include ::base::auto_restarts on Trusty instances [puppet] - 10https://gerrit.wikimedia.org/r/468179 (owner: 10Andrew Bogott) [13:31:22] ah, I think I know what happened, I was wondering why the host had jessie packages, then I realised it was dist-upgraded to stretch and ran "aptitude" to display the list of outdated packages [13:32:09] andrewbogott: FYI going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/431595 and disable puppet on cloudservices / labtestweb to apply the patch in steps, no impact expected [13:32:09] and aptitude does an auto-removal, so the python-chardet and python-dateutil were removed (previously pulled in for salt, but now obsolete) [13:32:27] but aptitude didn't remove this properly, but left them in "ri" state in dpkg 
[13:32:29] fixing that [13:33:01] godog: ok! Will it work on Trusty? (I ask because I've just hit two Trusty breakages in a row) [13:33:06] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [13:33:27] RECOVERY - DPKG on labsdb1009 is OK: All packages OK [13:33:41] andrewbogott: good question, checking [13:33:48] thx [13:34:37] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 252.05 seconds [13:35:17] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [13:37:51] (03CR) 10Gehel: [C: 031] "LGTM" [software/cumin] - 10https://gerrit.wikimedia.org/r/465612 (owner: 10Volans) [13:37:56] andrewbogott: actually I think none of the hosts this is going to be applied are trusty: cloudservices[1003-1004].wikimedia.org,labtestservices2002.wikimedia.org,labtestweb2001.wikimedia.org,labweb[1001-1002].wikimedia.org [13:38:18] (03CR) 10Volans: [C: 032] PuppetDB: fix regex matching [software/cumin] - 10https://gerrit.wikimedia.org/r/465612 (owner: 10Volans) [13:38:28] godog: as long as it isn't landing on cloudservices1001 then that seems fine [13:38:35] or e.g. labcontrol1001 [13:38:54] 10Operations, 10Traffic: Document eqsin power connections in Netbox - https://phabricator.wikimedia.org/T207138 (10faidon) Awesome, thanks! No field for Cable IDs or labels is a bit disappointing :( It doesn't look like we can do it with a custom field either, but I'm not 100% sure. We should file an upstream... 
[13:39:14] andrewbogott: doesn't look like it no [13:39:24] 'k [13:41:31] (03Merged) 10jenkins-bot: PuppetDB: fix regex matching [software/cumin] - 10https://gerrit.wikimedia.org/r/465612 (owner: 10Volans) [13:42:37] ah yes of course ferms bails on the @resolve() AAAA [13:42:49] (03CR) 10jenkins-bot: PuppetDB: fix regex matching [software/cumin] - 10https://gerrit.wikimedia.org/r/465612 (owner: 10Volans) [13:44:27] PROBLEM - Check systemd state on labtestservices2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:44:40] (03PS20) 10Alex Monk: Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 (https://phabricator.wikimedia.org/T194962) [13:44:55] yeah that's me, used labtestservices to test [13:45:14] I'll revert for now [13:46:13] XioNoX, arturo, I just tried launching a test VM on cloudvirt1018 and it can't reach the network. Would you expect that to work or are we waiting on the switch move before trying to fix this? (I'm OK either way) [13:46:26] (03PS1) 10Filippo Giunchedi: Revert "wmcs: add prometheus-memcached-exporter" [puppet] - 10https://gerrit.wikimedia.org/r/468318 [13:46:45] (03CR) 10Filippo Giunchedi: [C: 032] Revert "wmcs: add prometheus-memcached-exporter" [puppet] - 10https://gerrit.wikimedia.org/r/468318 (owner: 10Filippo Giunchedi) [13:46:55] 10Operations, 10Cloud-Services, 10netops: Consider renumbering Labs to separate address spaces - https://phabricator.wikimedia.org/T122406 (10faidon) 05Open>03Resolved a:03faidon Perfect! As far as I can see, there a few pending tasks, but are or should probably be covered in other tasks. Specificall... 
[13:47:09] network should already work [13:48:42] ok — they don't, but I'll defer to arturo on details because I need breakfast :) [13:51:18] !log installing libidn security updates [13:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:46] (03PS1) 10Gehel: tlsproxy::localssl: allow mutliple proxies with the same certificate [puppet] - 10https://gerrit.wikimedia.org/r/468320 (https://phabricator.wikimedia.org/T198352) [13:56:23] (03PS21) 10Alex Monk: Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 (https://phabricator.wikimedia.org/T194962) [14:01:09] (03Abandoned) 10Paladox: Planet: Fix cron to update feeds [puppet] - 10https://gerrit.wikimedia.org/r/468002 (owner: 10Paladox) [14:02:36] ok will take a look after meeting [14:02:55] (03PS1) 10Gehel: wdqs: extract a custom type for deploy mode [puppet] - 10https://gerrit.wikimedia.org/r/468321 [14:04:07] (03PS22) 10Alex Monk: Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 (https://phabricator.wikimedia.org/T194962) [14:04:58] (03CR) 10Gehel: "puppet compiler agrees this is a noop: https://puppet-compiler.wmflabs.org/compiler1002/13053/" [puppet] - 10https://gerrit.wikimedia.org/r/468321 (owner: 10Gehel) [14:08:32] !log temporarily bump receive socket memory for statsd-proxy on graphite1001 and bounce the service [14:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:08] (03CR) 10Vgutierrez: "pcc looks happy and healthy: https://puppet-compiler.wmflabs.org/compiler1002/13054/" [puppet] - 10https://gerrit.wikimedia.org/r/459809 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [14:11:36] (03PS1) 10Gehel: prometheus::class_config: fix deprecation warnings [puppet] - 10https://gerrit.wikimedia.org/r/468323 [14:11:56] !log ditto for statsite instances on graphite1001, temporarily bump receive socket memory to 1MB and bounce the service [14:11:57] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:17] RECOVERY - Check systemd state on labtestservices2002 is OK: OK - running: The system is fully operational [14:13:54] !log corrections to the statements above, graphite1004 not graphite1001 [14:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:25] (03PS2) 10Gehel: prometheus::class_config: fix deprecation warnings [puppet] - 10https://gerrit.wikimedia.org/r/468323 [14:21:28] !log shutting down mysql and powering down db2042 (T202051) [14:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:30] T202051: db2042 (m3) master RAID battery failed - https://phabricator.wikimedia.org/T202051 [14:22:03] (03PS3) 10Gehel: prometheus::class_config: fix deprecation warnings [puppet] - 10https://gerrit.wikimedia.org/r/468323 [14:22:31] !log begin reformat of ms-be2041 - T199198 [14:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:35] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [14:22:54] 10Operations, 10ops-codfw, 10DBA, 10User-Banyek: db2042 (m3) master RAID battery failed - https://phabricator.wikimedia.org/T202051 (10Banyek) [14:23:57] (03PS14) 10Gehel: relforge: setup 2 instances to validate multi-instance configuration [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) [14:24:57] (03CR) 10Dzahn: [C: 04-1] "since this is stalled (per ticket comments) for a while, i will remove myself from reviewers. please re-add me once it's ready to merge." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467528 (https://phabricator.wikimedia.org/T207052) (owner: 10Urbanecm) [14:25:08] (03CR) 10Dzahn: [C: 04-2] "since this is stalled (per ticket comments) for a while, i will remove myself from reviewers. please re-add me once it's ready to merge." 
[dns] - 10https://gerrit.wikimedia.org/r/467425 (https://phabricator.wikimedia.org/T207052) (owner: 10Urbanecm) [14:25:23] your alert v [14:26:06] (03PS4) 10Gehel: prometheus::class_config: fix deprecation warnings [puppet] - 10https://gerrit.wikimedia.org/r/468323 [14:28:16] !log temporarily bump default socket receive memory to 1MB on graphite1001, restart statsd-proxy and statsite [14:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:59] (03PS5) 10Gehel: prometheus::class_config: fix deprecation warnings [puppet] - 10https://gerrit.wikimedia.org/r/468323 [14:31:52] 10Operations, 10ops-eqiad, 10Dumps-Generation: Move dumpsdata1001 - https://phabricator.wikimedia.org/T207278 (10ArielGlenn) Let's shoot for Oct 31 then. @ayounsi What time would the window be? @hoo would you prefer stopping and restarting scripts or just skipping the run for the week? [14:34:03] !log remove labvirt1018 from debmonitor (T207317) [14:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:07] T207317: Rename labvirt1018 to cloudvirt1018, move to eqiad1 - https://phabricator.wikimedia.org/T207317 [14:34:40] 10Operations, 10SRE-Access-Requests: Requesting access to production servers (mwlog*, mmaint* ?) for kharlan - https://phabricator.wikimedia.org/T207330 (10kaldari) You might as well give kosta `restricted`. He's starting deployment training and will probably be requesting `deployment` by the end of the year.... [14:35:27] PROBLEM - Host db2042.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:39:27] RECOVERY - Filesystem available is greater than filesystem size on ms-be2041 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2041&var-datasource=codfw%2520prometheus%252Fops [14:40:12] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2051 - https://phabricator.wikimedia.org/T207212 (10Papaul) a:05Papaul>03Banyek Disk replacement complete [14:41:56] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, 10User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10herron) Sounds good. Do you need any help from me to test? I don't have access to `shinken-01.shinken.eqiad.wmflabs` [14:42:27] (03PS1) 10Dzahn: admins: add kharlan to 'restricted' group [puppet] - 10https://gerrit.wikimedia.org/r/468327 (https://phabricator.wikimedia.org/T207330) [14:44:05] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2051 - https://phabricator.wikimedia.org/T207212 (10Banyek) 05Open>03Resolved Perfect, thank you! The logical drive is getting rebuilded: ``` Smart Array P420i in Slot 0 (Embedded) array A Logical Drive: 1 Size: 3.3 TB Fa... [14:44:34] 10Operations, 10Developer-Advocacy, 10Discourse, 10Epic: Bring discourse.mediawiki.org to production - https://phabricator.wikimedia.org/T180853 (10Qgil) No puppetized at all. It has been installed and it is being maintained manually, following Discourse's own ways to install and update. [14:46:31] !log installing tomcat8 security updates [14:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:26] 10Operations, 10Wikimedia-Mailing-lists: Make aklapper a co-admin of the listadmins@ mailing list - https://phabricator.wikimedia.org/T207239 (10Aklapper) [14:48:19] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production servers (mwlog*, mmaint* ?) 
for kharlan - https://phabricator.wikimedia.org/T207330 (10Dzahn) p:05Triage>03Normal [14:49:26] PROBLEM - Host analytics1068.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:49:47] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, 10User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10Krenair) I think we'd just need to disable puppet and change the exim config to use the new host. Although Andrew (?) is working on puppet there at th... [14:50:48] (03PS23) 10BBlack: Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [14:51:11] I cannot connect to analytics1068, the real host iface either [14:51:41] it's been a dead host for a while, I assume someone's working on fixing it or something [14:51:46] RECOVERY - Host db2042.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.66 ms [14:51:55] it routinely has been coming up as the only failure in: cumin '*' 'foo' [14:51:57] for a while now [14:52:26] RECOVERY - Device not healthy -SMART- on db2051 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2051&var-datasource=codfw%2520prometheus%252Fops [14:52:33] I see T203244 [14:52:33] T203244: analytics1068 doesn't boot - https://phabricator.wikimedia.org/T203244 [14:52:45] "analytics1068 is broke...it will not get past loading bios drivers during the post." [14:52:54] maybe chris is unracking it or something [14:53:17] it is handled, so moving on [14:54:31] I am jumping through Dell hoops [14:54:50] (03PS24) 10BBlack: Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [14:55:29] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2051 - https://phabricator.wikimedia.org/T207212 (10Marostegui) 05Resolved>03Open Leave it open until it finally gets rebuilt. 
They fail quite often unfortunately specially on old hosts and they need Papaul or Chris to pull the disk out and then back in [14:56:25] (03PS6) 10Gehel: prometheus::class_config: fix deprecation warnings [puppet] - 10https://gerrit.wikimedia.org/r/468323 [14:57:33] (03CR) 10Jcrespo: [C: 031] icinga: remove check_lonqueries.pl [puppet] - 10https://gerrit.wikimedia.org/r/468066 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [14:58:31] (03PS2) 10Alex Monk: certcentral: Add first domain for testing in prod [puppet] - 10https://gerrit.wikimedia.org/r/468315 (https://phabricator.wikimedia.org/T199711) [14:59:31] (03CR) 10BBlack: [C: 031] Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [15:00:16] RECOVERY - Host analytics1068.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.79 ms [15:00:24] (03CR) 10Gehel: "puppet compiler agrees this is functionally a noop: https://puppet-compiler.wmflabs.org/compiler1002/13060/" [puppet] - 10https://gerrit.wikimedia.org/r/468323 (owner: 10Gehel) [15:02:34] (03PS5) 10Dzahn: icinga: remove check_lonqueries.pl [puppet] - 10https://gerrit.wikimedia.org/r/468066 (https://phabricator.wikimedia.org/T202782) [15:04:06] PROBLEM - Host db2042.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:04:29] (03CR) 10Dzahn: [C: 032] icinga: remove check_lonqueries.pl [puppet] - 10https://gerrit.wikimedia.org/r/468066 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [15:04:49] (03PS2) 10Gehel: tlsproxy::localssl: allow mutliple proxies with the same certificate [puppet] - 10https://gerrit.wikimedia.org/r/468320 (https://phabricator.wikimedia.org/T198352) [15:04:51] (03PS15) 10Gehel: relforge: setup 2 instances to validate multi-instance configuration [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) [15:07:16] PROBLEM - Host analytics1068.mgmt is DOWN: PING CRITICAL - Packet loss = 
100% [15:08:28] (03CR) 10Gehel: "puppet compiler looks reasonable: https://puppet-compiler.wmflabs.org/compiler1002/13062/" [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) (owner: 10Gehel) [15:09:27] RECOVERY - Host db2042.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.66 ms [15:12:32] (03CR) 10Gehel: "Puppet compiler agrees this is a noop, but some of those code path are not exercised in our current configuration. Careful review is welco" [puppet] - 10https://gerrit.wikimedia.org/r/468320 (https://phabricator.wikimedia.org/T198352) (owner: 10Gehel) [15:12:56] 10Operations, 10DBA, 10Patch-For-Review: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (10Dzahn) icinga will stop using the mysql module once on stretch. Once T202782 is resolved and einsteinium isn't the prod Icinga server... [15:16:17] (03CR) 10Cwhite: "I think Krinkle and akosiaris are proposing 'sed -i 's/graphite_server/statsd_server/' modules/ores/manifests/web.pp', is that correct?" [puppet] - 10https://gerrit.wikimedia.org/r/468182 (https://phabricator.wikimedia.org/T88997) (owner: 10Awight) [15:19:54] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Yes that's correct. 
$graphite_server was a bad name for the variable anyway since as it's obvious on line 52" [puppet] - 10https://gerrit.wikimedia.org/r/468182 (https://phabricator.wikimedia.org/T88997) (owner: 10Awight) [15:23:07] PROBLEM - MariaDB Slave Lag: s3 on db2074 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 495.44 seconds [15:23:08] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 504.55 seconds [15:23:08] PROBLEM - MariaDB Slave Lag: s3 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 498.15 seconds [15:23:17] PROBLEM - MariaDB Slave Lag: s3 on db2050 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 501.45 seconds [15:23:42] That is Amir1's script ^ [15:23:53] It is being discussed on -databases [15:23:58] PROBLEM - MariaDB Slave Lag: s3 on db2043 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 520.36 seconds [15:24:25] Just affecting codfw [15:24:50] thanks marostegui :) [15:25:41] (03PS1) 10Gerrit Patch Uploader: Add dty, gor, inh, kbp and lfn to InterwikiSortOrders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468337 [15:25:43] (03CR) 10Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/468337 (owner: 10Gerrit Patch Uploader) [15:28:08] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 870.61 seconds [15:28:17] RECOVERY - Host analytics1068.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.14 ms [15:28:18] RECOVERY - MariaDB Slave Lag: s3 on db2043 is OK: OK slave_sql_lag Replication lag: 57.53 seconds [15:28:28] RECOVERY - MariaDB Slave Lag: s3 on db2074 is OK: OK slave_sql_lag Replication lag: 24.53 seconds [15:28:38] RECOVERY - MariaDB Slave Lag: s3 on db2094 is OK: OK slave_sql_lag Replication lag: 0.09 seconds [15:28:38] RECOVERY - MariaDB Slave Lag: s3 on db2050 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [15:29:31] cmjohnson1: o/ - are you working on an1068 by any chance? [15:29:38] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [15:29:43] (I saw a ping down alarm) [15:29:45] i am [15:30:01] super thanks, just wanted to know if it was doing something weird or not :) [15:30:03] but we're not any closer to getting this fixed...Dell is making it very difficult for me [15:30:09] :( [15:30:11] How surprising... 
[15:30:13] thanks for all the patience [15:30:16] I tried self dispatch again.....hoping that works [15:31:14] I'm going to roll forward with group1 to 1.32.0-wmf.26 since that didn't happen yesterday [15:31:32] (03PS1) 1020after4: group1 wikis to 1.32.0-wmf.26 refs T191072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468338 [15:31:34] (03CR) 1020after4: [C: 032] group1 wikis to 1.32.0-wmf.26 refs T191072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468338 (owner: 1020after4) [15:32:58] (03Merged) 10jenkins-bot: group1 wikis to 1.32.0-wmf.26 refs T191072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468338 (owner: 1020after4) [15:33:55] (03CR) 10jenkins-bot: group1 wikis to 1.32.0-wmf.26 refs T191072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468338 (owner: 1020after4) [15:35:11] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, 10User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10GTirloni) Sorry if this was discussed before but I'm seeing the following in messages sent through mx-out01: ``` Return-Path: From... [15:35:38] (03CR) 10Pikne: "Sidenote: people are still trying to adjust the sorting order via meta.wikipedia.org, thoguh apparently it no longer works. 
So it'd nice i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468337 (owner: 10Gerrit Patch Uploader) [15:35:39] !log twentyafterfour@deploy1001 scap failed: average error rate on 6/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details) [15:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:01] ugh [15:36:03] Argument 2 passed to Wikibase\Lib\Store\Sql\WikiPageEntityMetaDataLookup::__construct() must implement interface Wikibase\Lib\Store\Sql\PageTableEntityQuery, MediaWiki\Storage\NameTableStore given in /srv/mediawiki/php-1.32.0-wmf.26 [15:36:05] (03PS8) 10Elukey: profile::analytics::refinery::job::sqoop_mw: move to timers [puppet] - 10https://gerrit.wikimedia.org/r/468251 (https://phabricator.wikimedia.org/T172532) [15:36:25] rolling back [15:36:33] (03PS1) 1020after4: group1 wikis to 1.32.0-wmf.24 refs T191072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468339 [15:36:35] (03CR) 1020after4: [C: 032] group1 wikis to 1.32.0-wmf.24 refs T191072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468339 (owner: 1020after4) [15:37:45] (03Merged) 10jenkins-bot: group1 wikis to 1.32.0-wmf.24 refs T191072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468339 (owner: 1020after4) [15:39:23] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.32.0-wmf.24 refs T191072 [15:39:58] 10Operations, 10ops-eqiad: Degraded RAID on aqs1006 - https://phabricator.wikimedia.org/T206915 (10Cmjohnson) response for HPE hank you so much for updating this case. This is regarding case number: 5333327393. AHS logs is not showing any hard drives. Can you please confirm if the hard drive is located on... 
[15:40:03] grrr [15:40:17] !log twentyafterfour@deploy1001 Synchronized php: group1 wikis to 1.32.0-wmf.24 refs T191072 (duration: 00m 53s) [15:40:28] twentyafterfour@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [15:40:29] T191072: 1.32.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T191072 [15:41:34] twentyafterfour@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [15:41:45] wth stashbot [15:41:57] (03CR) 10Elukey: [C: 032] profile::analytics::refinery::job::sqoop_mw: move to timers [puppet] - 10https://gerrit.wikimedia.org/r/468251 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [15:42:00] !log twentyafterfour@deploy1001 Synchronized php: group1 wikis to 1.32.0-wmf.24 refs T191072 (duration: 00m 53s) [15:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:26] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) Here is where we are with this server.... - initial order had the wrong raid controller, didn't see all 10 disks - received the new raid controller but then we started getting bad battery errors... [15:48:02] cmjohnson1: is there any chance that this is some kind of horror-movie scenario where Dell and HP are conspiring to drive you into madness? [15:48:51] (03CR) 10jenkins-bot: group1 wikis to 1.32.0-wmf.24 refs T191072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468339 (owner: 1020after4) [15:50:19] !log disabling checks on cloudvirt1019 for maintenance [15:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:31] andrewbogott: I wish I was that important! [15:51:51] !log trunk cloud-instances2-b-eqiad between asw-b-eqiad and asw2-b-eqiad [15:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:01] hi! anyone know what the train status/plan is for group1? 
I heard the train was delayed yesterday... thx!!!! [15:58:17] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 272.98 seconds [15:59:37] (03PS1) 10Faidon Liambotis: base: move pxz to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/468348 [15:59:39] (03CR) 10Vgutierrez: [C: 031] "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1002/13065/ bblack is happy, we are all happy! \o/ I'm all in to LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/459809 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [15:59:44] jynus_: ^^ ( https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/468348 ) [15:59:48] 10Operations, 10ops-codfw, 10DBA, 10User-Banyek: db2042 (m3) master RAID battery failed - https://phabricator.wikimedia.org/T202051 (10Banyek) 05stalled>03Resolved a:03Banyek @Papaul did power drain that fixed the battery status. We tried our spare battery in this host as well (T205257) but it doesn... [16:00:04] godog and _joe_: Your horoscope predicts another unfortunate Puppet SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181018T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:59] (03CR) 10Jcrespo: [C: 031] base: move pxz to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/468348 (owner: 10Faidon Liambotis) [16:01:21] thx! [16:01:22] 10Operations, 10ops-codfw, 10DBA, 10User-Banyek: db2042 (m3) master RAID battery failed - https://phabricator.wikimedia.org/T202051 (10Marostegui) if it fails again, I suggest we go for a DC failover. 
[16:01:25] paravoid: the if was moritzm and I trust him to have a reason [16:01:27] (CR) Faidon Liambotis: [C: 2] base: move pxz to standard packages [puppet] - https://gerrit.wikimedia.org/r/468348 (owner: Faidon Liambotis) [16:01:29] at the time [16:10:00] Operations, ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (Cmjohnson) The logs were sent to HPE [16:10:13] RECOVERY - Device not healthy -SMART- on cloudvirt1019 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudvirt1019&var-datasource=eqiad%2520prometheus%252Fops [16:10:29] Operations, ops-eqiad: Degraded RAID on aqs1006 - https://phabricator.wikimedia.org/T206915 (Cmjohnson) This will require the server to go down for about 20mins [16:11:22] Operations, ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T202705 (Cmjohnson) a:ArielGlenn>Dzahn @dzahn can you help put the disk back into the raid cfg please [16:13:06] Operations, ops-eqiad, Dumps-Generation: Move dumpsdata1001 - https://phabricator.wikimedia.org/T207278 (Cmjohnson) @ayounsi and @ArielGlenn Oct 31 will be great...can we do 12 or 1pm Eastern? [16:16:01] Operations, ops-eqiad, Dumps-Generation: Move dumpsdata1001 - https://phabricator.wikimedia.org/T207278 (ArielGlenn) That time would be ok for me (my evening but it's not too late). @hoo? [16:18:21] Amir1: anomie I finished all checks on pooled replicas of s8, you are free to continue deploys there [16:18:47] Operations, Wikimedia-Mailing-lists: Make aklapper a co-admin of the listadmins@ mailing list - https://phabricator.wikimedia.org/T207239 (Aklapper) Thanks a lot! For the records, I've appended the lines `By sending a message to this list, you email all admins of all lists. To request technical changes...
[16:18:49] jynus_: Thanks [16:19:15] jynus_: Thank you [16:21:56] (03CR) 10Mobrovac: scap::target: added additional_services_names param (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) (owner: 10Mathew.onipe) [16:29:26] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudstore1008 & cloudstore1009 - https://phabricator.wikimedia.org/T193655 (10Bstorm) I agree on the EDD message. Thanks for the pointers, I'll go poke more things shortly. [16:31:28] !log mobrovac@deploy1001 Started deploy [restbase/deploy@6c879fa]: Have 100% of traffic directed to Proton as well - T186748 [16:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:32] T186748: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 [16:34:08] (03PS25) 10Vgutierrez: Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [16:35:35] (03CR) 10Vgutierrez: [C: 032] Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [16:37:28] ACKNOWLEDGEMENT - HP RAID on cloudvirt1019 is CRITICAL: CRITICAL: Slot 1: Failed: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8 - OK: 2I:1:1, 2I:1:2, 2I:1:3, 2I:1:4, 2I:2:1, 2I:2:2 - Controller: OK - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T207398 [16:39:41] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM! Note that in this case PCC's detection of changes is limited (I think!) because we're dealing with exported resources, thus e.g. 
the" [puppet] - 10https://gerrit.wikimedia.org/r/468323 (owner: 10Gehel) [16:43:21] (03CR) 10Gehel: "> Patch Set 6: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/468323 (owner: 10Gehel) [16:45:23] PROBLEM - puppet last run on authdns1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:48:33] RECOVERY - HP RAID on db2051 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK [16:48:35] (03PS1) 10Alex Monk: certcentral: Try to fix some formatting around the config yaml file [puppet] - 10https://gerrit.wikimedia.org/r/468358 [16:49:27] (03PS9) 10Herron: site: enable logging Kafka on Logstash nodes [puppet] - 10https://gerrit.wikimedia.org/r/465167 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [16:49:33] (03PS1) 10Vgutierrez: authdns: Fix certcentral_target keyholder public key path [puppet] - 10https://gerrit.wikimedia.org/r/468359 [16:50:12] (03CR) 10Filippo Giunchedi: [C: 031] Disable prometheus rsyncd module for now [puppet] - 10https://gerrit.wikimedia.org/r/467991 (owner: 10Muehlenhoff) [16:52:20] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@6c879fa]: Have 100% of traffic directed to Proton as well - T186748 (duration: 20m 52s) [16:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:24] T186748: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 [16:55:45] !log twentyafterfour@deploy1001 Synchronized php-1.32.0-wmf.26/extensions/WikibaseQualityConstraints/src/ServiceWiring.php: sync https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikibaseQualityConstraints/+/468352/ refs T207394 (duration: 00m 54s) [16:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:48] T207394: Argument 2 passed to 
Wikibase\Lib\Store\Sql\WikiPageEntityMetaDataLookup::__construct() must implement interface Wikibase\Lib\Store\Sql\PageTableEntityQuery, MediaWiki\Storage\NameTableStore given in /srv/mediawiki/php-1.32.0-wmf.26/extensions/WikibaseQualityConstraints/src/ServiceWiring.php on line 254 - https://phabricator.wikimedia.org/T207394 [16:55:54] (03CR) 10Vgutierrez: [C: 032] authdns: Fix certcentral_target keyholder public key path [puppet] - 10https://gerrit.wikimedia.org/r/468359 (owner: 10Vgutierrez) [16:56:17] (03PS1) 1020after4: group1 wikis to 1.32.0-wmf.26 refs T191072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468360 [16:56:19] (03CR) 1020after4: [C: 032] group1 wikis to 1.32.0-wmf.26 refs T191072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468360 (owner: 1020after4) [16:56:46] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team (Kanban): Decommission labstore100[12] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10Bstorm) [16:57:32] (03Merged) 10jenkins-bot: group1 wikis to 1.32.0-wmf.26 refs T191072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468360 (owner: 1020after4) [16:57:48] !log enabling kafka on logstash elasticsearch cluster T206454 [16:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:51] T206454: Setup Kafka cluster, producers and consumers for logging pipeline - https://phabricator.wikimedia.org/T206454 [16:58:09] (03CR) 10Herron: [C: 032] site: enable logging Kafka on Logstash nodes [puppet] - 10https://gerrit.wikimedia.org/r/465167 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [16:58:13] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10fdans) [16:58:18] (03PS10) 10Herron: site: enable logging Kafka on Logstash nodes [puppet] - 10https://gerrit.wikimedia.org/r/465167 
(https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [16:59:16] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.32.0-wmf.26 refs T191072 [16:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:36] T191072: 1.32.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T191072 [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181018T1700). [17:00:10] !log twentyafterfour@deploy1001 Synchronized php: group1 wikis to 1.32.0-wmf.26 refs T191072 (duration: 00m 53s) [17:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:24] I'm deploying ores (logstash stuff) [17:00:48] no parsoid deploy today [17:00:53] RECOVERY - puppet last run on authdns1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:03:26] (03CR) 10Vgutierrez: [C: 032] certcentral: Try to fix some formatting around the config yaml file [puppet] - 10https://gerrit.wikimedia.org/r/468358 (owner: 10Alex Monk) [17:03:36] (03PS2) 10Vgutierrez: certcentral: Try to fix some formatting around the config yaml file [puppet] - 10https://gerrit.wikimedia.org/r/468358 (owner: 10Alex Monk) [17:04:18] 10Operations, 10DNS, 10GitHub-Mirrors, 10Traffic, and 2 others: Github: add verified domain - https://phabricator.wikimedia.org/T207364 (10Reedy) [17:04:24] PROBLEM - MariaDB Slave Lag: s3 on db2036 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.60 seconds [17:05:03] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 339.70 seconds [17:05:35] (03PS1) 10Andrew Bogott: cloud puppetmaster: allow designate in eqiad1 to clean puppet certs [puppet] - 10https://gerrit.wikimedia.org/r/468361 [17:06:24] 
RECOVERY - MariaDB Slave Lag: s3 on db2036 is OK: OK slave_sql_lag Replication lag: 5.09 seconds [17:06:54] (CR) jenkins-bot: group1 wikis to 1.32.0-wmf.26 refs T191072 [mediawiki-config] - https://gerrit.wikimedia.org/r/468360 (owner: 20after4) [17:07:03] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 16.14 seconds [17:07:34] RECOVERY - Check systemd state on certcentral1001 is OK: OK - running: The system is fully operational [17:07:51] (PS2) Awight: Use the newer statsd name for ORES nodes [puppet] - https://gerrit.wikimedia.org/r/468182 (https://phabricator.wikimedia.org/T88997) [17:08:13] PROBLEM - puppet last run on logstash1004 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): Service[confluent-kafka],Service[confluent-kafka-connect],Service[confluent-zookeeper] [17:08:41] Krenair: this might not be you but we're seeing cronspam cron certcentral1001: [17:08:55] /bin/systemctl reload certcentral [17:09:02] certcentral.service is not active, cannot reload. [17:09:03] apergos I have no production access.
[17:09:14] !log ladsgroup@deploy1001 Started deploy [ores/deploy@4ac4c8b]: Logstash support for ores: T181546 T169586 T168921 T181630 T205256 [17:09:15] but people are working on that host in -traffic [17:09:16] in case you know anything about what that should be doing [17:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:25] T168921: Send error logs to logstash - https://phabricator.wikimedia.org/T168921 [17:09:25] T181630: Send celery and wsgi service logs to logstash - https://phabricator.wikimedia.org/T181630 [17:09:25] T169586: Send celery logs and events to logstash - https://phabricator.wikimedia.org/T169586 [17:09:26] T181546: Let the ORES application set log severity, not uWSGI - https://phabricator.wikimedia.org/T181546 [17:09:26] T205256: ORES uwsgi logs in logstash are useless - https://phabricator.wikimedia.org/T205256 [17:09:35] ok they may already know, thanks [17:09:41] I also think I know what that is apergos [17:09:47] oh! [17:10:02] in that case, I'm all ears! [17:10:13] there's a cron that's supposed to reload the service, except it's known broken in prod right now because we haven't added anything to the config [17:10:25] so the service won't even start [17:10:39] first config entry should go in tomorrow morning hopefully [17:10:59] that's probably what it is [17:10:59] Krenair: latest commit fixed that [17:11:01] until then the cron could be disabled [17:11:09] yeah.. I've disabled it this morning [17:11:18] I just got a spam from it at uh [17:11:22] but running puppet in certcentral1001 re-enabled it [17:11:29] 11 minutes ago [17:11:29] I know I know [17:11:46] yeah if you don't do it in puppet it will just come back... [17:11:58] it's now fixed though [17:11:58] certcen+ 10391 0.1 1.7 95864 36332 ? Ss 17:07 0:00 /usr/bin/python3 /usr/bin/certcentral-backend [17:12:04] certcentral is up and running [17:12:08] ok! 
[17:12:13] so you shouldn't get any more cronspam [17:12:28] (PS2) Andrew Bogott: cloud puppetmaster: allow designate in eqiad1 to clean puppet certs [puppet] - https://gerrit.wikimedia.org/r/468361 [17:12:30] \o/ thanks! [17:13:01] (CR) jerkins-bot: [V: -1] cloud puppetmaster: allow designate in eqiad1 to clean puppet certs [puppet] - https://gerrit.wikimedia.org/r/468361 (owner: Andrew Bogott) [17:13:14] (PS1) Herron: Revert "site: enable logging Kafka on Logstash nodes" [puppet] - https://gerrit.wikimedia.org/r/468362 [17:13:52] (CR) jerkins-bot: [V: -1] Revert "site: enable logging Kafka on Logstash nodes" [puppet] - https://gerrit.wikimedia.org/r/468362 (owner: Herron) [17:14:50] (PS3) Andrew Bogott: cloud puppetmaster: allow designate in eqiad1 to clean puppet certs [puppet] - https://gerrit.wikimedia.org/r/468361 [17:15:10] (CR) Herron: [C: 2] "reverting because puppet errors out with:" [puppet] - https://gerrit.wikimedia.org/r/465167 (https://phabricator.wikimedia.org/T206454) (owner: Filippo Giunchedi) [17:15:31] (CR) jerkins-bot: [V: -1] cloud puppetmaster: allow designate in eqiad1 to clean puppet certs [puppet] - https://gerrit.wikimedia.org/r/468361 (owner: Andrew Bogott) [17:15:52] (PS2) Herron: Revert "site: enable logging Kafka on Logstash nodes" [puppet] - https://gerrit.wikimedia.org/r/468362 [17:16:13] (CR) Ori.livneh: [C: 1] Enable base::service_auto_restart for nginx on debug proxies [puppet] - https://gerrit.wikimedia.org/r/466852 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [17:16:42] (CR) Herron: [C: 2] Revert "site: enable logging Kafka on Logstash nodes" [puppet] - https://gerrit.wikimedia.org/r/468362 (owner: Herron) [17:16:50] (PS3) Herron: Revert "site: enable logging Kafka on Logstash nodes" [puppet] - https://gerrit.wikimedia.org/r/468362 [17:17:17] e/msg debt [17:18:33] PROBLEM - very high load average likely xfs on ms-be1021
is CRITICAL: CRITICAL - load average: 173.88, 105.34, 51.60 [17:19:22] !log aborted enabling kafka on logstash elasticsearch cluster due to puppet errors. reverted change T206454 [17:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:26] T206454: Setup Kafka cluster, producers and consumers for logging pipeline - https://phabricator.wikimedia.org/T206454 [17:20:17] (03PS4) 10Andrew Bogott: cloud puppetmaster: allow designate in eqiad1 to clean puppet certs [puppet] - 10https://gerrit.wikimedia.org/r/468361 [17:20:24] PROBLEM - puppet last run on logstash1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:21:37] I'm deploying logstash stuff, if this is related, tell me to revert the deploy ^ [17:21:41] godog: akosiaris [17:22:08] (03PS5) 10Andrew Bogott: cloud puppetmaster: allow designate in eqiad1 to clean puppet certs [puppet] - 10https://gerrit.wikimedia.org/r/468361 [17:22:33] PROBLEM - Disk space on ms-be1021 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdi1 is not accessible: Input/output error [17:22:33] PROBLEM - MD RAID on ms-be1021 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 [17:22:34] ACKNOWLEDGEMENT - MD RAID on ms-be1021 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T207399 [17:22:38] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1021 - https://phabricator.wikimedia.org/T207399 (10ops-monitoring-bot) [17:22:54] PROBLEM - Check systemd state on ms-be1021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:23:26] (03CR) 10Awight: "@Krinkle thanks for spotting the bad variable name. PS2 includes a fix." 
[puppet] - 10https://gerrit.wikimedia.org/r/468182 (https://phabricator.wikimedia.org/T88997) (owner: 10Awight) [17:23:33] RECOVERY - puppet last run on logstash1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:23:39] (03CR) 10Andrew Bogott: [C: 032] cloud puppetmaster: allow designate in eqiad1 to clean puppet certs [puppet] - 10https://gerrit.wikimedia.org/r/468361 (owner: 10Andrew Bogott) [17:23:55] (03PS1) 10Elukey: mcrouter: allow to tune server timeout and timeouts until tko [puppet] - 10https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786) [17:24:38] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-fgiunchedi: Setup Kafka cluster, producers and consumers for logging pipeline - https://phabricator.wikimedia.org/T206454 (10herron) Sadly puppet threw these errors after merging https://gerrit.wikimedia.org/r/465167 so reverted with https://ge... [17:25:13] RECOVERY - very high load average likely xfs on ms-be1021 is OK: OK - load average: 23.85, 77.44, 62.31 [17:25:47] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, 10User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10faidon) Chipping in because I'm not sure if @herron is aware: tools (i.e. Toolforge) has its own (very) special exim configuration. A comparison with... 
[17:26:03] PROBLEM - swift-container-updater on ms-be1021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [17:26:24] RECOVERY - Check systemd state on certcentral2001 is OK: OK - running: The system is fully operational [17:28:09] 10Operations, 10Developer-Advocacy, 10Discourse, 10Epic: Bring discourse.mediawiki.org to production - https://phabricator.wikimedia.org/T180853 (10Aklapper) [17:29:02] 10Operations, 10Developer-Advocacy, 10Discourse, 10Epic: Bring discourse.mediawiki.org to production - https://phabricator.wikimedia.org/T180853 (10Aklapper) [17:30:48] 10Operations, 10Developer-Advocacy, 10Discourse, 10Epic: Bring discourse.mediawiki.org to production - https://phabricator.wikimedia.org/T180853 (10Aklapper) Thanks! Subtask created. [17:30:57] (03PS2) 10Elukey: mcrouter: allow to tune server timeout and timeouts until tko [puppet] - 10https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786) [17:33:00] !log ladsgroup@deploy1001 Finished deploy [ores/deploy@4ac4c8b]: Logstash support for ores: T181546 T169586 T168921 T181630 T205256 (duration: 23m 48s) [17:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:08] T168921: Send error logs to logstash - https://phabricator.wikimedia.org/T168921 [17:33:08] T181630: Send celery and wsgi service logs to logstash - https://phabricator.wikimedia.org/T181630 [17:33:09] T169586: Send celery logs and events to logstash - https://phabricator.wikimedia.org/T169586 [17:33:09] T181546: Let the ORES application set log severity, not uWSGI - https://phabricator.wikimedia.org/T181546 [17:33:10] T205256: ORES uwsgi logs in logstash are useless - https://phabricator.wikimedia.org/T205256 [17:33:20] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/13077/" [puppet] - 10https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786) (owner: 10Elukey) [17:33:51] (03CR) 
Smalyshev: [C: 1] wdqs: extract a custom type for deploy mode [puppet] - https://gerrit.wikimedia.org/r/468321 (owner: Gehel) [17:36:57] (CR) Cwhite: [C: 2] "Output looks good. https://puppet-compiler.wmflabs.org/compiler1002/13071/" [puppet] - https://gerrit.wikimedia.org/r/468182 (https://phabricator.wikimedia.org/T88997) (owner: Awight) [17:37:06] (PS3) Cwhite: Use the newer statsd name for ORES nodes [puppet] - https://gerrit.wikimedia.org/r/468182 (https://phabricator.wikimedia.org/T88997) (owner: Awight) [17:37:14] (PS2) Gehel: wdqs: extract a custom type for deploy mode [puppet] - https://gerrit.wikimedia.org/r/468321 [17:38:07] (CR) Gehel: [C: 2] wdqs: extract a custom type for deploy mode [puppet] - https://gerrit.wikimedia.org/r/468321 (owner: Gehel) [17:38:30] (PS3) Gehel: wdqs: extract a custom type for deploy mode [puppet] - https://gerrit.wikimedia.org/r/468321 [17:40:35] Operations, ops-eqiad, Dumps-Generation: Move dumpsdata1001 - https://phabricator.wikimedia.org/T207278 (hoo) >>! In T207278#4677853, @ArielGlenn wrote: > That time would be ok for me (my evening but it's not too late). @hoo? Yeah, that would work for me (I think). I guess we can just let the crons... [17:42:40] (PS7) Gehel: prometheus::class_config: fix deprecation warnings [puppet] - https://gerrit.wikimedia.org/r/468323 [17:44:44] (CR) Gehel: [C: 2] prometheus::class_config: fix deprecation warnings [puppet] - https://gerrit.wikimedia.org/r/468323 (owner: Gehel) [17:45:13] godog, herron: just merged that prometheus cleanup ^ [17:45:25] should be all good, but if you see something strange, ping me [17:45:38] rgr that, thanks for the heads up!
[17:45:46] RECOVERY - puppet last run on logstash1008 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [17:56:00] jouncebot: now [17:56:00] For the next 0 hour(s) and 3 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181018T1700) [17:56:03] jouncebot: next [17:56:03] In 0 hour(s) and 3 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181018T1800) [17:58:18] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, 10User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10herron) I think it depends too on the path the message takes. (i.e. was this sent directly via tools-mail, or was it first relayed through the local... [18:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Morning SWAT (Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181018T1800). [18:00:05] No GERRIT patches in the queue for this window AFAICS. 
[18:00:28] I'm adding a late addition
[18:04:16] (PS1) BBlack: cc authdns integration: use correct env var name [puppet] - https://gerrit.wikimedia.org/r/468368
[18:05:43] (CR) Alex Monk: [C: 1] "oops" [puppet] - https://gerrit.wikimedia.org/r/468368 (owner: BBlack)
[18:06:06] PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.19 seconds
[18:06:07] PROBLEM - MariaDB Slave Lag: s7 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.06 seconds
[18:06:17] PROBLEM - MariaDB Slave Lag: s7 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 303.27 seconds
[18:06:27] PROBLEM - MariaDB Slave Lag: s7 on db2077 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.46 seconds
[18:06:27] PROBLEM - MariaDB Slave Lag: s7 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.44 seconds
[18:06:27] PROBLEM - MariaDB Slave Lag: s7 on db2054 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.52 seconds
[18:06:45] (CR) BBlack: [C: 2] cc authdns integration: use correct env var name [puppet] - https://gerrit.wikimedia.org/r/468368 (owner: BBlack)
[18:06:57] PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.17 seconds
[18:07:07] PROBLEM - MariaDB Slave Lag: s7 on db2047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 318.13 seconds
[18:09:29] (CR) Kaldari: [C: 1] admins: add kharlan to 'restricted' group [puppet] - https://gerrit.wikimedia.org/r/468327 (https://phabricator.wikimedia.org/T207330) (owner: Dzahn)
[18:09:49] these lag on codfw is mine
[18:11:17] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 715.34 seconds
[18:11:19] (PS1) Elukey: camus: tune eventlogging-client-side config [puppet] - https://gerrit.wikimedia.org/r/468369 (https://phabricator.wikimedia.org/T206542)
[18:12:04] (CR) Elukey: [V: 2 C: 2] "The job is disabled now
but I am updating the conf anyway :)" [puppet] - https://gerrit.wikimedia.org/r/468369 (https://phabricator.wikimedia.org/T206542) (owner: Elukey)
[18:15:01] (PS1) BBlack: certcentral[12]001: static mapped v6 [puppet] - https://gerrit.wikimedia.org/r/468370
[18:15:05] (CR) Ottomata: [C: 2] Add druid_load jobs to analytics refinery [puppet] - https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342) (owner: Mforns)
[18:15:20] (PS9) Ottomata: Add druid_load jobs to analytics refinery [puppet] - https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342) (owner: Mforns)
[18:15:44] (CR) Ottomata: [V: 2 C: 2] Add druid_load jobs to analytics refinery [puppet] - https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342) (owner: Mforns)
[18:15:46] (CR) BBlack: [C: 2] certcentral[12]001: static mapped v6 [puppet] - https://gerrit.wikimedia.org/r/468370 (owner: BBlack)
[18:16:04] I'll do the SWAT for RoanKattouw's late addition when it's ready
[18:16:08] <- failed at racing, must get faster :)
[18:16:21] (PS2) BBlack: certcentral[12]001: static mapped v6 [puppet] - https://gerrit.wikimedia.org/r/468370
[18:16:25] (CR) BBlack: [V: 2 C: 2] certcentral[12]001: static mapped v6 [puppet] - https://gerrit.wikimedia.org/r/468370 (owner: BBlack)
[18:16:47] Operations, Cloud-Services, Mail, Patch-For-Review, User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (GTirloni) I think that makes perfect sense (mx1001 vs mx-out01). Let me just add, a recently provisioned Cloud VPS instance has exim4 configured to se...
[18:19:04] !log Restarting ORES services for T88997
[18:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:08] T88997: Improve graphite failover - https://phabricator.wikimedia.org/T88997
[18:20:56] PROBLEM - Device not healthy -SMART- on cloudvirt1019 is CRITICAL: cluster=misc device={cciss,6,cciss,7,cciss,8,cciss,9} instance=cloudvirt1019:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudvirt1019&var-datasource=eqiad%2520prometheus%252Fops
[18:21:36] PROBLEM - MariaDB Slave Lag: s7 on db1116 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 324.79 seconds
[18:28:04] kostajh: I'll deploy your patch
[18:28:14] stephanebisson: thank you
[18:28:16] (PS1) Mforns: Fix broken default job_class for eventlogging_to_druid_job.pp [puppet] - https://gerrit.wikimedia.org/r/468374 (https://phabricator.wikimedia.org/T206342)
[18:30:04] (CR) Ottomata: [C: 2] Fix broken default job_class for eventlogging_to_druid_job.pp [puppet] - https://gerrit.wikimedia.org/r/468374 (https://phabricator.wikimedia.org/T206342) (owner: Mforns)
[18:34:19] (CR) Muehlenhoff: "For background: When the original patch was created, precise was still around (but didn't have pxz yet), and I think it was simply merged " [puppet] - https://gerrit.wikimedia.org/r/468348 (owner: Faidon Liambotis)
[18:36:54] (PS1) Andrew Bogott: nova: update scheduling pools for main and eqiad1 [puppet] - https://gerrit.wikimedia.org/r/468377
[18:37:40] Operations, Core Platform Team Backlog (Watching / External), HHVM, Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (cscott)
[18:38:57] Operations, Wikidata, Wikidata-Query-Service, cloud-services-team: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (Andrew) note to self, I can
merge https://gerrit.wikimedia.org/r/#/c/operations/...
[18:44:35] stephanebisson: Still waiting for jerkins?
[18:45:33] Reedy: I created the cherry-pick manually and it merged into the wrong place. New patch coming. I think we still have time to do it in this window.
[18:52:25] (PS2) Andrew Bogott: nova: update scheduling pools for main and eqiad1 [puppet] - https://gerrit.wikimedia.org/r/468377
[18:52:27] (PS1) Andrew Bogott: nova: add cloudvirt1018 to the eqiad1 scheduling pool [puppet] - https://gerrit.wikimedia.org/r/468386
[18:53:33] (CR) Andrew Bogott: [C: 2] nova: add cloudvirt1018 to the eqiad1 scheduling pool [puppet] - https://gerrit.wikimedia.org/r/468386 (owner: Andrew Bogott)
[18:57:06] (PS1) Cwhite: graphite: add interface::rps settings to graphite hosts [puppet] - https://gerrit.wikimedia.org/r/468388 (https://phabricator.wikimedia.org/T196484)
[18:57:10] kostajh: your change is on mwdebug1001.eqiad.wmnet, can you test?
[18:57:18] Yes looking now
[19:00:04] twentyafterfour: #bothumor I � Unicode. All rise for MediaWiki train - Americas version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181018T1900).
[19:00:30] Sorry, I'm just finishing up SWAT, won't be long
[19:00:55] (PS2) Cwhite: graphite: add interface::rps settings to graphite hosts [puppet] - https://gerrit.wikimedia.org/r/468388 (https://phabricator.wikimedia.org/T196484)
[19:01:10] stephanebisson: looks good
[19:01:49] kostajh: deploying everywhere...
[19:02:22] !log sbisson@deploy1001 Synchronized php-1.32.0-wmf.26/extensions/PageTriage/: SWAT: [[gerrit:468384|Use Main Object Stash for keeping track of PageTriage last use]] (duration: 00m 54s)
[19:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:02:31] voila
[19:02:36] SWAT finished
[19:04:29] stephanebisson: thanks!
[19:04:57] jouncebot: now
[19:04:58] For the next 1 hour(s) and 55 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181018T1900)
[19:07:01] (CR) Dzahn: Added new role::redis::misc for general purposes redis servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/467734 (https://phabricator.wikimedia.org/T206450) (owner: Effie Mouzeli)
[19:07:07] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 273.81 seconds
[19:08:04] (CR) Dzahn: Added new role::redis::misc for general purposes redis servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/467734 (https://phabricator.wikimedia.org/T206450) (owner: Effie Mouzeli)
[19:11:28] !log upping ring buffer size on graphite1004 in an attempt to mitigate dropped packets at the interface -- T196484
[19:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:31] T196484: rack/setup/install graphite1004 - https://phabricator.wikimedia.org/T196484
[19:12:10] (CR) Dzahn: [C: 2] "no change but for style reasons these are supposed to be in profiles, not in modules" [puppet] - https://gerrit.wikimedia.org/r/467686 (owner: Dzahn)
[19:12:30] (PS2) Dzahn: install_server: move letsencrypt::cert::integrated to profile [puppet] - https://gerrit.wikimedia.org/r/467686
[19:12:41] (PS15) Paladox: Planet: Redesign UI [puppet] - https://gerrit.wikimedia.org/r/467100 (https://phabricator.wikimedia.org/T207243)
[19:13:17] PROBLEM - Host graphite1004 is DOWN: PING CRITICAL - Packet loss = 100%
[19:13:59] and that looks like it took it offline ^
[19:15:41] shdubsh: let's powercycle it?
[19:16:55] i see serial console already in use.. ok
[19:16:56] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[19:17:34] runs puppet on that
[19:17:57] !log rebooting graphite1004
[19:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:37] seems odd that resetting the interface back didn't work, but it's possible after tuning a setting this low a powercycle might be necessary
[19:19:54] Does anybody run a something which can cause replication lag across s7?
[19:20:47] banyek: apart from a.mir's script that was mentioned before?
[19:21:17] RECOVERY - Host graphite1004 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[19:21:42] I just didn't found it on SAL
[19:21:57] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:22:01] banyek: Is there a specific query being run?
[19:22:16] (PS1) 20after4: all wikis to 1.32.0-wmf.26 refs T191072 [mediawiki-config] - https://gerrit.wikimedia.org/r/468406
[19:22:18] (CR) 20after4: [C: 2] all wikis to 1.32.0-wmf.26 refs T191072 [mediawiki-config] - https://gerrit.wikimedia.org/r/468406 (owner: 20after4)
[19:22:36] (PS1) Dzahn: releases: fix where the inactive warning motd is displayed [puppet] - https://gerrit.wikimedia.org/r/468407
[19:22:57] banyek: it was mentioned by maro.stegui earlier today and I see Amir1 mentioning it above
[19:23:00] hmm, so rebooting it resets the interface back to default values
[19:23:05] reason being that the master in codfw has spinning disks IIRC
[19:23:15] (PS2) Dzahn: releases: fix where the inactive warning motd is displayed [puppet] - https://gerrit.wikimedia.org/r/468407
[19:23:24] yes, it's not SSD
[19:23:33] so sure about it, ptobably I am just having a false positive, but I was curious if I missed something, or it is worth to digging in
[19:23:35] that's the reasons it's lagging behind eqiad
[19:23:48] when the load is high (by yours truly)
[19:23:48] (Merged) jenkins-bot: all wikis to 1.32.0-wmf.26 refs
T191072 [mediawiki-config] - https://gerrit.wikimedia.org/r/468406 (owner: 20after4)
[19:23:50] (CR) Dzahn: [C: 2] "as mentioned by Alex in today's Service Ops meeting" [puppet] - https://gerrit.wikimedia.org/r/468407 (owner: Dzahn)
[19:23:51] sorry
[19:24:01] the only question is db1116
[19:24:21] that is lagging in eqiad
[19:25:08] sorry , have to eat something bbl
[19:25:39] if it's not a main db, the script doesn't wait for its replication too
[19:25:56] which happened on some other hosts as well
[19:26:26] yeah, it was a false positive, sorry about it, really
[19:33:17] PROBLEM - Host graphite1004 is DOWN: PING CRITICAL - Packet loss = 100%
[19:33:31] ^^ bblack and I are aware
[19:34:46] RECOVERY - Host graphite1004 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms
[19:40:13] (CR) jenkins-bot: all wikis to 1.32.0-wmf.26 refs T191072 [mediawiki-config] - https://gerrit.wikimedia.org/r/468406 (owner: 20after4)
[19:40:22] (PS1) Bartosz Dziewoński: Fix fetching file descriptions from beta Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/468413 (https://phabricator.wikimedia.org/T206546)
[19:41:12] twentyafterfour: All good with hte train?
[19:41:19] hellos folks. could someone deploy a beta-only config fix for me (https://gerrit.wikimedia.org/r/468413), or do i need to schedule it for SWAT?
[19:43:15] MatmaRex: I can do it
[19:44:25] (CR) Ladsgroup: [C: 2] Fix fetching file descriptions from beta Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/468413 (https://phabricator.wikimedia.org/T206546) (owner: Bartosz Dziewoński)
[19:45:44] (Merged) jenkins-bot: Fix fetching file descriptions from beta Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/468413 (https://phabricator.wikimedia.org/T206546) (owner: Bartosz Dziewoński)
[19:51:53] (CR) jenkins-bot: Fix fetching file descriptions from beta Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/468413 (https://phabricator.wikimedia.org/T206546) (owner: Bartosz Dziewoński)
[19:52:13] (PS3) Dzahn: icinga: replace Nagios::Plugin with Monitoring::Plugin in etcd_cluster_health [puppet] - https://gerrit.wikimedia.org/r/467015 (https://phabricator.wikimedia.org/T202782)
[19:52:15] (PS1) Dzahn: icinga: make icinga user flexible in update-etcd-mw-config-lastindex [puppet] - https://gerrit.wikimedia.org/r/468414 (https://phabricator.wikimedia.org/T202782)
[19:53:59] (CR) Dzahn: [C: -1] "this is executed via NRPE unlike all other converted scripts. so need to first ensure the package is installed on the monitored hosts .. a" [puppet] - https://gerrit.wikimedia.org/r/467015 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[19:55:34] (PS2) Dzahn: icinga: make icinga user flexible in update-etcd-mw-config-lastindex [puppet] - https://gerrit.wikimedia.org/r/468414 (https://phabricator.wikimedia.org/T202782)
[19:55:36] Amir1: thanks. is it supposed to be deployed now?
i'm not seeing the expected effect
[19:56:08] (on https://en.wikipedia.beta.wmflabs.org/wiki/File:Test_2018-10-18.png , the link "View on Wikimedia Commons (beta)" should have a HTTPS href rather than proto-relative href)
[19:56:09] MatmaRex: it gets deployed automatically, it would take some time
[19:56:14] sometimes hours
[19:57:00] (PS2) Cwhite: hiera: remove diamond from scb role [puppet] - https://gerrit.wikimedia.org/r/466906 (https://phabricator.wikimedia.org/T183454)
[19:58:06] RECOVERY - Disk space on notebook1003 is OK: DISK OK
[19:58:31] hm, okay. thanks
[19:58:37] (CR) Cwhite: [C: 2] hiera: remove diamond from scb role [puppet] - https://gerrit.wikimedia.org/r/466906 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite)
[20:00:30] MatmaRex: That and thcipriani is breaking beta ;P
[20:01:10] * thcipriani stops breaking beta
[20:03:30] twentyafterfour: Are you deploying .26 bump?
[20:04:23] can you ping me when it's supposed to be un-broken?
[20:14:20] (PS2) Cwhite: hiera: remove diamond on dumps role [puppet] - https://gerrit.wikimedia.org/r/466904 (https://phabricator.wikimedia.org/T183454)
[20:16:52] (CR) Cwhite: [C: 2] hiera: remove diamond on dumps role [puppet] - https://gerrit.wikimedia.org/r/466904 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite)
[20:16:55] Operations, Electron-PDFs, Proton, Patch-For-Review, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (pmiazga) From today Chromium-render receives 100% of pdf-render traffic, It works under There 3 remaining bits - [[ https://gerrit.wikimed...
[20:18:05] (CR) Volans: [C: -1] "This file is installed by icinga::monitor::etcd_mw_config that has no parameters and I don't see the user parameter added there."
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/468414 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[20:18:56] PROBLEM - Check systemd state on db2042 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[20:18:56] PROBLEM - Check whether ferm is active by checking the default input chain on db2042 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[20:21:06] !log start ferm on db2042, it failed to start at reboot due to DNS resolution timeout
[20:21:07] RECOVERY - Check systemd state on db2042 is OK: OK - running: The system is fully operational
[20:21:07] RECOVERY - Check whether ferm is active by checking the default input chain on db2042 is OK: OK ferm input default policy is set
[20:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:47] Operations, Electron-PDFs, Proton, Epic, and 4 others: [EPIC] New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (Jdlrobson)
[20:23:20] Operations, Electron-PDFs, Proton, Services, and 3 others: Upgrade Puppeteer to 1.9.0 - https://phabricator.wikimedia.org/T207416 (Jdlrobson) p:Triage>High
[20:23:55] MatmaRex: I was stopping scap on beta briefly, it's completed now so should be up-to-date
[20:25:48] Operations, Electron-PDFs, Proton, Epic, and 4 others: [EPIC] New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (Jdlrobson) This is causing a lot of noise in our blocked by others column and devaluing the purpose of it. To me at least this seems like an epic...
[20:25:57] Operations, ops-codfw, DBA, User-Banyek: db2042 (m3) master RAID battery failed - https://phabricator.wikimedia.org/T202051 (Volans) db2042 failed to start `ferm` at reboot due to a DNS timeout query: ``` Oct 18 15:53:04 db2042 ferm[837]: DNS query for 'prometheus2003.codfw.wmnet' failed: query t...
[20:28:57] (PS2) Cwhite: hiera: remove diamond on deployment_server role [puppet] - https://gerrit.wikimedia.org/r/466903 (https://phabricator.wikimedia.org/T183454)
[20:30:20] Operations: ferm fail to start at boot in some cases - https://phabricator.wikimedia.org/T207417 (Volans)
[20:30:42] Operations, ops-codfw, DBA, User-Banyek: db2042 (m3) master RAID battery failed - https://phabricator.wikimedia.org/T202051 (Volans) Opened T207417 for the ferm part.
[20:31:26] (CR) Cwhite: [C: 2] hiera: remove diamond on deployment_server role [puppet] - https://gerrit.wikimedia.org/r/466903 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite)
[20:32:13] (PS5) Urbanecm: Initial configuration for vnwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/467528 (https://phabricator.wikimedia.org/T207052)
[20:32:38] (PS1) Cwhite: Revert "hiera: remove diamond on deployment_server role" [puppet] - https://gerrit.wikimedia.org/r/468422
[20:32:55] (CR) Urbanecm: "Vn, code for Vietnam, not Vietnamese. Sorry for the confusion. It should be fixed now."
[mediawiki-config] - https://gerrit.wikimedia.org/r/467528 (https://phabricator.wikimedia.org/T207052) (owner: Urbanecm)
[20:34:00] (CR) Cwhite: [C: 2] Revert "hiera: remove diamond on deployment_server role" [puppet] - https://gerrit.wikimedia.org/r/468422 (owner: Cwhite)
[20:34:25] Operations: ferm fail to start at boot in some cases - https://phabricator.wikimedia.org/T207417 (MoritzMuehlenhoff) That should be https://phabricator.wikimedia.org/T148986, but it's also a generic issue for other system services as well, which also rely on working name resolution.
[20:35:32] (PS3) Dzahn: icinga: make icinga user flexible in update-etcd-mw-config-lastindex [puppet] - https://gerrit.wikimedia.org/r/468414 (https://phabricator.wikimedia.org/T202782)
[20:36:40] Reedy: sorry doing it now
[20:37:03] Heh, np
[20:37:06] Just wanted to check :)
[20:38:44] (PS4) Dzahn: icinga: make icinga user flexible in update-etcd-mw-config-lastindex [puppet] - https://gerrit.wikimedia.org/r/468414 (https://phabricator.wikimedia.org/T202782)
[20:39:30] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.32.0-wmf.26
[20:39:31] [W8jvWwpAEMIAAK0Xl-kAAACE] /w/api.php Wikimedia\Rdbms\DBQueryError from line 1496 of /srv/mediawiki/php-1.32.0-wmf.26/includes/libs/rdbms/database/Database.php: A database query error has occurred. Did you forget to run your application's database schema updater after upgrading?
[20:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:33] Query: INSERT INTO `filearchive` (fa_storage_group,fa_storage_key,fa_deleted_user,fa_deleted_timestamp,fa_deleted,fa_name,fa_archive_name,fa_size,fa_width,fa_height,fa_metadata,fa_bits,fa_media_type,fa_major_mime,fa_minor_mime,fa_timestamp,fa_sha1,fa_deleted_reason,fa_deleted_reason_id,fa_description,fa_description_id,fa_user,fa_user_text) VALUES
[20:39:35] ('deleted','6sabbtfos8ymhg4lbf5s0axvg49m1lm.jpg','1592032','20181018203857','0','T.F.Bereznyak_1942_y.JPG',NULL,'56232','185','266','a:11:{s:10:\"ImageWidth\";i:185;s:11:\"ImageLength\";i:266;s:11:\"Compression\";i:5;s:25:\"PhotometricInterpretation\";i:2;s:15:\"SamplesPerPixel\";i:3;s:12:\"RowsPerStrip\";i:22;s:11:\"XResolution\";s:5:\"100/1\";s:11:\"YResolution\";s:5:\"100/1\";s:19:\"PlanarCon
[20:39:37] figuration\";i:1;s:14:\"ResolutionUnit\";i:2;s:22:\"MEDIAWIKI_EXIF_VERSION\";i:1;}','8','BITMAP','image','jpeg','20090304064239','6sabbtfos8ymhg4lbf5s0axvg49m1lm','[[COM:DW|Derivative work]] of non-free content ([[COM:CSD#F3|F3]]) - ','22142969','Т.Ф. Березняк 1942',NULL,'572390','Alexcando')
[20:39:39] Function: LocalFileDeleteBatch::doDBInserts
[20:39:41] Error: 1048 Column 'fa_description_id' cannot be null (10.64.48.23)
[20:39:46] ugh
[20:40:19] that's a new error I think
[20:40:40] (PS1) Bstorm: sonofgridengine: Correct and expand things enough to deploy a grid master [puppet] - https://gerrit.wikimedia.org/r/468462 (https://phabricator.wikimedia.org/T200557)
[20:40:58] twentyafterfour: Which wiki?
[20:41:05] commons
[20:41:18] seeing quite a few of that error
[20:41:30] 106 since I synced
[20:41:33] (CR) Cwhite: [C: 1] icinga: make icinga user flexible in update-etcd-mw-config-lastindex [puppet] - https://gerrit.wikimedia.org/r/468414 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[20:41:35] $fields['fa_description_id'] =
[20:41:35] 'CASE WHEN img_description_id = 0 THEN imgcomment_description_id ELSE img_description_id END';
[20:41:37] anomie: ^
[20:43:17] (CR) Volans: icinga: make icinga user flexible in update-etcd-mw-config-lastindex (2 comments) [puppet] - https://gerrit.wikimedia.org/r/468414 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[20:43:42] Reedy: Is there a task filed yet?
[20:43:49] anomie: not yet
[20:43:52] I was about to do that
[20:44:16] was trying to decide if it needs a roll back
[20:44:30] Operations, fundraising-tech-ops: add icinga1001 to send_nsca and pfw rules in FRACK - https://phabricator.wikimedia.org/T207175 (cwdent) [frack::puppet::private] 527140c add iptables/pfw rules for icinga1001 @Dzahn thanks for the heads up! @ayounsi whenever you have time: 1539894800
[20:44:51] Operations, fundraising-tech-ops, netops: add icinga1001 to send_nsca and pfw rules in FRACK - https://phabricator.wikimedia.org/T207175 (cwdent)
[20:45:03] (CR) Dzahn: icinga: make icinga user flexible in update-etcd-mw-config-lastindex (2 comments) [puppet] - https://gerrit.wikimedia.org/r/468414 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[20:45:26] (CR) Cwhite: [C: 2] "there is another diamond::collector declared in diamond::collector::nagios_lib:" [puppet] - https://gerrit.wikimedia.org/r/466903 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite)
[20:45:32] Hmm, error on page deletion on enwiki
[20:45:48] so that's a yes to rollback?
[20:46:01] just gonna look what the error was
[20:46:43] looks like a yes to rollback :/
[20:46:56] enwiki was a Error: 1205 Lock wait timeout exceeded; try restarting transaction (10.64.32.64)
[20:47:03] So annoying, but not a big deal on that one
[20:47:03] (PS1) 20after4: group2 wikis to 1.32.0-wmf.24 refs T191072 [mediawiki-config] - https://gerrit.wikimedia.org/r/468465
[20:47:05] (CR) 20after4: [C: 2] group2 wikis to 1.32.0-wmf.24 refs T191072 [mediawiki-config] - https://gerrit.wikimedia.org/r/468465 (owner: 20after4)
[20:47:09] (PS5) Dzahn: icinga: make icinga user flexible in update-etcd-mw-config-lastindex [puppet] - https://gerrit.wikimedia.org/r/468414 (https://phabricator.wikimedia.org/T202782)
[20:47:58] anomie: I know it's getting late in the day for you, do you have time to investigate that issue?
[20:48:10] (Merged) jenkins-bot: group2 wikis to 1.32.0-wmf.24 refs T191072 [mediawiki-config] - https://gerrit.wikimedia.org/r/468465 (owner: 20after4)
[20:48:13] twentyafterfour: I can have a fix in a few minutes. The only confusing thing is why this didn't happen before.
[20:48:30] cool, thanks
[20:48:30] ok rolling back in the meantime
[20:49:50] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group2 wikis to 1.32.0-wmf.24 refs T191072
[20:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:53] T191072: 1.32.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T191072
[20:50:27] (CR) jenkins-bot: group2 wikis to 1.32.0-wmf.24 refs T191072 [mediawiki-config] - https://gerrit.wikimedia.org/r/468465 (owner: 20after4)
[20:51:31] Reedy, twentyafterfour: Do we have a task for me to attack the fix to yet?
[20:52:54] anomie: https://phabricator.wikimedia.org/T207419
[20:54:18] (PS2) Bstorm: sonofgridengine: Correct and expand things enough to deploy a grid master [puppet] - https://gerrit.wikimedia.org/r/468462 (https://phabricator.wikimedia.org/T200557)
[20:55:04] (PS6) Dzahn: icinga: make icinga user flexible in update-etcd-mw-config-lastindex [puppet] - https://gerrit.wikimedia.org/r/468414 (https://phabricator.wikimedia.org/T202782)
[20:55:06] (PS1) Dzahn: icinga: add puppet types for parameters [puppet] - https://gerrit.wikimedia.org/r/468468
[20:58:08] Reedy, twentyafterfour, greg-g: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/468469
[20:58:22] cherry-picked to 1.32.0-wmf.26
[20:58:24] it's already been cherry picked, haha
[21:02:56] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[21:05:16] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[21:05:36] hmm
[21:07:02] Operations, DBA, JADE, MW-1.32-notes (WMF-deploy-2018-09-25 (1.32.0-wmf.23)), and 4 others: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (awight) @Marostegui We've merged the DDL to our repo in order to unblock development, so here ar...
[21:10:13] (CR) Gehel: wdqs: cleanup logback configuration (1 comment) [puppet] - https://gerrit.wikimedia.org/r/463254 (https://phabricator.wikimedia.org/T200563) (owner: Gehel)
[21:15:46] anomie: https://integration.wikimedia.org/ci/job/wmf-quibble-core-vendor-mysql-hhvm-docker/4732/console
[21:15:51] Reedy: I figured out why it started happening in wmf.26: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/468475
[21:16:04] twentyafterfour, Reedy greg-g -> FYI, there is a patch in current train that is going to increase the traffic to statsd
[21:16:14] https://grafana.wikimedia.org/dashboard/db/reading-web-dashboard?orgId=1&panelId=15&fullscreen&edit&tab=general&from=now%2Fd&to=now%2Fd
[21:16:31] we have a bucket, and everytime JS errors happens we increment that bucket
[21:16:36] twentyafterfour: Weird npm flakiness?
[21:16:41] I guess so
[21:16:46] as we need to know what is the rough amount of JS errors we have on prod
[21:17:02] raynor: how much of an increase?
[21:17:09] it's not a blocker, more a heads up to you that there will be a bit more traffic
[21:17:25] ok
[21:17:31] twentyafterfour -> I don't think we know the increase, we do not have any stats on client-side errors
[21:17:42] thats why we want to track it
[21:19:08] https://gerrit.wikimedia.org/r/#/c/467760/ -> thats the config change that enabled this feature
[21:19:10] (CR) Gehel: scap::target: added additional_services_names param (1 comment) [puppet] - https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) (owner: Mathew.onipe)
[21:25:33] (CR) Smalyshev: wdqs: cleanup logback configuration (1 comment) [puppet] - https://gerrit.wikimedia.org/r/463254 (https://phabricator.wikimedia.org/T200563) (owner: Gehel)
[21:26:42] (CR) Gehel: wdqs: cleanup logback configuration (1 comment) [puppet] - https://gerrit.wikimedia.org/r/463254 (https://phabricator.wikimedia.org/T200563) (owner: Gehel)
[21:28:52] (CR) Dzahn: icinga: make
icinga user flexible in update-etcd-mw-config-lastindex (1 comment) [puppet] - https://gerrit.wikimedia.org/r/468414 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[21:29:20] !log twentyafterfour@deploy1001 Synchronized php-1.32.0-wmf.26/includes/filerepo/file/LocalFile.php: sync https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/468470/ refs T207419 (duration: 00m 54s)
[21:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:23] T207419: sql error: Error: 1048 Column 'fa_description_id' cannot be null - https://phabricator.wikimedia.org/T207419
[21:31:03] (PS1) 20after4: group2 wikis to 1.32.0-wmf.26 refs T191072 [mediawiki-config] - https://gerrit.wikimedia.org/r/468479
[21:31:05] (CR) 20after4: [C: 2] group2 wikis to 1.32.0-wmf.26 refs T191072 [mediawiki-config] - https://gerrit.wikimedia.org/r/468479 (owner: 20after4)
[21:32:11] (Merged) jenkins-bot: group2 wikis to 1.32.0-wmf.26 refs T191072 [mediawiki-config] - https://gerrit.wikimedia.org/r/468479 (owner: 20after4)
[21:32:51] Fatal error: entire web request took longer than 200 seconds and timed out in /srv/mediawiki/php-1.32.0-wmf.24/vendor/liuggio/statsd-php-client/src/Liuggio/StatsdClient/StatsdClient.php on line 68
[21:33:09] raynor: ^
[21:33:15] (PS1) Cwhite: diamond: remove nagios collector [puppet] - https://gerrit.wikimedia.org/r/468480 (https://phabricator.wikimedia.org/T183454)
[21:33:21] (PS1) Cwhite: keyholder, diamond: remove nagios collector and diamond [puppet] - https://gerrit.wikimedia.org/r/468481 (https://phabricator.wikimedia.org/T183454)
[21:34:01] (CR) jerkins-bot: [V: -1] diamond: remove nagios collector [puppet] - https://gerrit.wikimedia.org/r/468480 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite)
[21:34:34] (CR) Dzahn: [C: 2] "https://puppet-compiler.wmflabs.org/compiler1002/13082/" [puppet] - https://gerrit.wikimedia.org/r/468414
(https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[21:34:37] hmm
[21:34:37] RECOVERY - MariaDB Slave Lag: s7 on db1116 is OK: OK slave_sql_lag Replication lag: 0.42 seconds
[21:34:52] that might be us
[21:35:10] but I also know that PHP likes to timeout in way to many places right now
[21:35:42] anomie: is this related to the last error? A database query error has occurred. Did you forget to run your application's database schema updater after upgrading?
[21:35:44] Query: INSERT INTO `oldimage` (oi_name,oi_archive_name,oi_size,oi_width,oi_height,oi_bits,oi_timestamp,oi_metadata,oi_media_type,oi_major_mime,oi_minor_mime,oi_sha1,oi_description,oi_description_id,oi_user,oi_user_text) VALUES
[21:35:46] ('Muifa_2017_track.png','20181018213322!Muifa_2017_track.png','727878','2700','1669','8','20170429150526','a:6:{s:10:\"frameCount\";i:0;s:9:\"loopCount\";i:1;s:8:\"duration\";d:0;s:8:\"bitDepth\";i:8;s:9:\"colorType\";s:10:\"truecolour\";s:8:\"metadata\";a:1:{s:15:\"_MW_PNG_VERSION\";i:1;}}','BITMAP','image','png','b1xnzhws19ys4sxc9gglry77d73znna','04-29 00Z',NULL,'37289','Meow')
[21:35:48] Function: LocalFile::recordUpload2
[21:35:50] Error: 1048 Column 'oi_description_id' cannot be null (10.64.48.23)
[21:36:05] looks very similar
[21:36:14] also, the error is from wmf.24, our code is in wmf.26
[21:36:40] raynor: it might be spurious I don't know
[21:36:57] (PS2) Cwhite: diamond: remove nagios collector [puppet] - https://gerrit.wikimedia.org/r/468480 (https://phabricator.wikimedia.org/T183454)
[21:37:23] (PS3) Cwhite: diamond: remove nagios collector [puppet] - https://gerrit.wikimedia.org/r/468480 (https://phabricator.wikimedia.org/T183454)
[21:37:29] twentyafterfour: $fields['oi_description_id'] =
[21:37:29] 'CASE WHEN img_description_id = 0 THEN imgcomment_description_id ELSE img_description_id END';
[21:37:34] Code also looks similar
[21:38:01] (PS4) Cwhite: diamond: remove nagios collector [puppet] -
10https://gerrit.wikimedia.org/r/468480 (https://phabricator.wikimedia.org/T183454) [21:40:36] ugh so many errors right now I'm having a hard time sorting them all out [21:41:18] anomie: Same fix to apply there? [21:42:29] seems like it's the same error, maybe it didn't sync the fix correctly? I'm confused [21:42:41] I'm gonna try syncing again [21:43:19] (03CR) 10jenkins-bot: group2 wikis to 1.32.0-wmf.26 refs T191072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468479 (owner: 1020after4) [21:43:49] Apparently my eyes are tired [21:43:54] (03PS1) 10Kaldari: Adding TemplateWizard to Beta Features whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468483 (https://phabricator.wikimedia.org/T205290) [21:46:02] mine too [21:46:39] !log twentyafterfour@deploy1001 Synchronized php-1.32.0-wmf.26/includes/filerepo/file/LocalFile.php: sync Id97e1c7c2655d90928c777bc3377e5ea23f49f6b refs T207419 (duration: 00m 53s) [21:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:42] T207419: sql error: Error: 1048 Column 'fa_description_id' cannot be null - https://phabricator.wikimedia.org/T207419 [21:47:39] (03CR) 10Volans: [C: 031] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/466935 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [21:48:18] (03CR) 10Samwilson: [C: 031] "I didn't realise this was a thing. Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468483 (https://phabricator.wikimedia.org/T205290) (owner: 10Kaldari) [21:48:54] hmm, my bouncer seems to miss giving me messages if I just suspend my laptop with IRC still running. Fortunately the channel is logged. [21:49:14] Reedy, twentyafterfour: The oldimage thing should have been fixed by the same patch. 
[21:49:20] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group2 wikis to 1.32.0-wmf.26 refs T191072 [21:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:24] T191072: 1.32.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T191072 [21:50:21] anomie: yeah that was my mixup, apologies for the ping [21:51:11] it appears to be fixed now [21:52:11] !log eeden - manually editing nagios NRPE config and restarting service (to make monitoring from icinga1001 work and puppet is disabled) [21:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:37] I sure would like to know what's up with all of these " Fatal error: entire web request took longer than 60 seconds and timed out in srv/mediawiki/php-1.32.0-wmf.24/includes/parser/Preprocessor_Hash.php on line 184" [21:54:34] (03CR) 10Kaldari: Adding TemplateWizard to Beta Features whitelist (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468483 (https://phabricator.wikimedia.org/T205290) (owner: 10Kaldari) [21:54:58] twentyafterfour: In general, a few weeks back we enabled an HHVM configuration setting to make it time out after 60 seconds. Which kind of screws things up if there are pages that take longer than that to parse. 
[21:55:28] apparently there are a few [21:59:05] twentyafterfour I'm calling a day but I think that jdlrobson will be around (just in case if something happens with our logging) [21:59:23] raynor: ok thanks [21:59:27] it's midnight here in PL, time to rest :) [21:59:30] looks like everything is stable as of now [21:59:52] ok, thats awesome to hear, I'll check graphs early in the morning [21:59:54] thx [22:00:42] !log lvs1011,lvs1012 - manually editing nagios NRPE config and restarting service (to make monitoring from icinga1001 work and puppet is disabled) [22:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:40] twentyafterfour: https://phabricator.wikimedia.org/T204871 is that issue, hard to pin down, probably just going to have to deal with the long tail of tons of places where this timeout is occuring since it wasn't enforced for a couple years :( [22:07:09] (03CR) 10Kaldari: Adding TemplateWizard to Beta Features whitelist (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/468483 (https://phabricator.wikimedia.org/T205290) (owner: 10Kaldari) [22:16:59] 10Operations, 10JADE, 10TechCom, 10Epic, and 3 others: Deploy JADE extension to production - https://phabricator.wikimedia.org/T183381 (10awight) [22:32:09] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10monitoring: Releases Jenkins Icinga check failing after restricting access - https://phabricator.wikimedia.org/T206579 (10Dzahn) [22:32:27] PROBLEM - High lag on wdqs1010 is CRITICAL: 6852 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [22:37:25] 10Operations, 10netops: BGP WARNING - AS15426/IPv4 - https://phabricator.wikimedia.org/T207428 (10Dzahn) [22:37:48] 10Operations, 10netops: cr2-esams - BGP WARNING - AS15426/IPv4 - https://phabricator.wikimedia.org/T207428 (10Dzahn) [22:43:47] 10Operations, 10Fundraising-Backlog, 10SRE-Access-Requests, 
10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10jkim_wikimedia) Thank you @cwdent !! [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181018T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:05:56] PROBLEM - puppet last run on ms-be1021 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [23:11:57] ACKNOWLEDGEMENT - MegaRAID on dbstore1002 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) daniel_zahn .