[00:34:43] I'm still around, couldn't sleep. logs are still fine [00:57:39] 10Operations, 10Traffic, 10Wikimedia-Hackathon-2018: Create and deploy a centralized letsencrypt service - https://phabricator.wikimedia.org/T194962#4230484 (10Krenair) > Separately, some sort of letsencrypt::server class would collect the list of hosts which have applied each of the defined certs, in order... [00:58:29] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure: Host deployment-puppetdb01 is DOWN: CRITICAL - Host Unreachable (10.68.23.76) - https://phabricator.wikimedia.org/T187736#4230486 (10Krenair) Yes, this instance's presence will not be optional in future, it will be needed for things like T194962 (and als... [00:59:47] 10Puppet, 10Beta-Cluster-Infrastructure: deployment-secureredirexperiment puppet error - https://phabricator.wikimedia.org/T191663#4230498 (10Krenair) 05Open>03Resolved a:03Krenair killed it [01:07:36] 10Puppet, 10Beta-Cluster-Infrastructure, 10cloud-services-team: labs-puppetmaster/Labs Puppetmaster HTTPS is UNKNOWN since [...] - https://phabricator.wikimedia.org/T191553#4109867 (10Krenair) It's supposed to have a certificate signed like that, clients of that puppetmaster will trust it as it's added to th... [01:15:02] 10Puppet, 10Beta-Cluster-Infrastructure, 10Shinken, 10cloud-services-team: labs-puppetmaster/Labs Puppetmaster HTTPS is UNKNOWN since [...] - https://phabricator.wikimedia.org/T191553#4230518 (10Krenair) ```root@shinken-01:~# openssl s_client -connect labs-puppetmaster.wikimedia.org:8140 CONNECTED(00000003... [01:19:15] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0 [01:19:25] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 74, down: 1, dormant: 0, excluded: 0, unused: 0 [01:20:32] one router is down [01:26:42] 10Puppet, 10Beta-Cluster-Infrastructure, 10Shinken, 10cloud-services-team: labs-puppetmaster/Labs Puppetmaster HTTPS is UNKNOWN since [...] - https://phabricator.wikimedia.org/T191553#4230537 (10Krenair) Looks like this was broken by @herron in https://gerrit.wikimedia.org/r/#/c/392423/ - check_https_port_... [01:29:05] PROBLEM - puppet last run on analytics1071 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/run-no-puppet] [01:32:40] (03PS1) 10Alex Monk: Fix Ifa0b210f: Fix another caller of this function to not break [puppet] - 10https://gerrit.wikimedia.org/r/435075 [01:33:25] (03PS2) 10Alex Monk: Fix Ifa0b210f: Fix another caller of this function to not break [puppet] - 10https://gerrit.wikimedia.org/r/435075 (https://phabricator.wikimedia.org/T191553) [01:34:08] 10Puppet, 10Beta-Cluster-Infrastructure, 10Shinken, 10cloud-services-team, 10Patch-For-Review: labs-puppetmaster/Labs Puppetmaster HTTPS is UNKNOWN since [...] 
- https://phabricator.wikimedia.org/T191553#4230559 (10Krenair) a:03Krenair [01:38:42] 10Operations, 10Traffic, 10Wikimedia-Hackathon-2018: Create and deploy a centralized letsencrypt service - https://phabricator.wikimedia.org/T194962#4230570 (10Krenair) Random upstream problem I noticed while browsing: https://tickets.puppetlabs.com/browse/PUP-8890 [01:41:41] 10Operations, 10Puppet, 10cloud-services-team: Puppet class systemd needs to throw a more useful error - https://phabricator.wikimedia.org/T195553#4230582 (10Krenair) `Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource... [01:56:16] (03PS1) 10Alex Monk: Revert "ircecho: replace base::service_unit with systemd::service" [puppet] - 10https://gerrit.wikimedia.org/r/435076 [01:57:40] (03PS2) 10Alex Monk: Revert "ircecho: replace base::service_unit with systemd::service" [puppet] - 10https://gerrit.wikimedia.org/r/435076 (https://phabricator.wikimedia.org/T195552) [01:59:25] RECOVERY - puppet last run on analytics1071 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [02:17:48] 10Puppet, 10Beta-Cluster-Infrastructure: deployment-eventlog05 puppet error about missing mysql heartbeat.heartbeat table - https://phabricator.wikimedia.org/T191109#4230621 (10Krenair) [03:02:01] 10Operations, 10Traffic, 10Wikimedia-Hackathon-2018: Create and deploy a centralized letsencrypt service - https://phabricator.wikimedia.org/T194962#4230636 (10BBlack) >>! In T194962#4230355, @Krenair wrote: > Anyway, as part of my initial code I made the "oh, it's not issued yet, let's use a self-signed cer... [03:02:51] I'm still around and logs are still fine [03:04:45] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0 [03:05:25] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0 [03:10:39] !log OS install on db209[4-5] [03:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:11:16] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 [03:11:56] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 [03:19:27] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4230641 (10Papaul) [03:30:20] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4230643 (10Papaul) [03:35:49] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4230644 (10Papaul) a:05Papaul>03Marostegui @Marostegui it is all yours. The only thing left to do is to add both servers into racktables. I a... 
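For context on the truncated T191553 paste above ([01:15:02]): a minimal sketch of inspecting the certificate the labs puppetmaster actually serves on port 8140, so its subject/issuer can be compared with what the check_https_port_* check expects. Only the host and port come from the log; everything else is illustrative.
```
# Close stdin so s_client exits after the handshake, then print the served
# certificate's subject, issuer and validity dates.
echo | openssl s_client -connect labs-puppetmaster.wikimedia.org:8140 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates
```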
[04:54:37] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435107 [04:56:20] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435107 (owner: 10Marostegui) [04:57:46] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435107 (owner: 10Marostegui) [04:59:50] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1104 after alter table (duration: 01m 21s) [04:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:59:59] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435107 (owner: 10Marostegui) [05:00:06] (03PS1) 10Marostegui: db-eqiad.php: Depool db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435108 [05:02:25] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435108 (owner: 10Marostegui) [05:02:28] marostegui: morning [05:03:07] I put up some patches in Wikibase to review and backport, will get it ASAP [05:03:24] Cool [05:03:30] Amir1: morning :) [05:03:47] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435108 (owner: 10Marostegui) [05:03:49] I am greatly surprised, the alter only took 2:40 hours [05:03:54] Thanks SSDs! [05:05:13] !log Add tmp1 index back to db1109 - T194273 [05:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:05:20] T194273: Clean up indexes of wb_terms table - https://phabricator.wikimedia.org/T194273 [05:05:39] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1109 for alter table (duration: 01m 20s) [05:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:06:12] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435108 (owner: 10Marostegui) [05:10:25] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435110 [05:11:45] (03Draft2) 10Biplab Anand: Enable template editor group on newiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435106 (https://phabricator.wikimedia.org/T195557) [05:12:09] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435110 [05:14:06] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435110 (owner: 10Marostegui) [05:15:46] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435110 (owner: 10Marostegui) [05:16:49] (03CR) 10Muehlenhoff: "This changes sudo permissions and admins groups, please create a separate access request ticket for next week's SRE meeting." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/434312 (https://phabricator.wikimedia.org/T186748) (owner: 10Mobrovac) [05:16:52] wait what: the index rebuilt in < 3 hours? really?? 
[05:17:02] apergos: yeah, very very very surprising [05:17:06] omg that is great [05:17:21] apergos: I guess the SSDs worked a lot better than expected :) [05:17:24] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1099:3311 after alter table (duration: 01m 21s) [05:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:17:48] (03PS1) 10Marostegui: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435112 (https://phabricator.wikimedia.org/T190148) [05:18:20] (03PS1) 10Ladsgroup: mediawiki: Stop replacing term_search_key with empty string [puppet] - 10https://gerrit.wikimedia.org/r/435113 (https://phabricator.wikimedia.org/T194273) [05:19:30] (03CR) 10Marostegui: [C: 032] mediawiki: Stop replacing term_search_key with empty string [puppet] - 10https://gerrit.wikimedia.org/r/435113 (https://phabricator.wikimedia.org/T194273) (owner: 10Ladsgroup) [05:19:59] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435112 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [05:20:58] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435110 (owner: 10Marostegui) [05:21:29] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435112 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [05:22:38] 10Operations, 10Wikidata, 10Wikimedia-General-or-Unknown, 10MW-1.32-release-notes (WMF-deploy-2018-05-15 (1.32.0-wmf.4)), and 2 others: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4229737 (10Tbayer) Incident report (in pro... [05:23:09] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1089 for alter table (duration: 01m 20s) [05:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:15] !log Deploy schema change on db1089 - T190148 T191519 T188299 [05:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:22] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [05:23:23] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [05:23:23] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [05:25:26] !log Stop MySQL on db1116 to copy its content to db1124 - T190704 [05:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:25:30] T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704 [05:27:37] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435112 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [05:32:29] 10Operations, 10Phabricator, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568#4230710 (10MoritzMuehlenhoff) We could also avoid downtime by temporarily reusing mw1298 (former image scaler) and reinstalling it as phab10... 
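The "Add tmp1 index back" !log entries above refer to recreating the dropped wb_terms index whose column list is quoted later in this log ([09:04:30]). A rough sketch of the statement as it might be run directly on the depooled replica; the database name and the binlog-skipping are assumptions, not taken from the log.
```
# Recreate tmp1 on the depooled replica only, without letting it replicate.
sudo mysql -e "SET SESSION sql_log_bin = 0;
  ALTER TABLE wikidatawiki.wb_terms
    ADD INDEX tmp1 (term_language, term_type, term_entity_type, term_search_key);"
```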
[05:35:07] (03PS3) 10Biplab Anand: Enable template editor group on newiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435106 (https://phabricator.wikimedia.org/T195557) [05:47:16] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4230744 (10Marostegui) a:05Marostegui>03Papaul The RAID isn't done apparently: ``` root@db2095:~# megacli -LDPDInfo -aAll Adapter #0 Numbe... [05:57:23] !log Add tmp1 index back on dbstore1002 - T194273 [05:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:57:28] T194273: Clean up indexes of wb_terms table - https://phabricator.wikimedia.org/T194273 [06:28:41] PROBLEM - puppet last run on dbproxy1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [06:31:11] PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled] [06:31:12] PROBLEM - puppet last run on mw1307 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh] [06:34:53] puppet issues? [06:36:12] RECOVERY - puppet last run on mw1307 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [06:36:13] I ran puppet again on mw1307 and it worked fine [06:36:17] oh ^ [06:36:28] probably just a quick blip on puppetdb [06:37:55] !log reimage db1065 after raid rebuild [06:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:01] yeah that's pretty standard those blips [06:48:47] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435118 [06:50:50] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435118 (owner: 10Marostegui) [06:51:42] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435118 (owner: 10Marostegui) [06:52:01] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435118 (owner: 10Marostegui) [06:53:09] (03PS1) 10Marostegui: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435120 (https://phabricator.wikimedia.org/T190148) [06:53:14] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1089 after alter table (duration: 01m 20s) [06:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:26] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435120 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [06:55:38] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435120 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [06:56:24] RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:09] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1114 for alter table (duration: 01m 20s) [06:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:14] !log Deploy schema 
change on db1114 - T190148 T191519 T188299 [06:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:20] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [06:57:20] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [06:57:20] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [06:57:59] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435120 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [06:58:54] RECOVERY - puppet last run on dbproxy1010 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:05:16] !log stop db1117:m2 to clone it to db1065 [07:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:33] 2 proxies will complain now [07:09:23] PROBLEM - haproxy failover on dbproxy1002 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [07:10:04] ls -lh [07:13:13] (03PS4) 10Jayprakash12345: Enable template editor group on newiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435106 (https://phabricator.wikimedia.org/T195557) (owner: 10Biplab Anand) [07:14:53] PROBLEM - haproxy failover on dbproxy1007 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [07:18:54] (03CR) 10Jayprakash12345: [C: 031] Enable template editor group on newiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435106 (https://phabricator.wikimedia.org/T195557) (owner: 10Biplab Anand) [08:06:10] RECOVERY - haproxy failover on dbproxy1002 is OK: OK check_failover servers up 2 down 0 [08:06:20] RECOVERY - haproxy failover on dbproxy1007 is OK: OK check_failover servers up 2 down 0 [08:06:55] o/ [08:07:44] hope your asleep now Amir1 :P [08:08:58] He went to take a nap an hour ago or so [08:09:26] good good :) [08:14:37] marostegui: how are the indexes going? [08:14:57] addshore: one server was done in 2:40h, the second one is running now, and it has been running for 3 hours now [08:15:23] https://phabricator.wikimedia.org/T194273#4230677 [08:15:31] ack :) [08:16:12] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1065 storage crash - https://phabricator.wikimedia.org/T195444#4227657 (10jcrespo) db1065 storage has been rebuilt and data cloned to it again. However, there is a smart error on the second disk (I think it is #1, as it starts from 0). We need a repla... [08:16:32] I guess I need to look at cleaning up our dirty dirty hacks from last night and come up with a chain of patches for bringing things back [08:17:06] addshore: I don't think we will have all the indexes today, so probably it will need to remain disabled [08:17:21] marostegui: ack! 
[08:17:58] we were going to leave it disabled anyway, but want to disable things individually and remove the return [] from the main worrysome method, so that we can then turn things on one by one after [08:18:20] Ah sure :) [08:18:23] so, might as well get the individual disabling patches ready and possibly merged now, but leave the return [] [08:18:38] (03PS1) 10Gilles: Enable performance survey on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435123 (https://phabricator.wikimedia.org/T187299) [08:19:36] (03PS1) 10Marostegui: s1,3,5,6.hosts: Add db1124 [software] - 10https://gerrit.wikimedia.org/r/435124 (https://phabricator.wikimedia.org/T190704) [08:21:47] (03CR) 10Gilles: [C: 032] Enable performance survey on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435123 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [08:22:25] morning [08:22:45] do we have a log of the zoo of queries against wb_terms now? [08:22:51] (03CR) 10Marostegui: [C: 032] s1,3,5,6.hosts: Add db1124 [software] - 10https://gerrit.wikimedia.org/r/435124 (https://phabricator.wikimedia.org/T190704) (owner: 10Marostegui) [08:23:06] i suspect we are not seeing the ones for PropertySuggester... [08:23:12] (03Merged) 10jenkins-bot: Enable performance survey on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435123 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [08:23:28] (03CR) 10jenkins-bot: Enable performance survey on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435123 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [08:23:53] (03Merged) 10jenkins-bot: s1,3,5,6.hosts: Add db1124 [software] - 10https://gerrit.wikimedia.org/r/435124 (https://phabricator.wikimedia.org/T190704) (owner: 10Marostegui) [08:24:57] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435125 [08:25:13] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435125 [08:26:18] !log gilles@tin Synchronized wmf-config/InitialiseSettings.php: T187299 Launch performance survey on frwiki (duration: 01m 22s) [08:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:24] T187299: User-perceived page load performance study - https://phabricator.wikimedia.org/T187299 [08:26:42] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435125 (owner: 10Marostegui) [08:27:56] ooohhh 1109 is done? excellent [08:28:01] marostegui: does the index have a better name now? [08:28:04] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435125 (owner: 10Marostegui) [08:28:16] DanielK_WMDE_: No, I used the same one it used to have for "consistency" XD [08:28:20] heh [08:28:25] DanielK_WMDE_: Don't want more surprises :) [08:28:26] fix on another day [08:28:42] apergos: yeah, took 3:17h for db1109 [08:29:05] nice [08:29:07] marostegui: well, to achieve consistency, we should put that index into the code base. and there it won't be called tmp1. so then we again have a mismatch :/ [08:29:26] anyway... is there a good way to see for dbas if and when an index is being used? [08:29:39] other than running EXPLAIN on a sample query, i mean [08:29:42] why dbas? [08:30:01] aren't index handled by devels? 
[08:30:16] they are defined by devs [08:30:18] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435125 (owner: 10Marostegui) [08:30:24] you have performance_Schema installed to check its usage [08:30:28] DanielK_WMDE_: There is a way to see if an index is used (and it was used before removing it) but it is not reliable because it depends on many variables, and this showed, it was in use when we thouhgt it wasn't, on a DB level [08:30:39] jynus: but devs that do not have access to production DBS cannot see what indexes are *actually* being used. [08:30:57] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1109 after alter table (duration: 01m 20s) [08:30:59] deploying devs do have acces to production dbs [08:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:30] to be able to do a show explain? I know we do, I didn't know deployers do generally [08:32:02] (03PS1) 10Marostegui: db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435128 (https://phabricator.wikimedia.org/T194273) [08:32:09] and logs show slow queries [08:32:15] jynus: fine. so is there a better way than running exmplain? or a good way to grab example queries for the exmplain, better than putting logging into the code? [08:32:29] it's not abot slow queries. [08:32:40] it'S about fast queries that are fast because they use specific indexes [08:32:49] ...so we know we shouldn't drop that index [08:32:50] performance_schema [08:33:35] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435128 (https://phabricator.wikimedia.org/T194273) (owner: 10Marostegui) [08:34:22] jynus: thank you. a quick google search tells me that's its own can of worms... https://dev.mysql.com/doc/refman/8.0/en/performance-schema-optimization.html [08:34:30] do we have exampleps / best practices for that somewhere? [08:34:43] its own can of worms? [08:34:51] wtf? [08:34:56] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435128 (https://phabricator.wikimedia.org/T194273) (owner: 10Marostegui) [08:35:55] jynus: it seems to me like querying the performance schema may itself cause performance issues. you have to know a bit about it to use it right. no? [08:36:09] it causes no performance issues [08:36:28] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435128 (https://phabricator.wikimedia.org/T194273) (owner: 10Marostegui) [08:36:42] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1092 for alter table (duration: 01m 20s) [08:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:48] !log Add tmp1 back on db1092 - https://phabricator.wikimedia.org/T194273 [08:37:01] marostegui: could you add info to the incident report about why we thought the index wasn't used? it seems to me like that's an important puzzle piece in the question of how to avoid this in the future. [08:37:34] DanielK_WMDE_: As I said, performance_schema isn't 100% reliable when it comes to finding out if the index is unused or not. I will add some info later [08:38:17] so what to use that it is better? 
[08:38:34] DanielK_WMDE_: The task description also indicated it wasn't used, but we double checked it: https://phabricator.wikimedia.org/T194270 :) [08:38:53] (03PS1) 10Alexandros Kosiaris: Reimage ganeti2004 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/435129 [08:39:35] DanielK_WMDE_: https://www.slideshare.net/jynus/query-optimization-with-mysql-57-and-mariadb-10-even-newer-tricks/97 [08:39:41] there, page 97 of my slide [08:39:42] marostegui: thank you [08:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:13] (03CR) 10Alexandros Kosiaris: [C: 032] Reimage ganeti2004 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/435129 (owner: 10Alexandros Kosiaris) [08:40:17] jynus: while I'https://dev.mysql.com/doc/mysql-perfschema-excerpt/5.7/en/table-io-waits-summary-by-index-usage-table.html [08:40:21] oops. [08:40:31] while i'm waiting for that to load, is this what i'm looking for? https://dev.mysql.com/doc/mysql-perfschema-excerpt/5.7/en/table-io-waits-summary-by-index-usage-table.html [08:40:32] use sys [08:40:37] it is easier [08:40:56] it literally has a table called schema_unused_indexes [08:41:54] (03PS1) 10Jcrespo: mariadb: Reenable db1065 notifications after crash and rebuilding [puppet] - 10https://gerrit.wikimedia.org/r/435130 (https://phabricator.wikimedia.org/T195444) [08:42:33] I also warned about the queries here: https://phabricator.wikimedia.org/T194273#4228564 and set the task to high [08:43:22] maybe a standard could be added so taht queries add expected index/indexes to be used [08:43:27] so it is easier to find them [08:44:45] !log Stop MySQL on db1120 to transfer its content to db1125 - T190704 [08:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:50] T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704 [08:45:40] re [08:45:47] jynus: thank you. [08:45:58] is this info linked on wikitech somewhere? [08:46:09] I think it is [08:46:42] live monitoring and performance analysis of database systems is not something I usually deal with as adeveloper. my knowledge pretty much stops at using EXPLAIN. finding out if and how the indexes i define (or ones that i find dubious) are actually used in production would be usuedful. [08:47:45] literally we enabled performance schema for that [08:48:03] it gives per-file, per query, per user, etc metrics [08:48:17] in real time [08:48:38] it really cannot get better than that [08:48:57] ooooh [08:49:14] of course it is not perfect, but it is THE profiling tool for mysql [08:49:41] jynus: now developers just need to learn that it exists, what it is, and how to use it :) [08:49:41] it is not documented on wikitech, because it is part of the server [08:49:49] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4230856 (10Marostegui) [08:49:55] it is supposed to be just there and being used :-) [08:50:09] morning DanielK_WMDE_ [08:50:17] jynus: most devs, like me, woudn't even know what to ask for. 
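A concrete version of the pointers jynus gives above: with performance_schema enabled, the sys view and the underlying P_S table can be queried directly on a replica. The wb_terms filter is illustrative, and note the caveat this very incident demonstrates: the counters only cover traffic since they were last reset, so "no reads for a day" is not the same as "unused".
```
# Indexes with no recorded reads since the counters last reset:
sudo mysql -e "SELECT * FROM sys.schema_unused_indexes WHERE object_name = 'wb_terms';"
# Per-index usage counters for the same table:
sudo mysql -e "SELECT object_schema, object_name, index_name, count_star, count_read
  FROM performance_schema.table_io_waits_summary_by_index_usage
  WHERE object_name = 'wb_terms' ORDER BY count_star DESC;"
```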
[08:50:17] jynus: it was used [08:50:19] hey addshore [08:50:28] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1114" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435132 [08:50:32] I can train you on that [08:50:39] I have offered that in the past [08:50:41] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1114" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435132 [08:50:54] not you, like anyone- mediawiki devs and contributors [08:51:10] I am happy to help with that [08:51:34] let's setup a day and we can run a session [08:51:41] jynus: yes, thank you for that. I'm wondering how to make thiis offer, and the info of how to learn about this stuff, more discoverable. [08:51:52] and we can even tune the metrics to get more if you need, etc. [08:52:05] i mean, if I wasked myself "is this index being used", i may go to wikitech and search for "unused index" [08:52:15] that search doesn't turn up anything useful [08:52:23] ok, then you go and ask me [08:52:25] :-) [08:52:43] hey, I want to drop X but it may be still in use, please show me how to check [08:53:19] and we can do multiple things, p_s, a canary server, pt-query-digest on slow logs, etc. [08:53:41] as far as I know, amir1 did ask marostegui that question. [08:53:52] * DanielK_WMDE_ was not involved in planning or implementing this [08:54:22] DanielK_WMDE_: As I have said multiple times, I checked PS before dropping it, I didn't check for a week, that is true. I gave it 24h [08:55:19] marostegui: i'm not blaming anyone. i think everyone actually did follow best practice. shit still happened. so i'm trying to find holes in the best practice. [08:56:25] DanielK_WMDE_: Yes, the process can be improved indeed [08:57:15] that's all I'm saying :) and i still find it very curious that it didn't blow up right away. that's... sneaky. something unusual must have triggered a pile-on [08:57:38] DanielK_WMDE_: That is a weird part too, because it was running without it for lots of hours [08:58:26] yes. slow fuse. i wonder what happened. [08:58:29] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1114" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435132 (owner: 10Marostegui) [08:58:33] (03PS2) 10Jcrespo: mariadb: Reenable db1065 notifications after crash and rebuilding [puppet] - 10https://gerrit.wikimedia.org/r/435130 (https://phabricator.wikimedia.org/T195444) [08:59:42] DanielK_WMDE_: no idea, but I guess there will be 1 query that suddenly decided to scan the whole table? [08:59:42] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1114" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435132 (owner: 10Marostegui) [08:59:56] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1114" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435132 (owner: 10Marostegui) [08:59:57] do we have a way of finding that out? / which version of the queries hitting that table it was [09:00:17] there were also holes in the process in how that index (with interesting name 'tmp1') came to exist without further documentation in the first place [09:00:36] indeed :) [09:00:46] mark: yup [09:00:59] did we ever grep though SAL and phab to see if we could find anything? [09:01:15] one thing I'd ask is: when i search for "unused indexes" or "undex usage" or "index performance" or "database performance" on wikitech, i should find a pointer information on performance_schema and how to use it. 
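On the "which queries hit that table, and what plan do they get" question above, one possible workflow (a sketch, not a documented procedure): pull the normalized statement shapes from performance_schema's digest table, then EXPLAIN a representative one. The literal values are made up; the LIKE 'b%' shape mirrors the query discussed a few lines further down ([09:10:59]).
```
# Most frequent normalized statements touching wb_terms:
sudo mysql -e "SELECT count_star, digest_text
  FROM performance_schema.events_statements_summary_by_digest
  WHERE digest_text LIKE '%wb_terms%' ORDER BY count_star DESC LIMIT 20;"
# Plan for a representative shape (values illustrative):
sudo mysql -e "EXPLAIN SELECT term_entity_id, term_text
  FROM wikidatawiki.wb_terms
  WHERE term_language = 'en' AND term_type = 'label'
    AND term_entity_type = 'item' AND term_search_key LIKE 'b%';"
```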
[09:02:15] * DanielK_WMDE_ now wonders off to deal with the wonderful world of paperwork-needed-when-getting-divorced [09:02:37] DanielK_WMDE_: please add that then :) [09:03:12] mark: the only thing i know about it is that it exists ;) [09:03:22] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1114 after alter table (duration: 01m 20s) [09:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:45] I will have a look at phab tickets referencing the index [09:04:30] There is an index named tmp1 over term_language, term_type, term_entity_type, term_search_key on that table, which is not in the source (we might want to fix that). <-- from 2014 [09:04:35] heh [09:05:00] comments tossed into the middle of tickets don't become action items.... [09:05:21] (03PS1) 10Marostegui: db-eqiad.php: Depool db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435135 (https://phabricator.wikimedia.org/T190148) [09:07:29] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435135 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [09:07:52] apergos: heh [09:07:58] https://phabricator.wikimedia.org/T85414 this comment did though, but it was closed as invalid, unclear why [09:08:07] well crap [09:08:21] now let's see if I can find any mention of it being introduced [09:08:41] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435135 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [09:08:45] * mark reopens that and sets prio: high [09:08:56] https://phabricator.wikimedia.org/T47529#518889 here [09:09:10] honestly, people are focusing on the index [09:09:19] and I don't think that is the problem [09:09:26] and I don't mean that figuratively [09:09:41] like, "it is the process" [09:09:51] lots of changes where ongoing on that table [09:10:22] and the query optimizer may have decided to change its plan [09:10:27] !log marostegui@tin scap failed: average error rate on 9/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/2cc7028226a539553178454fc2f14459 for details) [09:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:48] literally one table column was being blanked [09:10:59] and the query contains like 'b%' [09:11:19] which is a #1 case of "data changes changing the query plan" [09:11:39] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435135 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [09:11:51] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1119" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435136 [09:12:40] (03CR) 10Jcrespo: [C: 032] mariadb: Reenable db1065 notifications after crash and rebuilding [puppet] - 10https://gerrit.wikimedia.org/r/435130 (https://phabricator.wikimedia.org/T195444) (owner: 10Jcrespo) [09:13:15] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1119" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435136 (owner: 10Marostegui) [09:14:15] well I wonder... 
if it would be worth it at some point, much has been done in cleaning up the schema for all the production servers, whether looking for indices not in the source would be useful [09:14:21] in case there might be other instances of this [09:14:28] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1119" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435136 (owner: 10Marostegui) [09:14:30] (03PS3) 10Mobrovac: Proton: Apply the role to proton hosts [puppet] - 10https://gerrit.wikimedia.org/r/434312 (https://phabricator.wikimedia.org/T186748) [09:14:41] apergos: there are thousands of those examples [09:14:52] I hope not!! [09:14:58] I am not kidding [09:15:18] oh my [09:16:04] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1119 after alter table (duration: 01m 19s) [09:16:06] (03PS1) 10Marostegui: db-eqiad.php: Depool db1105:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435137 (https://phabricator.wikimedia.org/T190148) [09:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:32] apergos: why do you think I filed https://phabricator.wikimedia.org/T104459 [09:16:35] :-D [09:16:58] because only one or 2 issues :-) [09:17:13] no, and you inherited many many [09:17:24] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1105:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435137 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [09:17:27] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1119" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435136 (owner: 10Marostegui) [09:17:30] but hopefully only a few of those are index issues? (why yes I am an eternal optimist) [09:17:40] (03CR) 10Mobrovac: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/434312 (https://phabricator.wikimedia.org/T186748) (owner: 10Mobrovac) [09:17:48] apergos: There were even PKs involved which we fixed already (at least lots of them) [09:18:01] apergos: And PKs on big tables like revision [09:18:04] I know about at least a few of those, they impacted me :-D [09:18:13] 10Operations, 10Wikidata, 10Wikimedia-General-or-Unknown, 10MW-1.32-release-notes (WMF-deploy-2018-05-15 (1.32.0-wmf.4)), and 3 others: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4230899 (10Vachovec1) Added #wikimedia-inc... [09:18:48] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1105:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435137 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [09:19:06] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 35, down: 2, dormant: 0, excluded: 0, unused: 0 [09:19:20] PROBLEM - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2687 bytes in 1.610 second response time [09:19:26] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0 [09:19:48] mmmm [09:20:00] XioNoX: ema: are you looking into that? [09:20:02] maintenance? 
[09:20:14] there is telia ongoing [09:20:21] looking [09:20:23] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1105:3311 for alter table (duration: 01m 20s) [09:20:26] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job={varnish-text,varnish-upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [09:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:29] !log Deploy schema change on db1105:3311 - T190148 T191519 T188299 [09:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:35] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [09:20:35] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [09:20:35] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [09:20:42] do we need to depool uslsfo? [09:21:17] uep it's telia scheduled [09:21:19] jynus: yes [09:21:26] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [09:21:28] ok, prepping up the patch [09:21:47] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [09:21:48] akosiaris: thanks [09:22:05] I am checking the impact [09:22:15] uh? [09:22:24] what happened? [09:22:27] I think someone mentioned reroutiing through chicago recently [09:22:30] 4 hour window [09:22:31] (03PS1) 10Alexandros Kosiaris: Depool ulsfo, having issues [dns] - 10https://gerrit.wikimedia.org/r/435139 [09:22:36] both again? [09:22:37] maybe 2 links went down? [09:22:56] yes [09:22:56] (03PS1) 10Ayounsi: Depool ulsfo - both transport are down [dns] - 10https://gerrit.wikimedia.org/r/435140 [09:22:57] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [09:23:00] both links are down [09:23:02] ffs [09:23:04] both scheduled? [09:23:09] (03CR) 10Alexandros Kosiaris: [C: 032] Depool ulsfo, having issues [dns] - 10https://gerrit.wikimedia.org/r/435139 (owner: 10Alexandros Kosiaris) [09:23:10] https://gerrit.wikimedia.org/r/#/c/435140/ [09:23:13] only 1 [09:23:30] (03CR) 10Ayounsi: [C: 032] Depool ulsfo - both transport are down [dns] - 10https://gerrit.wikimedia.org/r/435140 (owner: 10Ayounsi) [09:23:48] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1105:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435137 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [09:23:51] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2701 bytes in 1.636 second response time [09:23:58] icinga shows alerts for both the telia circuit numbers in their maintenance notification [09:23:59] jynus: did you push a dns change? 
my CR can't be merged [09:24:02] 10Operations, 10Wikidata, 10Wikimedia-General-or-Unknown, 10MW-1.32-release-notes (WMF-deploy-2018-05-15 (1.32.0-wmf.4)), and 3 others: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4230915 (10Marostegui) p:05Unbreak!>03H... [09:24:04] akosiaris, XioNoX:you both merged a different change I think? [09:24:10] judging from the above [09:24:19] or pushed and tried to merge [09:24:26] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster={cache_text,cache_upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:24:32] I did push and made sure a specific change is indeed on ns hosts [09:24:40] +geoip/generic-map/ulsfo => DOWN [09:24:50] I did not touch dns recently [09:24:59] jynus: crap, I told amir1 he could blank that column. we were under the impression that nothing was using it any more. I didn't double-check. i thought he did... [09:25:04] akosiaris: ok, thx [09:25:07] re-running authdns-update does not show any other changes [09:25:16] DanielK_WMDE_: not now [09:25:24] k [09:25:40] XioNoX: just to be clear, were you asking me because you saw some pending change mine [09:25:47] or just because I mentioned it [09:25:57] in case there was something wrong [09:26:06] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0 [09:26:32] jynus: because you mentioned it, didn't know if we did the change at the same time [09:26:34] or maybe the issue was only local to your repo? [09:26:52] I did not do any change, thanked alex his [09:26:52] so, both Telia transport on cr1-uslfo, and Zayo transport on cr2-ulsfo are down [09:27:05] but only telia was scheduled [09:27:07] PROBLEM - Maps edge ulsfo on upload-lb.ulsfo.wikimedia.org is CRITICAL: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) is CRITICAL: Test scaled pushpin marker with an icon returned the unexpected status 503 (expecting: 200) [09:27:08] we are ok, just to be clear, dns-wise, right? [09:27:09] rught ? [09:27:32] yes [09:27:35] akosiaris: there are serveral one-week long ones [09:27:38] akosiaris: We just got an email from zayo a minute ago, yes [09:27:45] zayo is from 8 hours ago [09:28:01] DNS is fine right now, just to clarify [09:28:04] (outage started then) [09:28:22] akosiaris: thanks [09:28:33] 10Operations, 10Traffic: Error: 503, Backend fetch failed - https://phabricator.wikimedia.org/T195563#4230923 (10RazeSoldier) [09:29:14] http errors have started going down on varnishes [09:29:14] there is a string of zayo emails indeed; fiber cut [09:29:21] yep [09:29:47] according to graphs it's still quite high though... 50k/min we are not out of the woods yet [09:29:55] 10Operations, 10Traffic: Error: 503, Backend fetch failed - https://phabricator.wikimedia.org/T195563#4230923 (10jcrespo) Can you retry, it should be solved now or shortly (or may need a refresh of your browser)? [09:30:05] 10Operations, 10Traffic: Error: 503, Backend fetch failed - https://phabricator.wikimedia.org/T195563#4230923 (10Marostegui) We are on it [09:31:08] (03CR) 10Muehlenhoff: Proton: Apply the role to proton hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/434312 (https://phabricator.wikimedia.org/T186748) (owner: 10Mobrovac) [09:31:16] it is getting close to 0 quickly [09:31:30] it is ? 
[09:31:36] https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json?refresh=5m&panelId=7&fullscreen&orgId=1&from=now-15m&to=now ? [09:31:38] getting, not yet there [09:31:44] 10Operations, 10Traffic: Error: 503, Backend fetch failed - https://phabricator.wikimedia.org/T195563#4230953 (10RazeSoldier) >>! In T195563#4230935, @jcrespo wrote: > Can you retry, it should be solved now or shortly (or may need a refresh of your browser)? Yes, I retry but this problem still exists. [09:31:46] Am I looking at the wrong graph ? [09:32:01] mmm [09:32:02] it's still pretty high.. is something else also going on ? [09:32:08] I am looking at gets [09:32:33] conectivity errors on edges should not generate 503? [09:32:54] enphasis on "?" [09:33:38] jynus: if it is like the last time, we lost connectivity between ulsfo and codfw/eqiad [09:33:47] ah, it's starting to drop now [09:34:02] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 [09:34:03] I think several graphs have different delay [09:34:04] I can see only ulsfo 50x in https://logstash.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X [09:34:04] seems like the various DNS caches are starting to get the update [09:34:12] but going down [09:34:19] "Yes, I retry but this problem still exists." [09:34:36] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 17389 bytes in 0.455 second response time [09:34:36] RECOVERY - Maps edge ulsfo on upload-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy [09:34:36] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 76, down: 0, dormant: 0, excluded: 0, unused: 0 [09:34:45] it's already 12 mins since I merged the change [09:34:51] what is our ttl, 5 minutes? [09:35:06] RECOVERY - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 17388 bytes in 0.457 second response time [09:35:16] ^it may recover before it fully switches :-( [09:35:25] 10Operations, 10Traffic: Error: 503, Backend fetch failed - https://phabricator.wikimedia.org/T195563#4230955 (10RazeSoldier) It seems that I can visit wikitech but other WMF‘s websites cannot access. [09:35:34] jynus: yes 5 mins [09:35:40] from Varnish-Webrequest-50X the top ips seems bots, so the might not be as responsive as we wish to dns changes? [09:36:03] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0 [09:36:34] I see less tan 13K/min [09:36:39] to ulsfo [09:36:42] 10Operations, 10Traffic: Error: 503, Backend fetch failed - https://phabricator.wikimedia.org/T195563#4230962 (10RazeSoldier) I can visit now, but I don't know what happened. [09:37:39] we are now getting double to usual amount of requests, but I guess this is expected [09:37:42] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 [09:37:52] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [09:38:05] 10Operations, 10Traffic: Error: 503, Backend fetch failed - https://phabricator.wikimedia.org/T195563#4230963 (10jcrespo) Not your fault, issues with connectivity on certain regions on the world- a datacenter was disabled to workaround it. Thank you for the quick report, it helped! 
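The depool above was a geodns admin-state change ("+geoip/generic-map/ulsfo => DOWN", [09:24:40]) with the roughly 5-minute TTL discussed at [09:34:51]. A sketch of double-checking propagation from the outside; ns0 is one of the authoritative nameservers, while the record and resolver chosen here are illustrative.
```
# Authoritative answer (no resolver cache in the way) vs. a public resolver's cached view:
dig +noall +answer en.wikipedia.org A @ns0.wikimedia.org
dig +noall +answer en.wikipedia.org A @8.8.8.8
# The site's LVS record itself keeps resolving; only the geo mapping stops sending users there:
dig +short text-lb.ulsfo.wikimedia.org A
```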
[09:38:59] jynus: unrelated, but fyi db1120's uplink is at ~100% [09:39:11] I 'll refrain from repooling ulsfo until 12:00 UTC [09:39:17] k [09:39:21] the telia window is until then [09:39:40] db1120? [09:39:45] XioNoX: I am transfering from db1120 to db1125 [09:39:46] a, manuel maybe cloning it [09:39:54] akosiaris: it's night there, let's wait to have update on the Zayo link too [09:40:00] you said you didn't want to be notified? [09:40:03] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [09:40:10] we can add you again to the tickets [09:40:24] XioNoX: agreed [09:40:35] +1 [09:40:43] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:41:00] jynus: nop, no need to notify me :) just saw it while looking at the network's state [09:41:03] akosiaris: it did went mostly to codfw, didn't it? [09:41:10] XioNoX: I didn't notify you directly as you said it was not necessary, I did !log it though :) [09:41:42] marostegui: I should look at those logs first indeed, thx for the reminder [09:41:52] yea, it seems so [09:42:03] jynus: yes [09:42:59] the second time this happens in 2 weeks [09:45:42] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [09:46:33] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [09:48:12] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [09:54:47] zayo email: "services have been verified as restored." 
[10:00:17] (03PS1) 10Addshore: Revert "Don't load PropertySuggester" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435147 [10:00:19] (03PS4) 10Mark Bergsma: Use passed-in reactor in all monitors [debs/pybal] - 10https://gerrit.wikimedia.org/r/434684 [10:00:21] (03PS4) 10Mark Bergsma: Extend unit testing of RunCommand [debs/pybal] - 10https://gerrit.wikimedia.org/r/433702 [10:00:23] (03PS4) 10Mark Bergsma: Use MemoryReactorClock for monitor unit tests and adopt tests [debs/pybal] - 10https://gerrit.wikimedia.org/r/434685 [10:00:25] (03PS3) 10Mark Bergsma: Adapt ProxyFetch tests to use tcpClients and sslClients [debs/pybal] - 10https://gerrit.wikimedia.org/r/434695 [10:02:04] (03CR) 10Mark Bergsma: Extend unit testing of RunCommand (032 comments) [debs/pybal] - 10https://gerrit.wikimedia.org/r/433702 (owner: 10Mark Bergsma) [10:04:00] 10Operations, 10Wikidata, 10Wikimedia-General-or-Unknown, 10MW-1.32-release-notes (WMF-deploy-2018-05-15 (1.32.0-wmf.4)), and 4 others: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4231052 (10hoo) [10:05:04] (03CR) 10Mark Bergsma: [C: 032] Use passed-in reactor in all monitors [debs/pybal] - 10https://gerrit.wikimedia.org/r/434684 (owner: 10Mark Bergsma) [10:05:41] (03Merged) 10jenkins-bot: Use passed-in reactor in all monitors [debs/pybal] - 10https://gerrit.wikimedia.org/r/434684 (owner: 10Mark Bergsma) [10:06:41] PROBLEM - very high load average likely xfs on ms-be1034 is CRITICAL: CRITICAL - load average: 188.34, 107.07, 62.92 [10:07:21] 10Operations, 10Domains, 10Traffic: HTTP 500 on invalid domain - https://phabricator.wikimedia.org/T195568#4231073 (10Tgr) [10:07:26] anyone looking at ^^^ ? [10:14:17] 10Operations, 10monitoring: status.wikimedia.org showing all lights green during major outage - https://phabricator.wikimedia.org/T195530#4231075 (10Peachey88) [10:17:12] I can't even login via SSH due to the high load, will powercycle ms-be1034 [10:18:43] PROBLEM - MD RAID on ms-be1034 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 [10:18:43] ACKNOWLEDGEMENT - MD RAID on ms-be1034 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T195569 [10:18:48] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1034 - https://phabricator.wikimedia.org/T195569#4231083 (10ops-monitoring-bot) [10:19:40] ah, or rather a hardware issue [10:20:31] PROBLEM - Disk space on ms-be1034 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdb3 is not accessible: Input/output error [10:20:46] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frpig1001 - https://phabricator.wikimedia.org/T187365#4231102 (10ayounsi) [10:20:49] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frdata1001 - https://phabricator.wikimedia.org/T187364#4231103 (10ayounsi) [10:24:08] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1034 - https://phabricator.wikimedia.org/T195569#4231108 (10MoritzMuehlenhoff) a:03Cmjohnson [10:37:25] !log test force mtu 1400 between cp1074 and cp3039 - T195365 [10:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:30] T195365: cp intermittent IPsec MTU issue - https://phabricator.wikimedia.org/T195365 [10:44:15] PROBLEM - DPKG on ms-be1034 is CRITICAL: Return code of 255 is out of bounds [10:45:35] PROBLEM - swift-container-replicator on ms-be1034 is CRITICAL: Return code of 255 is out of bounds [10:45:35] 
PROBLEM - swift-object-server on ms-be1034 is CRITICAL: Return code of 255 is out of bounds [10:45:36] PROBLEM - Check systemd state on ms-be1034 is CRITICAL: Return code of 255 is out of bounds [10:45:36] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1034 is CRITICAL: Return code of 255 is out of bounds [10:45:36] PROBLEM - configured eth on ms-be1034 is CRITICAL: Return code of 255 is out of bounds [10:45:55] PROBLEM - Check size of conntrack table on ms-be1034 is CRITICAL: Return code of 255 is out of bounds [10:45:55] PROBLEM - dhclient process on ms-be1034 is CRITICAL: Return code of 255 is out of bounds [10:46:05] PROBLEM - swift-object-auditor on ms-be1034 is CRITICAL: Return code of 255 is out of bounds [10:46:06] PROBLEM - swift-account-server on ms-be1034 is CRITICAL: Return code of 255 is out of bounds [10:46:06] PROBLEM - swift-container-updater on ms-be1034 is CRITICAL: Return code of 255 is out of bounds [10:46:06] PROBLEM - swift-account-reaper on ms-be1034 is CRITICAL: Return code of 255 is out of bounds [10:46:15] PROBLEM - swift-account-replicator on ms-be1034 is CRITICAL: Return code of 255 is out of bounds [10:46:16] PROBLEM - swift-object-replicator on ms-be1034 is CRITICAL: Return code of 255 is out of bounds [10:46:25] PROBLEM - swift-container-server on ms-be1034 is CRITICAL: Return code of 255 is out of bounds [10:46:25] PROBLEM - swift-account-auditor on ms-be1034 is CRITICAL: Return code of 255 is out of bounds [10:46:35] PROBLEM - swift-container-auditor on ms-be1034 is CRITICAL: Return code of 255 is out of bounds [10:46:35] PROBLEM - swift-object-updater on ms-be1034 is CRITICAL: Return code of 255 is out of bounds [10:46:35] PROBLEM - puppet last run on ms-be1034 is CRITICAL: Return code of 255 is out of bounds [10:52:35] PROBLEM - Host cp3048 is DOWN: PING CRITICAL - Packet loss = 100% [10:54:15] RECOVERY - Host cp3048 is UP: PING WARNING - Packet loss = 80%, RTA = 83.67 ms [10:57:25] PROBLEM - Check the NTP synchronisation status of timesyncd on ms-be1034 is CRITICAL: Return code of 255 is out of bounds [11:03:25] PROBLEM - Check systemd state on lawrencium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:11:16] PROBLEM - IPMI Sensor Status on ms-be1034 is CRITICAL: Return code of 255 is out of bounds [11:11:35] PROBLEM - HP RAID on ms-be1034 is CRITICAL: Return code of 255 is out of bounds [11:15:40] * akosiaris looking at ms-be1034 [11:17:11] [760661.434084] sd 0:1:0:1: rejecting I/O to offline device [11:17:12] great [11:17:40] I 'll powercycle, box is unresponsive [11:18:58] !log powercycling ms-be1034, box is unresposive, tons of logs "sd 0:1:0:1: rejecting I/O to offline device" [11:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:35] PROBLEM - Host ms-be1034 is DOWN: PING CRITICAL - Packet loss = 100% [11:21:33] !log rebalance row_B codfw ganeti nodegroup. 
Cluster is now fully upgraded to stretch [11:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:14] !log upgrade eqiad ganeti cluster to ganeti 2.15.2-7+deb9u1~bpo8+1 [11:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:25] RECOVERY - dhclient process on ms-be1034 is OK: PROCS OK: 0 processes with command name dhclient [11:22:26] RECOVERY - Check size of conntrack table on ms-be1034 is OK: OK: nf_conntrack is 0 % full [11:22:35] RECOVERY - swift-object-auditor on ms-be1034 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [11:22:35] RECOVERY - Host ms-be1034 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms [11:22:36] RECOVERY - swift-account-server on ms-be1034 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [11:22:36] RECOVERY - swift-container-updater on ms-be1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [11:22:36] RECOVERY - swift-account-reaper on ms-be1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [11:22:45] RECOVERY - swift-account-replicator on ms-be1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [11:22:46] RECOVERY - swift-object-replicator on ms-be1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [11:22:55] RECOVERY - very high load average likely xfs on ms-be1034 is OK: OK - load average: 29.28, 8.17, 2.79 [11:22:55] RECOVERY - DPKG on ms-be1034 is OK: All packages OK [11:22:55] RECOVERY - swift-container-server on ms-be1034 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [11:22:55] RECOVERY - swift-account-auditor on ms-be1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [11:23:05] RECOVERY - swift-container-auditor on ms-be1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:23:05] RECOVERY - swift-object-updater on ms-be1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [11:23:06] RECOVERY - swift-container-replicator on ms-be1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [11:23:06] RECOVERY - Disk space on ms-be1034 is OK: DISK OK [11:23:06] RECOVERY - swift-object-server on ms-be1034 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [11:23:10] ok box is up, waiting for nagios to report on the disks [11:23:15] RECOVERY - Check systemd state on ms-be1034 is OK: OK - running: The system is fully operational [11:23:16] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1034 is OK: OK ferm input default policy is set [11:23:16] RECOVERY - configured eth on ms-be1034 is OK: OK - interfaces up [11:23:25] RECOVERY - MD RAID on ms-be1034 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [11:23:41] heh, I would have expected some disk failure... [11:25:15] RECOVERY - Check the NTP synchronisation status of timesyncd on ms-be1034 is OK: OK: synced at Fri 2018-05-25 11:25:07 UTC. 
[11:25:22] I thought moritz powercycled that box a little earlier this morning [11:26:36] RECOVERY - HP RAID on ms-be1034 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [11:26:55] RECOVERY - IPMI Sensor Status on ms-be1034 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [11:27:04] nothing in sal or my IRC logs [11:27:05] RECOVERY - puppet last run on ms-be1034 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:29:00] graphs also don't point anything [11:29:12] in that direction I mean [11:29:46] zayo just said "Not yet hands off. They are currently testing the ONI's" [11:30:17] ic [11:32:12] !log switch to SSH RSA 2048 bit keys for eqiad ganeti intracluster communication [11:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:32] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:15:32] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:15:43] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=PUT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:15:53] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:26:23] etcd I guess [12:28:02] !log fixed dpkg installation state on mx2001 [12:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:14] RECOVERY - DPKG on mx2001 is OK: All packages OK [12:35:54] 10Operations, 10Traffic: Error: 503, Backend fetch failed - https://phabricator.wikimedia.org/T195563#4231327 (10ayounsi) 05Open>03Resolved a:03ayounsi Our San Francisco datacenter is linked to our infrastructure by 2 links. 1 link had a planned maintenance, the other had an outage at the wrong time. as... [12:44:32] (03PS1) 10Alexandros Kosiaris: apertium: Pin python3-tornado to jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/435159 (https://phabricator.wikimedia.org/T194883) [12:47:02] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:49:08] (03CR) 10Addshore: [C: 04-1] "not yet..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435147 (owner: 10Addshore) [12:53:08] (03CR) 10Alexandros Kosiaris: "https://puppet-compiler.wmflabs.org/compiler02/11277/ says this is fine, but I am not gonna merge on a Friday afternoon. 
Scheduling this f" [puppet] - 10https://gerrit.wikimedia.org/r/435159 (https://phabricator.wikimedia.org/T194883) (owner: 10Alexandros Kosiaris) [13:00:59] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "ircecho: replace base::service_unit with systemd::service" [puppet] - 10https://gerrit.wikimedia.org/r/435076 (https://phabricator.wikimedia.org/T195552) (owner: 10Alex Monk) [13:01:01] (03PS3) 10Alexandros Kosiaris: Revert "ircecho: replace base::service_unit with systemd::service" [puppet] - 10https://gerrit.wikimedia.org/r/435076 (https://phabricator.wikimedia.org/T195552) (owner: 10Alex Monk) [13:01:03] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Revert "ircecho: replace base::service_unit with systemd::service" [puppet] - 10https://gerrit.wikimedia.org/r/435076 (https://phabricator.wikimedia.org/T195552) (owner: 10Alex Monk) [13:07:16] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1105:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435164 [13:11:33] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1105:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435164 (owner: 10Marostegui) [13:13:05] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1105:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435164 (owner: 10Marostegui) [13:13:09] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1105:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435164 (owner: 10Marostegui) [13:13:54] (03CR) 10Muehlenhoff: [C: 031] "Looks good. Given that atop isn't part of a standard Debian install we can remove the atop->purged once puppet has ran successfully across" [puppet] - 10https://gerrit.wikimedia.org/r/428930 (https://phabricator.wikimedia.org/T192551) (owner: 10Jcrespo) [13:14:39] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1105:3311 after alter table (duration: 01m 20s) [13:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:57] !log Add indexes back on s8 codfw primary master (db2045) this will generate lag on codfw - T194273 [13:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:01] T194273: Clean up indexes of wb_terms table - https://phabricator.wikimedia.org/T194273 [13:16:10] 10Operations, 10Mail, 10Patch-For-Review: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#4231406 (10faidon) >>! In T175361#4164353, @herron wrote: > mx2001 has been running Stretch for a few days and has been stable. I think we're in good shape to move on to mx1001. However, t... 
[13:21:24] (03PS1) 10Marostegui: db-eqiad.php: Depool db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435166 (https://phabricator.wikimedia.org/T190148) [13:21:29] (03CR) 10Jcrespo: [C: 031] standard_packages: Remove atop from every WMF machine [puppet] - 10https://gerrit.wikimedia.org/r/428930 (https://phabricator.wikimedia.org/T192551) (owner: 10Jcrespo) [13:23:38] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435166 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [13:25:03] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435166 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [13:26:56] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1119 for alter table (duration: 01m 20s) [13:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:20] !log  Deploy schema change on db1119 - https://phabricator.wikimedia.org/T190148 https://phabricator.wikimedia.org/T191519 https://phabricator.wikimedia.org/T188299 [13:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:17] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4231426 (10Papaul) [13:29:24] (03PS3) 10Faidon Liambotis: standard_packages: Remove atop from every WMF machine [puppet] - 10https://gerrit.wikimedia.org/r/428930 (https://phabricator.wikimedia.org/T192551) (owner: 10Jcrespo) [13:29:39] (03CR) 10Faidon Liambotis: [C: 032] standard_packages: Remove atop from every WMF machine [puppet] - 10https://gerrit.wikimedia.org/r/428930 (https://phabricator.wikimedia.org/T192551) (owner: 10Jcrespo) [13:29:52] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435166 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [13:30:12] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:32:12] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:32:22] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:33:03] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:33:42] (03PS1) 10Faidon Liambotis: labs_vmbuilder/bootstrapvz: remove atop [puppet] - 10https://gerrit.wikimedia.org/r/435168 [13:34:12] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:34:22] 10Operations, 10Wikidata, 10Wikimedia-General-or-Unknown, 10MW-1.32-release-notes (WMF-deploy-2018-05-15 (1.32.0-wmf.4)), and 4 others: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520#4229748 (10Jeff_G) Suggestions for the fut... [13:36:43] PROBLEM - puppet last run on pybal-test2002 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. 
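For context, a minimal sketch of the depool/alter/repool cycle that the entries above trace, run from the deployment host after the mediawiki-config change has been merged in Gerrit; the staging path is the usual one, but the pull step and the commit message are illustrative rather than copied from this session:

# pick up the merged mediawiki-config change in the staging checkout
cd /srv/mediawiki-staging
git pull
# push just the changed file to the app servers, with a log message that ends up in SAL
scap sync-file wmf-config/db-eqiad.php 'Depool db1119 for alter table'
# ...run the schema change on the depooled replica, then merge the revert and
# sync the same file again to repool it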
Failed resources (up to 3 shown): Package[atop] [13:37:22] PROBLEM - DPKG on pybal-test2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:42:36] (03PS1) 10Muehlenhoff: Remove at [puppet] - 10https://gerrit.wikimedia.org/r/435171 [13:43:53] RECOVERY - DPKG on pybal-test2002 is OK: All packages OK [13:45:11] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4231455 (10Papaul) [13:46:32] (03PS1) 10Alexandros Kosiaris: Revert "Depool ulsfo, having issues" [dns] - 10https://gerrit.wikimedia.org/r/435172 [13:46:53] RECOVERY - puppet last run on pybal-test2002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:47:04] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "Depool ulsfo, having issues" [dns] - 10https://gerrit.wikimedia.org/r/435172 (owner: 10Alexandros Kosiaris) [13:47:12] PROBLEM - Device not healthy -SMART- on db2059 is CRITICAL: cluster=mysql device=cciss,11 instance=db2059:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2059&var-datasource=codfw%2520prometheus%252Fops [13:47:58] !log repool ulsfo, links have been stable for quite a few hours [13:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:04] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4231464 (10Marostegui) db2094 looking good! ``` Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name : RAID Level... [13:57:44] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4231513 (10Papaul) a:05Papaul>03Marostegui @Marostegui All done [13:59:37] 10Operations, 10Traffic: Error: 503, Backend fetch failed - https://phabricator.wikimedia.org/T195563#4231523 (10RazeSoldier) >>! In T195563#4231327, @ayounsi wrote: > as soon as we noticed the issue, we disabled that datacenter, redirecting the users to a functional datacenter. In this case, is the disabling... [14:00:28] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4081506 (10Marostegui) [14:00:31] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4231526 (10Marostegui) 05Open>03Resolved db2095 looks good now! ``` Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name... [14:01:11] 10Operations, 10Traffic: Error: 503, Backend fetch failed - https://phabricator.wikimedia.org/T195563#4231538 (10ayounsi) Manual. It would be automatic in an ideal world, but not enough resources to work on that. [14:09:57] 10Operations, 10Mail, 10Patch-For-Review: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#4231563 (10herron) A +1 (or other feedback) on https://gerrit.wikimedia.org/r/#/c/429456/ would be a huge help to keep this moving. I'm hesitant to self merge that fleet-wide change. [14:13:42] (03CR) 10ArielGlenn: "There's some lab hosts like labvirt* and so on that are still trusty; it might be worth a ping to see if atd comes up in their stuff." 
[puppet] - 10https://gerrit.wikimedia.org/r/435171 (owner: 10Muehlenhoff) [14:30:57] (03PS1) 10Ayounsi: Reserve IPs for cr2-ulsfo-cr1-eqdfw tunnel [dns] - 10https://gerrit.wikimedia.org/r/435177 (https://phabricator.wikimedia.org/T195584) [14:33:39] 10Puppet, 10Cloud-VPS, 10MediaWiki-Vagrant: Vagrant -> mwvagrant alias in role::labs::mediawiki_vagrant is brittle - https://phabricator.wikimedia.org/T195592#4231634 (10Tgr) [14:35:51] (03CR) 10Ayounsi: [C: 032] Reserve IPs for cr2-ulsfo-cr1-eqdfw tunnel [dns] - 10https://gerrit.wikimedia.org/r/435177 (https://phabricator.wikimedia.org/T195584) (owner: 10Ayounsi) [14:41:15] twentyafterfour: hi! when you have a sec, I have a question about the version of CentralNotice that has been riding the train... thx in advance! [14:43:13] basically, just trying to figure out why it's at 6492626fc2 rather than the tip of the wmf_deploy branch (5a622c19b978), which is where it normally gets set to [14:48:12] (03CR) 10Herron: [C: 031] "Looks good! Minor comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/434894 (https://phabricator.wikimedia.org/T195423) (owner: 10Volans) [14:48:56] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure: Host deployment-puppetdb01 is DOWN: CRITICAL - Host Unreachable (10.68.23.76) - https://phabricator.wikimedia.org/T187736#4231684 (10EddieGP) >>! In T187736#4230486, @Krenair wrote: > (and also it should've been being used for things like SSH host key ga... [15:11:39] 10Operations, 10DBA: db1061 (s6 primary master) has a wrong live server_id - needs a MySQL restart - https://phabricator.wikimedia.org/T195595#4231733 (10Marostegui) [15:13:05] 10Operations, 10DBA: db1061 (s6 primary master) has a wrong live server_id - needs a MySQL restart - https://phabricator.wikimedia.org/T195595#4231749 (10Marostegui) p:05Triage>03Normal [15:13:27] 10Puppet, 10Beta-Cluster-Infrastructure: deployment-secureredirexperiment puppet error - https://phabricator.wikimedia.org/T191663#4231753 (10EddieGP) DNS entries instance-deployment-secureredirexperiment.deployment-prep.wmflabs.org. and *.secureredirtest.wmflabs.org. can probably go away as well?
[15:17:25] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1119" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435181 [15:19:50] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1119" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435181 (owner: 10Marostegui) [15:21:21] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1119" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435181 (owner: 10Marostegui) [15:23:04] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1119 after alter table (duration: 01m 20s) [15:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:44] (03PS1) 10Marostegui: sanitarium_multi: Hardcode db1125 server_id [puppet] - 10https://gerrit.wikimedia.org/r/435182 (https://phabricator.wikimedia.org/T195595) [15:31:14] 10Operations, 10DBA, 10Patch-For-Review: db1061 (s6 primary master) has a wrong live server_id - needs a MySQL restart - https://phabricator.wikimedia.org/T195595#4231814 (10Marostegui) [15:32:21] (03CR) 10Marostegui: [C: 032] sanitarium_multi: Hardcode db1125 server_id [puppet] - 10https://gerrit.wikimedia.org/r/435182 (https://phabricator.wikimedia.org/T195595) (owner: 10Marostegui) [15:36:51] 10Operations, 10DBA, 10Patch-For-Review: db1061 (s6 primary master) has a wrong live server_id - needs a MySQL restart - https://phabricator.wikimedia.org/T195595#4231822 (10Marostegui) [15:41:32] AndyRussG: the train got held up yesterday because of an unrelated outage so wmf.5 never got pushed [15:41:46] That _might_ be the reason? I'll have to look to confirm [15:42:31] twentyafterfour: thx! don't think so, because it's not the previous head of wmf_deploy branch, either [15:42:51] Hmm, that's odd. [15:42:54] it's like something did get pushed out, but it's a parent commit of the head of wmf_deploy [15:42:56] yeah [15:43:12] I have no idea why that'd be. [15:43:26] I see what you mean, it shouldn't happen like that [15:43:30] In both 4 and 5 branches, I get: $ cat .git/modules/extensions/CentralNotice/HEAD [15:43:32] afcf52a265273c91c32e1f03c408a679a3a67ea9 [15:43:50] yeah that's not quite right, but I can't think of any reason it'd be like that [15:44:07] yeah I've never seen that before... just wanted to figure it out because I'd like to push out a new update on a swatty ploy sometime soon [15:44:20] and just want to make sure I understand current state :) [15:45:21] I definitely haven't touched it. I'd say we should watch for this next week to be sure that make-wmf-branch isn't broken (perhaps it's creating it this way, somehow? I really can't think of a reason why but this is one of the only special-case extensions that don't just get branched weekly like the rest) [15:46:29] correction: it's the only special-case in make-wmf-branch config [15:46:45] "Set a string for the specific commit, branch, or tag. DO NOT USE THIS OR CHAD WILL BE ANGRY", [15:46:52] (03PS1) 10Marostegui: *.hosts: Add db1125 and db2078 [software] - 10https://gerrit.wikimedia.org/r/435184 (https://phabricator.wikimedia.org/T190704) [15:47:11] (03PS1) 10Chad: Adding quota plugin to stable-2.15 fork as well [software/gerrit/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/435185 [15:47:14] no_justification: can you think of a reason that branching behavior might have changed? 
(03CR) 10Chad: [V: 032 C: 032] Adding quota plugin to stable-2.15 fork as well [software/gerrit/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/435185 (owner: 10Chad) [15:47:18] see above [15:47:20] 10Operations, 10DBA, 10Patch-For-Review: db1061 (s6 primary master) has a wrong live server_id - needs a MySQL restart - https://phabricator.wikimedia.org/T195595#4231733 (10jcrespo) [15:47:34] I don't touch make-wmf-branch anymore [15:47:42] I've been trying to *replace* it [15:47:47] yeah I don't think anyone has touched it for a while [15:47:57] But CentralNotice has always been a wonky special snowflake [15:48:06] I don't trust that behavior in the script [15:49:10] (03CR) 10Jcrespo: *.hosts: Add db1125 and db2078 (031 comment) [software] - 10https://gerrit.wikimedia.org/r/435184 (https://phabricator.wikimedia.org/T190704) (owner: 10Marostegui) [15:49:12] (03CR) 10EddieGP: [C: 031] Fix Ifa0b210f: Fix another caller of this function to not break [puppet] - 10https://gerrit.wikimedia.org/r/435075 (https://phabricator.wikimedia.org/T191553) (owner: 10Alex Monk) [15:49:52] no_justification: twentyafterfour: certainly deserves fixing... Still the form of this snowflake has been for a while that whatever the tip of wmf_deploy is, that's what automagically gets put on the train [15:50:03] (03PS2) 10Marostegui: *.hosts: Add db1125 and db2078 [software] - 10https://gerrit.wikimedia.org/r/435184 (https://phabricator.wikimedia.org/T190704) [15:50:19] and this time it did happen, just that the commit is one before the branch tip [15:50:27] I'd much rather just yank off the freaking band-aid already [15:50:31] And deploy from master [15:50:33] ouch!!! [15:50:34] Like everyone else [15:50:36] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1119" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435181 (owner: 10Marostegui) [15:50:47] CentralNotice isn't that special ;-) [15:50:54] hmmmm [15:51:15] The amount of effort we've put into propping up its specialness + /discuss/ one day not being like that..... [15:51:17] well there are times when we really want to keep it stable, especially when the big year-end FR campaigns are happening [15:51:18] * no_justification cries a little [15:51:23] Yeah to maintain the current workflow couldn't you just commit to a dev branch and merge that to master when it's ready to deploy? [15:51:28] Then don't merge broken stuff to master ;) [15:51:38] s/broken/non-stable/ [15:51:39] :) [15:51:42] I don't [15:51:46] Or hide behind feature flags [15:51:52] I mean, I don't merge any broken stuff to master [15:51:56] There's a dozen different ways to keep master stable :) [15:52:19] (03CR) 10Marostegui: [C: 032] *.hosts: Add db1125 and db2078 [software] - 10https://gerrit.wikimedia.org/r/435184 (https://phabricator.wikimedia.org/T190704) (owner: 10Marostegui) [15:52:32] some other devs do merge stuff that I haven't gotten to review, and I always just give it a once-over before pushing it out via wmf_deploy [15:52:43] however I sure can understand ur frustration! [15:53:11] could we remove access to master and let people merge to a dev branch? [15:53:16] Indeed. [15:53:21] (03Merged) 10jenkins-bot: *.hosts: Add db1125 and db2078 [software] - 10https://gerrit.wikimedia.org/r/435184 (https://phabricator.wikimedia.org/T190704) (owner: 10Marostegui) [15:53:23] and then just give master access to AndyRussG?
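A minimal sketch of the dev-branch arrangement floated just above, assuming a development branch literally named dev and a placeholder feature branch; both names are illustrative and the Gerrit review that would normally gate each merge is left out:

# day-to-day CentralNotice work lands on a development branch
git checkout dev
git merge --no-ff some-feature-branch   # placeholder branch name
# once dev is judged deploy-ready, promote it to master, which would then
# play the role wmf_deploy plays today
git checkout master
git merge --no-ff dev
git push origin master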
[15:53:33] Either we trust people to merge stuff, or we don't :) [15:53:42] hmmmm [15:53:53] I guess I generally prefer the trust-people model [15:54:11] how is that ^ any different from the current system, other than the branch names? [15:54:29] there's gotta be at least one Phab task about this btw, gonna look [15:54:53] (03PS4) 10Herron: ELK: change elasticsearch index prefix to logstash-syslog for syslog type [puppet] - 10https://gerrit.wikimedia.org/r/431860 (https://phabricator.wikimedia.org/T193766) [15:55:26] https://phabricator.wikimedia.org/T113428 [15:55:54] 10Operations, 10Beta-Cluster-Infrastructure, 10Patch-For-Review, 10Prometheus-metrics-monitoring, 10User-fgiunchedi: Move deployment-prep redis instances to stretch - https://phabricator.wikimedia.org/T179371#4231914 (10EddieGP) These are two out of four remaining trusty instances in deployment-prep, and... [15:56:33] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4231916 (10Marostegui) [15:57:17] twentyafterfour no_justification I can ask that we prioritize this for discussion with the rest of fr-tech [15:57:38] seems pretty clear that continuing to snowflake has been more trouble than it's worth [15:57:41] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4081506 (10Marostegui) [15:58:24] we're in a pretty good place in terms of FR campaign timings to fix stuff like this (though there is also other urgent-y stuff on our plate..) [15:58:58] AndyRussG: thank you. I know I've been harping on this for like 2 years... [15:59:16] no_justification: no worries, likewise thanks much and apologies for taking so long [16:01:00] AndyRussG: I'm not on the train next week but I'll remind whoever is to watch out for this so that it doesn't get the wrong commit next week [16:01:03] just for pushing out one change early next week, I guess it's ok to update wmf_deploy as usual and push it out on a SWAT deploy, right? Whoever does the deploy can check the status of stuff on tin at the time, no? [16:01:17] twentyafterfour: ok thanks, I can also be around :) [16:01:35] thx again :) [16:02:25] still not sure if there are normal deploys on Monday (U.S.
holiday) (even though they are there on the Deployments page) [16:05:21] twentyafterfour: We *usually* do non-train deploys on US-only holidays [16:05:22] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/431860 (https://phabricator.wikimedia.org/T193766) (owner: 10Herron) [16:05:33] (03CR) 10Herron: ELK: change elasticsearch index prefix to logstash-syslog for syslog type (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/431860 (https://phabricator.wikimedia.org/T193766) (owner: 10Herron) [16:05:47] I'll be attempting to deploy the train on monday to get us caught up [16:06:24] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler02/11279/logstash1007.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/431860 (https://phabricator.wikimedia.org/T193766) (owner: 10Herron) [16:06:33] (03CR) 10Gehel: [C: 031] ELK: change elasticsearch index prefix to logstash-syslog for syslog type (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/431860 (https://phabricator.wikimedia.org/T193766) (owner: 10Herron) [16:06:36] twentyafterfour: Well, non normal schedule train [16:06:45] Like, we wouldn't do a deploy on a Tuesday we had off :) [16:06:51] herron: ^^^ nice cleanup! [16:07:26] I keep clarifying it until I'm correct ;-) [16:08:18] gehel: thx! sound ok to merge eary next week? [16:08:48] herron: sure! Ping me if you need me for it [16:09:01] awesome sounds like a plan! [16:11:57] twentyafterfour: hmm I imagine your non-normal Monday train deploy would just leave the CN submodule where it is, and I should be around instead for the Tuesday train, no? [16:12:44] (can be around Monday, too, though... If there are swatty ploys on Monday actually I might try to get our wee update pushed out then, I guess...) [16:13:22] AndyRussG: on monday I just plan to push wmf.5 to group2 because it didn't happen yesterday [16:13:34] twentyafterfour: right, gotcha :) [16:14:02] I can also push an update to centralnotice at the same time if you'd like, or just pick any swat you'd like [16:15:00] twentyafterfour: which do you think would be easier/best? [16:15:50] I was gonna ask u around what time, since if I book a swatty after you've updated group2, then the change would only have to go to one branch [16:19:27] AndyRussG: I plan to do it around the normal train window time so ~ noon pacific [16:20:08] twentyafterfour: ah ok cool... Mmmm I should be online around then [16:20:27] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4231989 (10Marostegui) 05stalled>03Open The definitive hardware for eqiad is now in place and replicating: db1124: s1, s3,... [16:21:14] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4231995 (10Marostegui) [16:22:03] PROBLEM - Check systemd state on lawrencium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:24:29] thx much! [16:24:39] AndyRussG: you're welcome [16:24:43] :) [16:25:13] RECOVERY - Check systemd state on lawrencium is OK: OK - running: The system is fully operational [16:25:38] :) [16:28:02] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1065 storage crash - https://phabricator.wikimedia.org/T195444#4232018 (10Marostegui) Is the scope of this task finished? 
[16:28:43] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1065 storage crash - https://phabricator.wikimedia.org/T195444#4232019 (10jcrespo) See T195444#4230827 [16:29:18] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1065 storage crash - https://phabricator.wikimedia.org/T195444#4232025 (10Marostegui) Ah, thanks! missed it :) [16:40:39] (03PS2) 10Andrew Bogott: labs_vmbuilder/bootstrapvz: remove atop [puppet] - 10https://gerrit.wikimedia.org/r/435168 (owner: 10Faidon Liambotis) [16:42:06] (03CR) 10Andrew Bogott: [C: 032] labs_vmbuilder/bootstrapvz: remove atop [puppet] - 10https://gerrit.wikimedia.org/r/435168 (owner: 10Faidon Liambotis) [17:11:41] can someone run this on deploy1001.eqiad.wmnet and paste the result, for https://phabricator.wikimedia.org/T194927 ? apt-cache policy npm [17:13:32] (done in -releng) [17:14:32] there is no "npm" package name there [17:15:11] yeah [17:15:25] seems it got renamed to "node"? [17:16:04] or nodejs [17:16:10] node-* are the other related packages [17:16:47] npm is not in the nodejs package on debian. (it is on the official nodejs site though) [17:17:37] so there's no stretch packaging of "npm" at all? [17:17:44] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Watching / External): Update Debian package of Blubber (0.4.0-1) - https://phabricator.wikimedia.org/T195609#4232208 (10dduvall) [17:17:45] nope [17:17:57] unless you use https://nodejs.org/en/download/package-manager/ [17:17:59] I see related other modules in stretch apt: [17:18:02] node-npm-run-path - Get your PATH prepended with locally installed binaries [17:18:04] node-npmlog - Logger with custom levels and colored output for Node.js [17:18:07] node-pkg-dir - find the root directory of a npm package [17:18:10] but yeah no actual npm package [17:18:41] for nodejs itself: [17:18:43] Candidate: 6.11.0~dfsg-1+wmf1 [17:18:43] Version table: [17:18:43] 8.11.1~dfsg-2~bpo9+1 100 [17:18:43] 100 http://mirrors.wikimedia.org/debian stretch-backports/main amd64 Packages [17:18:46] 6.11.0~dfsg-1+wmf1 1001 [17:18:49] 1001 http://apt.wikimedia.org/wikimedia stretch-wikimedia/main amd64 Packages [17:18:52] 4.8.2~dfsg-1 500 [17:18:54] 500 http://mirrors.wikimedia.org/debian stretch/main amd64 Packages [17:19:19] the world of software is changing [17:22:45] curl https://www.npmjs.org/install.sh | sudo -i sh [17:22:49] who needs packages anymore anyways :) [17:23:28] nnnnnoooooooo [17:23:48] oh god [17:23:53] why [17:23:53] apergos: didn't you see the https in the URL? it's secure, so you don't have anything to fear! [17:24:00] lololol [17:25:01] twentyafterfour: is the train planning to continue today? [17:26:59] 10Operations, 10Phabricator, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568#4232221 (10Dzahn) Yes, fully agree. We had already made a very similar plan on IRC, just wasn't sure which specific server to pick. I'll go... [17:30:20] legoktm: no, Monday. [17:30:32] great [17:30:59] (yes it's a holiday, but EU ops will be around and so will mukunda) [17:31:33] well I was otherwise going to revert something out of the branch, so this is better :p [17:31:45] Monday will be Spring Bank Holiday in the UK too :) [17:34:10] Krenair: I said EU :P [17:34:35] but yeah, Greece is also out.
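To make the stretch npm/nodejs exchange above concrete, a rough sketch of what checking and installing would look like on a stretch host; the package names and versions come from the paste above, and pulling from stretch-backports with -t is ordinary apt usage rather than something that was run here:

# show which repositories carry the packages and at what pin priority
apt-cache policy nodejs npm
# the newer nodejs (8.x) sits in stretch-backports, so it has to be requested
# explicitly; a plain "apt-get install nodejs" would pick the pinned 6.11 build
sudo apt-get install -t stretch-backports nodejs
# there is no "npm" package in stretch at all, so npm itself would have to come
# from upstream (e.g. the nodejs.org packages) rather than from Debian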
[17:34:45] my brexit joke doesn't hold water [17:35:16] 10Operations, 10Traffic, 10Wikimedia-Hackathon-2018: Create and deploy a centralized letsencrypt service - https://phabricator.wikimedia.org/T194962#4232230 (10Krenair) >>! In T194962#4230570, @Krenair wrote: > Random upstream problem I noticed while browsing: https://tickets.puppetlabs.com/browse/PUP-8890... [17:42:10] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#4232240 (10Dzahn) >>! In T175288#4227190, @Dzahn wrote: > We'll need the database grants above added or that might block the swi... [17:43:46] monday vacation indeed for me [17:52:39] (03CR) 10Dzahn: "the revert broke it on einsteinium it seems" [puppet] - 10https://gerrit.wikimedia.org/r/435076 (https://phabricator.wikimedia.org/T195552) (owner: 10Alex Monk) [17:53:43] (03CR) 10Alex Monk: "Huh, what is the error there? This should be a fairly simple revert..." [puppet] - 10https://gerrit.wikimedia.org/r/435076 (https://phabricator.wikimedia.org/T195552) (owner: 10Alex Monk) [18:02:06] (03CR) 10jerkins-bot: [V: 04-1] ircecho: remove 'restart => true' from base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/435196 (owner: 10Dzahn) [18:02:13] (03CR) 10Dzahn: "Error 500 on SERVER:.. Base::Service_unit[ircecho]: has no parameter named 'restart' at /etc/puppet/modules/ircecho/manifests/init.pp:36 o" [puppet] - 10https://gerrit.wikimedia.org/r/435076 (https://phabricator.wikimedia.org/T195552) (owner: 10Alex Monk) [18:02:55] (03PS1) 10Alex Monk: Fix Icdf33762: Revert "ircecho: Add restart => true to systemd::service" [puppet] - 10https://gerrit.wikimedia.org/r/435197 [18:02:59] (03CR) 10Dzahn: "base::service_unit does not have a parameter named "restart"." 
[puppet] - 10https://gerrit.wikimedia.org/r/435076 (https://phabricator.wikimedia.org/T195552) (owner: 10Alex Monk) [18:03:02] mutante, bstorm_: https://gerrit.wikimedia.org/r/#/c/435197/ [18:04:12] (03CR) 10Alex Monk: "I8992614d" [puppet] - 10https://gerrit.wikimedia.org/r/435076 (https://phabricator.wikimedia.org/T195552) (owner: 10Alex Monk) [18:04:45] (03PS2) 10Dzahn: ircecho: remove 'restart => true' parameter, unbreak [puppet] - 10https://gerrit.wikimedia.org/r/435196 [18:05:07] surprised git didn't complain about that tbh, it's pretty close to the other commit [18:05:19] (03CR) 10jerkins-bot: [V: 04-1] ircecho: remove 'restart => true' parameter, unbreak [puppet] - 10https://gerrit.wikimedia.org/r/435196 (owner: 10Dzahn) [18:06:01] (03CR) 10Paladox: [C: 031] ircecho: remove 'restart => true' parameter, unbreak [puppet] - 10https://gerrit.wikimedia.org/r/435196 (owner: 10Dzahn) [18:06:03] (03CR) 10Bstorm: [C: 031] Fix Icdf33762: Revert "ircecho: Add restart => true to systemd::service" [puppet] - 10https://gerrit.wikimedia.org/r/435197 (owner: 10Alex Monk) [18:06:17] (03CR) 10Dzahn: "invalid commit message" [puppet] - 10https://gerrit.wikimedia.org/r/435196 (owner: 10Dzahn) [18:07:57] (03PS3) 10Dzahn: ircecho: remove restart => true parameter, unbreak [puppet] - 10https://gerrit.wikimedia.org/r/435196 [18:08:32] (03CR) 10Dzahn: [C: 032] Fix Icdf33762: Revert "ircecho: Add restart => true to systemd::service" [puppet] - 10https://gerrit.wikimedia.org/r/435197 (owner: 10Alex Monk) [18:08:48] (03CR) 10Dzahn: [C: 032] ircecho: remove restart => true parameter, unbreak [puppet] - 10https://gerrit.wikimedia.org/r/435196 (owner: 10Dzahn) [18:09:49] (03Abandoned) 10Alex Monk: Fix Icdf33762: Revert "ircecho: Add restart => true to systemd::service" [puppet] - 10https://gerrit.wikimedia.org/r/435197 (owner: 10Alex Monk) [18:09:50] (03CR) 10Dzahn: [C: 032] "duplicate of https://gerrit.wikimedia.org/r/#/c/435196/ sorry i had to pick one and that one the rebase race" [puppet] - 10https://gerrit.wikimedia.org/r/435197 (owner: 10Alex Monk) [18:12:34] 10Puppet, 10Analytics, 10Beta-Cluster-Infrastructure: deployment-eventlog05 puppet error about missing mysql heartbeat.heartbeat table - https://phabricator.wikimedia.org/T191109#4232297 (10Krenair) [18:14:22] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:14:32] 10Operations, 10MediaWiki-Platform-Team, 10HHVM, 10User-ArielGlenn: Run all jobs on PHP7 or HHVM - https://phabricator.wikimedia.org/T195393#4232314 (10Petar.petkovic) [18:16:48] (03CR) 10Dzahn: [C: 032] "14:14 < icinga-wm> RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failure" [puppet] - 10https://gerrit.wikimedia.org/r/435196 (owner: 10Dzahn) [18:17:15] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure: Host deployment-puppetdb01 is DOWN: CRITICAL - Host Unreachable (10.68.23.76) - https://phabricator.wikimedia.org/T187736#4232319 (10Krenair) According to the audit log, @joe shut it down 10 Jan 2018, 2:51 p.m. [18:20:22] (03PS1) 10RobH: snapshot1008 partman recipe selected [puppet] - 10https://gerrit.wikimedia.org/r/435202 (https://phabricator.wikimedia.org/T195385) [18:21:29] (03CR) 10RobH: [C: 032] snapshot1008 partman recipe selected [puppet] - 10https://gerrit.wikimedia.org/r/435202 (https://phabricator.wikimedia.org/T195385) (owner: 10RobH) [18:22:44] _joe_, do you remember why deployment-puppetdb was shut down? 
need to get puppetdb in deployment-prep working again, wanted to check what was up before I power that back up [18:26:09] (03CR) 10Bstorm: [C: 031] Fix Ifa0b210f: Fix another caller of this function to not break [puppet] - 10https://gerrit.wikimedia.org/r/435075 (https://phabricator.wikimedia.org/T191553) (owner: 10Alex Monk) [18:30:03] 10Operations, 10Traffic, 10Wikimedia-Hackathon-2018: Create and deploy a centralized letsencrypt service - https://phabricator.wikimedia.org/T194962#4232346 (10Krenair) I'm going to find out what's going on with puppet DB in T187736, in the mean time my patch for it looked like this (completely untested and... [18:33:46] anyone know about failed login limits on the projects? got someone asking in wikimedia-tech, their bot is blocked [18:34:56] Brian Wolff might [18:35:30] also Reedy [18:35:57] (03CR) 10Dzahn: [C: 032] Fix Ifa0b210f: Fix another caller of this function to not break [puppet] - 10https://gerrit.wikimedia.org/r/435075 (https://phabricator.wikimedia.org/T191553) (owner: 10Alex Monk) [18:36:03] and uh... tgr / anomie? [18:36:57] it's set via $wgPasswordAttemptThrottle [18:37:15] normally something like 5/min [18:37:32] plus you trigger captchas after a few failed logins [18:37:38] We've had a few bots getting upset by it recently [18:39:37] [ 'count' => 5, 'seconds' => 300 ], [18:39:40] [ 'count' => 150, 'seconds' => 60 * 60 * 48 ], [18:39:51] Many bots seem to be badly coded, and do a loooot of logging in [18:41:03] anyone feel like giving them that explanation in the other channel? [18:41:22] I'm more or less clocked out... [18:41:43] tgr, Reedy: can one of you log in and reset it for an individual user? [18:42:03] No [18:42:11] Because working out wtf the key is is a bastard [18:42:11] (03PS4) 10Thcipriani: Pipeline: setup minikube in CI [puppet] - 10https://gerrit.wikimedia.org/r/428010 (https://phabricator.wikimedia.org/T188936) [18:42:28] See also https://phabricator.wikimedia.org/T194506 [18:42:38] just create a Throttler, it's not that hard [18:42:43] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1947 bytes in 0.091 second response time [18:43:16] anyway let's follow up in -tech [18:43:54] wait, the task is about account creation limits, is that the same issue? [18:44:43] No, it's not [18:44:48] But building the cache key is probably similarly confusing [18:44:51] And hard to mitigate [18:45:00] wikitech docs are broken [18:47:22] Reedy: need to go afk for a few minutes, will add the correct method to the task afterwards [18:47:23] Reedy: It's not that hard if you just find the log entry in Kibana. Although I note the log entry tries to use 'ip' as a key, that apparently is getting overwritten by logstash at some later stage. [18:47:41] Cheers. And update wikitech too ;) [18:48:04] jem, Just FYI: they're talking about the problem above (from 19:33 UTC). When they have a solution, they will reply in -tech channel or the phab task. :) [18:48:54] $cache->makeGlobalKey( 'throttler', $throttle, $index, $ip, md5( $username ) ) where all the variables come from the log entry. Except, as noted, $ip is being overwritten so you have to extract it from 'message' instead.
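A rough sketch of how that key could be computed and cleared from a maintenance host, built around the makeGlobalKey() call quoted above; the wiki, IP and username are placeholders to be copied from the throttler log entry in Kibana, the 'password' type and 0 index are assumptions about what that entry contains, and it assumes the throttle counters live in the local-cluster object cache:

# feed a few lines of PHP to eval.php on the affected wiki (all values below are placeholders)
mwscript eval.php --wiki=enwiki <<'PHP'
$cache = ObjectCache::getLocalClusterInstance();
$key = $cache->makeGlobalKey( 'throttler', 'password', 0, '203.0.113.42', md5( 'SomeBotAccount' ) );
var_dump( $cache->delete( $key ) );
PHP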
[18:49:33] https://wikitech.wikimedia.org/wiki/Increasing_account_creation_threshold [18:49:34] Says [18:49:34] mwscript mcc.php --wiki=$wiki [18:49:34] > delete $wiki:acctcreate:ip:$ip [18:49:34] > exit [18:49:39] Which is clearly wrong :) [18:49:41] (now) [18:50:08] Reedy: That looks like account creation rather than login. But yes, it's not right. [18:50:14] I know it is [18:50:54] It was mostly in response to [18:50:54] [19:42:36] just create a Throttler, it's not that hard [18:50:55] ;) [18:52:03] 10Operations, 10ContentTranslation, 10ContentTranslation-CXserver, 10ContentTranslation-Deployments: Migrate apertium to SCB - https://phabricator.wikimedia.org/T147288#4232429 (10Petar.petkovic) [18:53:54] 10Operations, 10ContentTranslation, 10ContentTranslation-Deployments, 10Language-Team, and 2 others: Common Database for Content Translation in Beta - https://phabricator.wikimedia.org/T1254#4232434 (10Petar.petkovic) [18:54:03] 10Operations, 10Datasets-General-or-Unknown, 10Dumps-Generation: Rack and setup snapshot1008 - https://phabricator.wikimedia.org/T195385#4232435 (10RobH) a:05RobH>03ArielGlenn [18:54:21] 10Operations, 10Datasets-General-or-Unknown, 10Dumps-Generation: Rack and setup snapshot1008 - https://phabricator.wikimedia.org/T195385#4225718 (10RobH) All yours @ArielGlenn [18:56:38] any chance anyone could do a maint script dry run for me today? https://phabricator.wikimedia.org/T195546 [18:56:55] over all wikis? [18:57:02] wmf.5 wikis [18:57:15] ugh [18:57:22] how do we calculate that? :P [18:57:24] heh [18:57:36] okay, it can be done on monday after the deployment [18:57:37] dblist maths? :P [18:57:49] I can run it over specific group0-3 dblists [18:57:57] We have a group 3? [18:58:04] off by one error [18:59:07] MatmaRex: I can do group0 and group1 at least for now [19:00:01] Reedy: it wouldn't hurt to run it across all wikis, actually, if that's easier [19:00:07] yeah, ti is [19:00:10] but there will probably be no results for the wmf.4 wikis [19:00:20] i mean, hopefully there will be no results [19:00:23] but you never know [19:00:40] it would be good to learn if there are [19:00:49] running [19:02:03] Reedy: dry run, right? [19:02:06] Reedy: i don't know how long it'll take btw [19:02:07] yes [19:02:14] a while? :P [19:03:53] i guess a few hours at worst [19:06:11] 10Blocked-on-Operations, 10ContentTranslation, 10ContentTranslation-Deployments, 10ContentTranslation-Release3, 10LE-Sprint-81: Create Database config for Content Translation in Production - https://phabricator.wikimedia.org/T78775#4232512 (10Petar.petkovic) [19:08:38] 10Blocked-on-Operations, 10ContentTranslation, 10ContentTranslation-CXserver, 10ContentTranslation-Deployments, and 2 others: Separate config for Beta and Production for CXServer - https://phabricator.wikimedia.org/T88793#4232517 (10Petar.petkovic) [19:10:34] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: Upgrade deployment-prep deployment servers to stretch - https://phabricator.wikimedia.org/T192561#4232526 (10EddieGP) [19:13:58] 10Puppet, 10Beta-Cluster-Infrastructure, 10Shinken, 10cloud-services-team, 10Patch-For-Review: labs-puppetmaster/Labs Puppetmaster HTTPS is UNKNOWN since [...] - https://phabricator.wikimedia.org/T191553#4232546 (10EddieGP) 05Open>03Resolved Per the link in the task description `labs-puppetmaster/Lab... 
[19:15:29] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure: Host deployment-puppetdb01 is DOWN: CRITICAL - Host Unreachable (10.68.23.76) - https://phabricator.wikimedia.org/T187736#4232550 (10EddieGP) Apparently the reason was just "unused": https://tools.wmflabs.org/sal/log/AWDgixgYwg13V6286cnS [19:27:19] MatmaRex: 63% the way through arwiki [19:27:22] Gonna take a while [19:28:12] Reedy: :o [19:28:30] Reedy: sounds like that script should have a larger batch size… oh well [19:28:50] Yeah, pretty sure it could have more [19:29:33] 500 or 1000 [19:35:25] (03PS1) 10Dzahn: rename wmf6937 from mw1298 to phab1002 [dns] - 10https://gerrit.wikimedia.org/r/435211 (https://phabricator.wikimedia.org/T190568) [19:42:29] <_joe_> Krenair: nope, no idea. Sorry, I've been off today and I'm still not really around, but I don't remember helping set it up even [19:43:05] <_joe_> Krenair: you might want to ask herron for help, he was involved in the last puppetdb upgrade much more than I was [19:43:13] <_joe_> so he might know more now [19:50:42] Krenair: I’m not sure about the history of it but for sure can help with standing it up [19:50:46] maybe we can connect next week? [19:52:36] 10Operations, 10Wikimedia-Planet, 10Patch-For-Review: upgrade planet instances to stretch - https://phabricator.wikimedia.org/T168490#4232664 (10Dzahn) 15:33 15:31 <@James_F> mutante: Will the feed URL stay the same? If so, most won't notice the switchover. 15:33 15:33 < mutante> James_F:... [19:53:43] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568#4232666 (10mmodell) +1 this sounds like a good plan. [20:01:05] 10Operations, 10Wikimedia-Planet, 10Patch-For-Review: upgrade planet instances to stretch - https://phabricator.wikimedia.org/T168490#4232684 (10Dzahn) @Jdforrester-WMF I should have used T180498. That is the ticket for replacing the software. This here was just for the actual server upgrade .. then we ran i... [20:01:40] (03Draft1) 10Paladox: Planet: Replace rss20.xml with atom.xml (backwards compat filename) [puppet] - 10https://gerrit.wikimedia.org/r/435218 [20:01:42] (03PS2) 10Paladox: Planet: Replace rss20.xml with atom.xml (backwards compat filename) [puppet] - 10https://gerrit.wikimedia.org/r/435218 [20:01:58] (03PS3) 10Paladox: Planet: Replace rss20.xml with atom.xml (backwards compat filename) [puppet] - 10https://gerrit.wikimedia.org/r/435218 (https://phabricator.wikimedia.org/T168490) [20:02:34] that affects rawdog only [20:06:59] (03CR) 10Dzahn: "Is this feed actually an Atom or an RSS 2.0 feed format?" 
[puppet] - 10https://gerrit.wikimedia.org/r/435218 (https://phabricator.wikimedia.org/T168490) (owner: 10Paladox) [20:08:20] (03CR) 10Dzahn: "https://en.wikipedia.org/wiki/Atom_(Web_standard)#Atom_compared_to_RSS_2.0" [puppet] - 10https://gerrit.wikimedia.org/r/435218 (https://phabricator.wikimedia.org/T168490) (owner: 10Paladox) [20:17:00] 10Operations, 10Android-app-feature-Compilations, 10Traffic, 10Wikipedia-Android-App-Backlog, 10Reading-Infrastructure-Team-Backlog (Kanban): Determine where to host zim files for the Android app - https://phabricator.wikimedia.org/T170843#4232707 (10JMinor) [20:22:09] 10Operations, 10hardware-requests: request to assign wmf6937 (mw1298, former imagescaler) as phab1002 - https://phabricator.wikimedia.org/T195623#4232722 (10Dzahn) [20:23:54] 10Operations, 10hardware-requests: request to assign wmf6937 (mw1298, former imagescaler) as phab1002 - https://phabricator.wikimedia.org/T195623#4232736 (10Dzahn) mw1298: https://racktables.wikimedia.org/index.php?page=object&tab=default&object_id=3004 phab1001: https://racktables.wikimedia.org/index.php?pag... [20:25:04] (03PS2) 10Dzahn: rename wmf6937 from mw1298 to phab1002 [dns] - 10https://gerrit.wikimedia.org/r/435211 (https://phabricator.wikimedia.org/T190568) [20:26:08] 10Operations, 10hardware-requests, 10Patch-For-Review: request to assign wmf6937 (mw1298, former imagescaler) as phab1002 - https://phabricator.wikimedia.org/T195623#4232753 (10Dzahn) p:05Triage>03Normal [20:33:49] !log LDAP: added user wmde-leszek to group 'nda' (T195358) [20:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:55] T195358: Add WMDE-leszek to the ldap/nda group - https://phabricator.wikimedia.org/T195358 [20:42:06] (03CR) 10Dzahn: "Ladsgroup's user name is capitalized" [puppet] - 10https://gerrit.wikimedia.org/r/434479 (https://phabricator.wikimedia.org/T195289) (owner: 10ArielGlenn) [20:45:35] 10Operations, 10ops-codfw, 10DBA: db2059 disk on predictive failure - https://phabricator.wikimedia.org/T195626#4232792 (10Marostegui) [20:45:51] 10Operations, 10ops-codfw, 10DBA: db2059 disk on predictive failure - https://phabricator.wikimedia.org/T195626#4232804 (10Marostegui) p:05Triage>03Normal [20:46:49] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2059 is CRITICAL: cluster=mysql device=cciss,11 instance=db2059:9100 job=node site=codfw Marostegui T195626 - The acknowledgement expires at: 2018-05-31 20:46:30. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2059&var-datasource=codfw%2520prometheus%252Fops [20:49:31] 10Operations, 10Wikidata, 10monitoring, 10Patch-For-Review, 10User-Addshore: Add Addshore & possibly other WMDE devs/deployers to the wikidata icinga contact list - https://phabricator.wikimedia.org/T195289#4222037 (10Dzahn) Ladsgroup has agreed on Gerrit. Though i note that the spelling of his Icinga co... 
[20:49:54] 10Operations, 10Wikidata, 10monitoring, 10Patch-For-Review, 10User-Addshore: Add Addshore & possibly other WMDE devs/deployers to the wikidata icinga contact list - https://phabricator.wikimedia.org/T195289#4232818 (10Dzahn) p:05Triage>03Normal [20:54:30] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435284 [20:54:36] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1092" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435284 [20:55:00] (03CR) 10Marostegui: [C: 04-2] "wait for the alter to finish" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/435284 (owner: 10Marostegui) [21:04:20] greg-g: I'd like to deploy https://gerrit.wikimedia.org/r/#/c/435283/ soon-ish [21:07:41] legoktm: what's going on? [21:08:00] "I've been told" is a weird justification I can't trace [21:08:46] sorry, there was discussion on https://gerrit.wikimedia.org/r/#/c/421199/ and thedj pointed out how it wasn't announced properly [21:09:30] I've put in a note in tech news, and I'll send something out to wikitech-ambassadors [21:09:34] heh [21:09:48] legoktm: should proably tag T195625 in the commit message [21:09:48] T195625: Implement a responsive layout for MonoBook - https://phabricator.wikimedia.org/T195625 [21:10:12] i've just noticed you filed it [21:10:41] oook, now I see what's up. (sorry in another meeting right now too) Yeah, effectively a revert of that skin is fine. [21:11:23] I'm still unclear what's exactly wrong with it. [21:12:03] Isarra: I think just announcing(?) [21:16:26] MatmaRex: [21:16:27] commonswiki: commonswiki 2018-05-25 21:14:55: 2.26% done on page; ETA 2018-05-26 02:40:08 [1455100/64403392] 3226.07/sec <0.00% updated> [21:16:35] it's gonna be a long time... [21:17:06] Reedy: lol [21:17:12] And a fucktonne of log lines [21:17:26] Reedy: well, can we let it run over the weekend? :P [21:17:34] Yeah, I'm not going to stop it [21:17:52] The log file is 31M so far [21:20:06] (03PS1) 10Faidon Liambotis: mirrors: update the rsync server for Tails [puppet] - 10https://gerrit.wikimedia.org/r/435287 [21:23:50] _joe_, herron: thanks, sure - I'll start it and see what I can get it to do [21:27:00] !log legoktm@tin Synchronized php-1.32.0-wmf.5/skins/MonoBook/: Temporarily remove responsive support (T195625) (duration: 01m 21s) [21:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:04] T195625: Implement a responsive layout for MonoBook - https://phabricator.wikimedia.org/T195625 [21:38:57] (03PS2) 10Faidon Liambotis: mirrors: update the rsync server for Tails [puppet] - 10https://gerrit.wikimedia.org/r/435287 [21:39:55] (03CR) 10Faidon Liambotis: [C: 032] mirrors: update the rsync server for Tails [puppet] - 10https://gerrit.wikimedia.org/r/435287 (owner: 10Faidon Liambotis) [21:50:30] Is it possible to get icinga pages just for outages of a single site ? :P [21:56:56] #wikidata... 
[21:57:04] (03Abandoned) 10Reedy: Disable updatequerypage for wikitech running on non silver host [puppet] - 10https://gerrit.wikimedia.org/r/292537 (https://phabricator.wikimedia.org/T136926) (owner: 10Reedy) [21:57:14] (03CR) 10Krinkle: "Superseded by I13932fc1c268b4a28f07c" [puppet] - 10https://gerrit.wikimedia.org/r/292537 (https://phabricator.wikimedia.org/T136926) (owner: 10Reedy) [22:10:10] (03PS5) 10Krinkle: Disable DisableAccount on wikis where there are no disabled users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338792 (https://phabricator.wikimedia.org/T106067) (owner: 10Reedy) [22:39:48] 10Operations, 10Domains, 10Traffic: HTTP 500 on invalid domain - https://phabricator.wikimedia.org/T195568#4231062 (10Dzahn) > a domain that doesn't even exist. It exists. The issue is that "stats" exists in DNS in the wikipedia.org zone as an alias for stats.wikimedia.org and stats.wikimedia.org is point... [22:40:36] 10Operations, 10Analytics, 10Analytics-Wikistats, 10Domains, 10Traffic: HTTP 500 on stats.wikipedia.org (invalid domain) - https://phabricator.wikimedia.org/T195568#4233112 (10Dzahn) [22:41:00] 10Operations, 10DBA, 10Regression: MySQL prompt missing trailing space on terbium - https://phabricator.wikimedia.org/T195636#4233115 (10Reedy) [22:41:05] 10Operations, 10DBA, 10Regression: MySQL prompt missing trailing space on terbium - https://phabricator.wikimedia.org/T195636#4233125 (10Reedy) p:05Triage>03Low [22:45:12] 10Operations, 10Analytics, 10Analytics-Wikistats, 10Domains, 10Traffic: HTTP 500 on stats.wikipedia.org (invalid domain) - https://phabricator.wikimedia.org/T195568#4233129 (10Dzahn) option a) delete stats record from the wikipedia.org zone option b) add stats.wikipedia.org to hieradata/role/common/cach... [22:57:19] !log apt.wikimedia.org - import jenkins-debian-glue_0.18.4-wmf3 for jessie-wikimedia (T193910) [22:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:24] T193910: Build and upload jenkins-debian-glue_0.18.4-wmf3 for jessie - https://phabricator.wikimedia.org/T193910 [22:58:15] 10Operations, 10Continuous-Integration-Infrastructure: Build and upload jenkins-debian-glue_0.18.4-wmf3 for jessie - https://phabricator.wikimedia.org/T193910#4233150 (10Dzahn) 05Open>03Resolved [install1002:~/jenkins-debian-glue] $ export REPREPRO_BASE_DIR=/srv/wikimedia [install1002:~/jenkins-debian-glue...