[00:01:28] RECOVERY - cassandra-b service on restbase2004 is OK: OK - cassandra-b is active [00:05:52] PROBLEM - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [00:09:21] 06Operations: create a mailing list for maint-announce mail - https://phabricator.wikimedia.org/T132968#2216392 (10Dzahn) created maint-announce@lists (for less confusion identical name but with .lists. ) https://lists.wikimedia.org/mailman/admin/maint-announce set archives to private added noc@ as admin, su... [00:09:41] 06Operations: create a mailing list for maint-announce mail - https://phabricator.wikimedia.org/T132968#2216393 (10Dzahn) p:05Low>03Normal [00:13:02] 06Operations, 10MediaWiki-General-or-Unknown, 10Traffic, 10Wikimedia-General-or-Unknown, 07HTTPS: securecookies - https://phabricator.wikimedia.org/T119570#2216425 (10Reedy) [00:13:08] (03PS1) 10BBlack: secure WMF-Last-Access cookie [puppet] - 10https://gerrit.wikimedia.org/r/284110 (https://phabricator.wikimedia.org/T119576) [00:13:10] (03PS1) 10BBlack: secure CP cookie [puppet] - 10https://gerrit.wikimedia.org/r/284111 (https://phabricator.wikimedia.org/T119576) [00:22:55] (03CR) 10Reedy: [C: 031] secure WMF-Last-Access cookie [puppet] - 10https://gerrit.wikimedia.org/r/284110 (https://phabricator.wikimedia.org/T119576) (owner: 10BBlack) [00:23:32] (03CR) 10Reedy: [C: 031] secure CP cookie [puppet] - 10https://gerrit.wikimedia.org/r/284111 (https://phabricator.wikimedia.org/T119576) (owner: 10BBlack) [00:24:22] 06Operations, 10MediaWiki-General-or-Unknown, 10Traffic, 10Wikimedia-General-or-Unknown, 07HTTPS: securecookies - https://phabricator.wikimedia.org/T119570#2216483 (10Reedy) [00:24:26] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review, 07Varnish: Mark cookies from varnish as secure - https://phabricator.wikimedia.org/T119576#2216482 (10Reedy) 05Resolved>03Open [00:26:56] (03CR) 10BBlack: [C: 032] secure WMF-Last-Access cookie [puppet] - 
10https://gerrit.wikimedia.org/r/284110 (https://phabricator.wikimedia.org/T119576) (owner: 10BBlack) [00:29:57] !log kraz.codfw.wmnet - initial install, adding to site [00:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:31:59] (03PS1) 10BBlack: varnish redir: wmfusercontent.org -> www.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/284112 (https://phabricator.wikimedia.org/T132452) [00:42:57] !log kraz - signing puppet certs, adding salt keys [00:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:44:39] 06Operations, 13Patch-For-Review: Migrate argon to jessie - https://phabricator.wikimedia.org/T123729#2216505 (10Dzahn) installed kraz.codfw.wmnet - added to puppet, salt, icinga, added mw-rc role [00:47:30] 06Operations, 13Patch-For-Review: Migrate argon to jessie - https://phabricator.wikimedia.org/T123729#2216508 (10Dzahn) next we need: ::Ircserver/Service[ircd]: Provider upstart is not functional on this host [00:52:27] (03PS2) 10Dzahn: interface: move rps::modparams to own file [puppet] - 10https://gerrit.wikimedia.org/r/284083 [00:56:54] 06Operations, 13Patch-For-Review: Migrate argon to jessie - https://phabricator.wikimedia.org/T123729#2216516 (10Dzahn) eh, and this needs a public IP, unlike antimony [00:59:33] 06Operations, 13Patch-For-Review: Migrate argon to jessie - https://phabricator.wikimedia.org/T123729#2216518 (10Dzahn) a:03Dzahn [01:00:31] (03PS1) 10Dzahn: kraz.codfw.wmnet -> kraz.wm.org, needs public IP [puppet] - 10https://gerrit.wikimedia.org/r/284115 (https://phabricator.wikimedia.org/T123729) [01:04:35] (03CR) 10Ori.livneh: "Krinkle: I agree. How do we do it, though? 
I am loathe to hard-code which wikis are in each group, so it either has to be available client" [puppet] - 10https://gerrit.wikimedia.org/r/273990 (https://phabricator.wikimedia.org/T112557) (owner: 10Ori.livneh) [01:07:16] (03PS1) 10Dzahn: kraz.codfw.wmnet -> kraz.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/284116 (https://phabricator.wikimedia.org/T123729) [01:08:00] PROBLEM - puppet last run on kraz is CRITICAL: CRITICAL: Puppet has 3 failures [01:09:35] ACKNOWLEDGEMENT - puppet last run on kraz is CRITICAL: CRITICAL: Puppet has 3 failures daniel_zahn not fully installed yet [01:11:01] (03CR) 10Alex Monk: kraz.codfw.wmnet -> kraz.wikimedia.org (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/284116 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn) [01:11:41] !log restbase2004 - unit cassandra-b is failed [01:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:13:05] (03CR) 10Dzahn: kraz.codfw.wmnet -> kraz.wikimedia.org (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/284116 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn) [01:13:11] (03PS2) 10Dzahn: kraz.codfw.wmnet -> kraz.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/284116 (https://phabricator.wikimedia.org/T123729) [01:13:22] (03CR) 10Krinkle: "argon, not antimony. Right?" [dns] - 10https://gerrit.wikimedia.org/r/284116 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn) [01:13:50] (03PS3) 10Dzahn: kraz.codfw.wmnet -> kraz.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/284116 (https://phabricator.wikimedia.org/T123729) [01:14:16] (03CR) 10Dzahn: "yes, it's clearly not a good idea that i work on both at the same time and time to take a break :)" [dns] - 10https://gerrit.wikimedia.org/r/284116 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn) [01:14:29] mutante: So what's the migration like? 
[01:14:47] We can have MediaWiki emit to both for a while, and then shut down the old one so that clients automatically reconnect to the old one at that point. [01:14:56] to the new one* [01:14:56] i don't know [01:15:18] but that sounds good [01:15:23] mutante: It is possible for sessions to remain on argon while we change irc.wikimedia.org to the new one, right? [01:15:32] no idea [01:16:07] I mean, the ability for irc to communicate doesn't relate to the hostname still resolving to the same IP, right? IRC only resolves the host when creating the connection, not for each UDP packet. [01:16:46] once DNS roll out is complete (12 hours? 24 hours?) and all new clients use the new one and that one is working, we can shut it down and clients will just reconnect. Just like a reboot basically. [01:17:00] mutante: May wanna combine with the May 2nd deployment [01:17:06] which is when the next maintenance reboot is scheduled for irc [01:17:10] i don't know, i just installed a VM, nobody has talked about this [01:17:25] https://phabricator.wikimedia.org/T123729 [01:17:29] tomorrow at least one will exist [01:17:41] then we need unit files [01:17:44] Well, as long as nobody is shutting down or upgrading argon or changing irc.wikimedia.org destination yet :) [01:18:44] could even keep argon running until connections drop off naturally [01:18:47] i know that ticket but nothing about a scheduled reboot [01:19:07] mutante: https://phabricator.wikimedia.org/T122933 [01:19:19] The next change (and restart) is May 2nd [01:19:37] aha [01:19:38] announced so that people are aware of it, since normally reboots must not happen unannounced on irc.wikimedia.org [01:20:22] ori: Yeah, if there's no rush to upgrade argon, then letting it drain naturally over a few days would be preferable. [01:20:43] there is a rush to upgrade [01:20:55] how strongly? [01:21:47] We announced the restart on May 2nd in March (1.5 month heads up). [01:21:57] So it seems unwise to restart or upgrade before that.
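The migration reasoning above hinges on one property: an IRC client resolves irc.wikimedia.org once, at connect time, so flipping DNS does not disturb established sessions. A minimal sketch of that behavior (hypothetical helper name, not part of any Wikimedia tooling):

```python
import socket

# Sketch: the hostname is resolved exactly once, when the connection is
# created. The resulting TCP socket is pinned to an IP address, so a
# later DNS change does not affect the live session -- only new
# connections follow the updated record.
def connect_once(hostname: str, port: int) -> socket.socket:
    ip = socket.gethostbyname(hostname)  # one-time DNS lookup
    return socket.create_connection((ip, port))
```

This is why argon can keep serving existing connections after irc.wikimedia.org starts pointing at kraz, and why clients simply reconnect to the new host once the old one is shut down.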
[01:22:01] i cant quantify it but when trying to kill precise you never get to it, because there is always one small reason to not rush [01:22:08] until forever [01:23:26] the plan wasnt to upgrade or restart though, it was to start a new one and shutdown the old one [01:23:35] We can have the new VM set up and accepting connections now, then configure MW to feed it, switch DNS, and after the deployment on May 2nd argon can just go down (rather than restart) [01:23:42] as far as you can call it a plan [01:23:55] Krinkle: sounds good to me [01:23:56] mutante: Sure, but transition or restart, either way interrupts users. [01:24:10] and should be announced :) [01:24:31] The current May 2nd change actually only involves mw-config changes, no restart of the service. [01:24:33] i know, that is what makes people not touch it [01:24:41] ok [01:25:03] So we can't ride it silently, we'll need to send out a separate announcement that we'll also restart (or rather, move) the service to a different server. [01:25:19] Most of which we can do before May 2nd if it's non-disruptive. [01:25:46] presumably configuring MW and changing DNS can all be done without affecting persistent connections. [01:26:22] 06Operations, 10ops-codfw: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2216584 (10Peachey88) [01:26:58] and long before that ... [01:27:04] the services would have to start [01:31:21] 06Operations, 10Traffic, 10domains: Register nlwikipedia.org to prevent squatting - https://phabricator.wikimedia.org/T128968#2216597 (10Dzahn) Any news from legal?
[01:31:42] 06Operations, 10Traffic, 06WMF-Legal, 10domains: Register nlwikipedia.org to prevent squatting - https://phabricator.wikimedia.org/T128968#2216598 (10Dzahn) [01:32:33] ACKNOWLEDGEMENT - gitblit process on furud is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/java .*-jar gitblit.jar daniel_zahn still upcoming [01:33:50] Apr 19 01:31:18 restbase2004 cassandra[11544]: Exception encountered during startup: Other bootstrapping/leaving/moving nodes detected, cannot bootstrap w...t is true [01:33:53] Apr 19 01:31:18 restbase2004 cassandra[11544]: WARN 01:31:18 No local state or state is in silent shutdown, not announcing shutdown [01:34:55] !log restbase2004 - starting crashed cassandra-b service [01:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:36:31] (03CR) 10Ori.livneh: [C: 031] "the secure flag doesn't prevent the cookie value from being read by client-side javascript code, so this is fine." [puppet] - 10https://gerrit.wikimedia.org/r/284111 (https://phabricator.wikimedia.org/T119576) (owner: 10BBlack) [01:39:07] 06Operations, 10RESTBase-Cassandra: service cassandra-b fails on restbase2004 - https://phabricator.wikimedia.org/T132999#2216604 (10Dzahn) [01:40:25] ACKNOWLEDGEMENT - cassandra-b service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed daniel_zahn has been going on for many hours. notification issue ? - https://phabricator.wikimedia.org/T132999 [01:41:56] 06Operations, 13Patch-For-Review: Migrate argon to jessie - https://phabricator.wikimedia.org/T123729#2216620 (10Dzahn) a:05Dzahn>03None [01:44:30] PROBLEM - puppet last run on mw1111 is CRITICAL: CRITICAL: Puppet has 1 failures [01:44:49] 06Operations, 07Need-volunteer: smokeping config puppetization issue? 
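The review comment at 01:36:31 is worth unpacking: the `Secure` cookie attribute only restricts transport to HTTPS; it does not hide the value from client-side JavaScript, which is the job of `HttpOnly`. An illustrative pair of response headers (the cookie values here are made up, not the actual varnish output):

```
# Secure: the cookie is only transmitted over HTTPS, but page scripts
# can still read it via document.cookie.
Set-Cookie: WMF-Last-Access=19-Apr-2016; Secure

# Adding HttpOnly is what hides the value from scripts:
Set-Cookie: session=abc123; Secure; HttpOnly
```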
- https://phabricator.wikimedia.org/T131326#2216635 (10Dzahn) p:05Triage>03Low [01:46:08] 06Operations, 10Monitoring, 07Graphite: Allow customizing the alert message from graphite - https://phabricator.wikimedia.org/T95801#2216637 (10Dzahn) [01:47:21] 06Operations, 10RESTBase, 06Services, 10Traffic: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#2216638 (10BBlack) [01:47:42] 06Operations, 10RESTBase, 06Services, 10Traffic: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#2216653 (10BBlack) p:05Triage>03Normal [01:49:01] (03CR) 10BBlack: [C: 032] secure CP cookie [puppet] - 10https://gerrit.wikimedia.org/r/284111 (https://phabricator.wikimedia.org/T119576) (owner: 10BBlack) [02:02:59] 06Operations, 10Traffic, 06WMF-Legal, 10domains: Register nlwikipedia.org to prevent squatting - https://phabricator.wikimedia.org/T128968#2091529 (10jayvdb) Offtopic a little perhaps, but is anything being done wrt trademark violation of enwikipedia.org ? If wmf legal cant / wont take that domain to trib... [02:05:46] 06Operations, 10Traffic, 06WMF-Legal, 10domains: Register nlwikipedia.org to prevent squatting - https://phabricator.wikimedia.org/T128968#2216663 (10Peachey88) >>! In T128968#2216661, @jayvdb wrote: > Offtopic a little perhaps, but is anything being done wrt trademark violation of enwikipedia.org ? If wm... [02:06:40] !log systemctl mask cassandra-b on restbase2004.codfw.wmnet (it should not be running) [02:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:10:20] 06Operations, 10RESTBase-Cassandra: service cassandra-b fails on restbase2004 - https://phabricator.wikimedia.org/T132999#2216604 (10Eevans) This node should not be running, it is administratively down; I'm not sure what happened that it started to send notifications now. 
[02:11:21] RECOVERY - puppet last run on mw1111 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:22:00] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.21) (duration: 09m 47s) [02:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:31:46] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Apr 19 02:31:46 UTC 2016 (duration 9m 46s) [02:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:36:58] PROBLEM - puppet last run on mw1148 is CRITICAL: CRITICAL: Puppet has 77 failures [02:37:57] PROBLEM - puppet last run on restbase2004 is CRITICAL: CRITICAL: Puppet has 1 failures [02:42:50] 06Operations, 13Patch-For-Review: Migrate argon to jessie - https://phabricator.wikimedia.org/T123729#2216681 (10Krinkle) Proposed migration plan after discussing with @Dzahn and @ori on IRC: * Set up kraz (Jessie; VM) to be a replacement for argon (Precise; metal). * Update MediaWiki wmf-config to broadcast... 
[02:43:49] 06Operations, 13Patch-For-Review: Migrate argon (irc.wikimedia.org) to Jessie - https://phabricator.wikimedia.org/T123729#2216684 (10Krinkle) [02:44:07] 06Operations, 13Patch-For-Review, 07developer-notice: Migrate argon (irc.wikimedia.org) to Jessie - https://phabricator.wikimedia.org/T123729#1936574 (10Krinkle) [02:44:59] 06Operations, 13Patch-For-Review, 07developer-notice, 07notice: Migrate argon (irc.wikimedia.org) to Jessie - https://phabricator.wikimedia.org/T123729#1936574 (10Krinkle) [02:51:42] (03PS1) 10Jcrespo: Repool pc1006 and pc2006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284123 [02:53:07] (03CR) 10Ori.livneh: [C: 031] Repool pc1006 and pc2006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284123 (owner: 10Jcrespo) [02:54:21] !log ori@tin Synchronized php-1.27.0-wmf.21/includes/api/ApiStashEdit.php: Ie9799f5ea: Segment stash edit cache stats by basis for hit/miss (duration: 00m 39s) [02:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:00:59] Krinkle, does argon really still need to identify the network as irc.pmtpa.wikimedia.org? [03:02:49] 06Operations, 13Patch-For-Review, 07developer-notice, 07notice: Migrate argon (irc.wikimedia.org) to Jessie - https://phabricator.wikimedia.org/T123729#2216717 (10Krenair) >>! In T123729#2216681, @Krinkle wrote: > manually connect to kraz with IRC and verify e.g. `/join #en.wikipedia` and look for events.... 
[03:10:47] (03CR) 10Jcrespo: [C: 032] Repool pc1006 and pc2006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284123 (owner: 10Jcrespo) [03:12:32] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool pc2006 (duration: 00m 31s) [03:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:13:17] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool pc1006 (duration: 00m 28s) [03:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:14:09] "The MariaDB server is running with the --read-only option so it cannot execute this statement (10.192.16.170)" [03:14:57] /wiki/Main_Page [03:59:38] PROBLEM - Disk space on restbase1014 is CRITICAL: DISK CRITICAL - free space: /srv 185631 MB (3% inode=99%) [04:05:23] (03PS2) 10Krinkle: switchover: set mediawiki master datacenter to codfw [puppet] (switchover) - 10https://gerrit.wikimedia.org/r/282898 (owner: 10Giuseppe Lavagetto) [04:14:14] PROBLEM - HHVM rendering on mw1241 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.005 second response time [04:14:23] PROBLEM - Apache HTTP on mw1241 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.005 second response time [04:16:23] RECOVERY - HHVM rendering on mw1241 is OK: HTTP OK: HTTP/1.1 200 OK - 65907 bytes in 0.094 second response time [04:16:24] RECOVERY - Apache HTTP on mw1241 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.044 second response time [04:35:53] PROBLEM - HHVM rendering on mw1148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:37:14] PROBLEM - Apache HTTP on mw1148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:37:15] PROBLEM - Check size of conntrack table on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:37:34] PROBLEM - HHVM processes on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:37:43] PROBLEM - salt-minion processes on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:38:33] PROBLEM - RAID on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:38:34] PROBLEM - configured eth on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:38:44] PROBLEM - SSH on mw1148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:39:35] RECOVERY - salt-minion processes on mw1148 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [04:40:43] RECOVERY - configured eth on mw1148 is OK: OK - interfaces up [04:41:44] RECOVERY - HHVM processes on mw1148 is OK: PROCS OK: 6 processes with command name hhvm [04:42:23] PROBLEM - DPKG on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:44:54] PROBLEM - nutcracker port on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:45:34] PROBLEM - dhclient process on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:45:54] PROBLEM - salt-minion processes on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:47:03] PROBLEM - Disk space on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:47:53] PROBLEM - HHVM processes on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:47:55] PROBLEM - nutcracker process on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:50:54] PROBLEM - configured eth on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[05:01:24] RECOVERY - Disk space on mw1148 is OK: DISK OK [05:01:44] RECOVERY - SSH on mw1148 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [05:01:44] RECOVERY - nutcracker port on mw1148 is OK: TCP OK - 0.000 second response time on port 11212 [05:01:54] RECOVERY - DPKG on mw1148 is OK: All packages OK [05:02:15] RECOVERY - RAID on mw1148 is OK: OK: no RAID installed [05:02:25] RECOVERY - configured eth on mw1148 is OK: OK - interfaces up [05:02:46] RECOVERY - Apache HTTP on mw1148 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.052 second response time [05:03:04] RECOVERY - Check size of conntrack table on mw1148 is OK: OK: nf_conntrack is 12 % full [05:03:16] RECOVERY - HHVM processes on mw1148 is OK: PROCS OK: 6 processes with command name hhvm [05:03:34] RECOVERY - HHVM rendering on mw1148 is OK: HTTP OK: HTTP/1.1 200 OK - 68330 bytes in 0.498 second response time [05:04:05] RECOVERY - dhclient process on mw1148 is OK: PROCS OK: 0 processes with command name dhclient [05:04:44] RECOVERY - nutcracker process on mw1148 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:05:15] RECOVERY - salt-minion processes on mw1148 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [05:14:01] 06Operations, 10Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#2216853 (10Nemo_bis) OTRS can be used almost like a mailing list, if all members of a queue set up the notifications for it. Then you have archives and triaging. I'm not saying it's a good soluti... [05:16:22] 06Operations, 06Labs, 10Tool-Labs, 10Traffic, and 2 others: Detect tools.wmflabs.org tools which are HTTP-only - https://phabricator.wikimedia.org/T128409#2216857 (10Nemo_bis) By "this" I assume you mean the list above? I'd like more comments on the methods proposed in the description. 
[05:22:07] (03PS1) 10Ori.livneh: Increase php memory-limit for ganglia-web from 256 to 768 [puppet] - 10https://gerrit.wikimedia.org/r/284129 [05:32:48] 06Operations: create a mailing list for maint-announce mail - https://phabricator.wikimedia.org/T132968#2216872 (10RobH) set to not announce itself on the mailing lists main page, as we wont allow anyone to subscribe to it. we should also set it to not allow anyone to post without moderation, and then we can wh... [05:33:46] RECOVERY - puppet last run on mw1148 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [05:37:41] (03PS1) 10Ori.livneh: ganglia-web: don't send Content-Disposition header with JSON / CSV data [puppet] - 10https://gerrit.wikimedia.org/r/284130 [05:38:42] (03PS2) 10Ori.livneh: Increase php memory-limit for ganglia-web from 256 to 768 [puppet] - 10https://gerrit.wikimedia.org/r/284129 [05:38:50] (03CR) 10jenkins-bot: [V: 04-1] ganglia-web: don't send Content-Disposition header with JSON / CSV data [puppet] - 10https://gerrit.wikimedia.org/r/284130 (owner: 10Ori.livneh) [05:39:56] (03PS2) 10Ori.livneh: ganglia-web: don't send Content-Disposition header with JSON / CSV data [puppet] - 10https://gerrit.wikimedia.org/r/284130 [05:40:20] (03CR) 10Ori.livneh: [C: 032] Increase php memory-limit for ganglia-web from 256 to 768 [puppet] - 10https://gerrit.wikimedia.org/r/284129 (owner: 10Ori.livneh) [05:40:45] (03PS3) 10Ori.livneh: ganglia-web: don't send Content-Disposition header with JSON / CSV data [puppet] - 10https://gerrit.wikimedia.org/r/284130 [05:41:54] (03CR) 10Matanya: [C: 031] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284087 (https://phabricator.wikimedia.org/T132972) (owner: 10Eranroz) [05:42:15] (03CR) 10Ori.livneh: [C: 032] ganglia-web: don't send Content-Disposition header with JSON / CSV data [puppet] - 10https://gerrit.wikimedia.org/r/284130 (owner: 10Ori.livneh) [06:05:38] (03PS1) 10Ori.livneh: Revert "ganglia-web: don't send 
Content-Disposition header with JSON / CSV data" [puppet] - 10https://gerrit.wikimedia.org/r/284131 [06:05:57] (03CR) 10Ori.livneh: [C: 032 V: 032] Revert "ganglia-web: don't send Content-Disposition header with JSON / CSV data" [puppet] - 10https://gerrit.wikimedia.org/r/284131 (owner: 10Ori.livneh) [06:29:06] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: puppet fail [06:30:35] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:44] PROBLEM - puppet last run on wtp1008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:14] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:24] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:35] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:44] PROBLEM - puppet last run on nobelium is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:45] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:05] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: puppet fail [06:32:06] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:35] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:45] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:54] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:54] PROBLEM - puppet last run on rdb1005 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:35] PROBLEM - puppet last run on mw2077 is CRITICAL: CRITICAL: Puppet has 1 failures [06:37:39] 06Operations, 13Patch-For-Review: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2216915 (10elukey) [06:37:40] 06Operations, 10Analytics, 10Traffic: cronspam from cpXXXX hosts related to varnishkafka non existent processes - https://phabricator.wikimedia.org/T132346#2216914 
(10elukey) 05Resolved>03Open [06:37:53] 06Operations, 10Analytics, 10Traffic: cronspam from cpXXXX hosts related to varnishkafka non existent processes - https://phabricator.wikimedia.org/T132346#2195218 (10elukey) Closed it too soon, I can see the root@ notifications again :( ``` elukey@cp4003:~$ ls /etc/logrotate.d/varnishkafka* /etc/logrotate.... [06:38:34] PROBLEM - puppet last run on db1047 is CRITICAL: CRITICAL: puppet fail [06:46:31] (03PS1) 10Elukey: Add delaycompress to ganglia-web's logrotate to avoid daily cronspam. [puppet] - 10https://gerrit.wikimedia.org/r/284133 (https://phabricator.wikimedia.org/T132324) [06:55:25] RECOVERY - puppet last run on wtp1008 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:55:55] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:56:16] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:56:45] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:57:16] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:57:25] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:25] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:57:26] RECOVERY - puppet last run on rdb1005 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:57:34] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:58:05] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:58:15] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, 
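The `delaycompress` change above works because logrotate otherwise compresses a log file that the daemon may still be writing to, producing the daily cron mail. A hedged sketch of such a stanza (the path and the other directives are assumptions, not the actual ganglia-web config):

```
/var/log/ganglia-web/*.log {
    daily
    rotate 7
    compress
    delaycompress   # leave the newest rotation uncompressed until the
                    # next cycle, so an open file handle is never gzipped
    missingok
    notifempty
}
```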
last run 49 seconds ago with 0 failures [06:58:25] RECOVERY - puppet last run on nobelium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:34] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:39] (03PS4) 10Muehlenhoff: debdeploy: rename init.pp to master.pp to match class name [puppet] - 10https://gerrit.wikimedia.org/r/284082 (owner: 10Dzahn) [06:58:55] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:15] RECOVERY - puppet last run on mw2077 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:03:16] (03CR) 10Muehlenhoff: [C: 032 V: 032] debdeploy: rename init.pp to master.pp to match class name [puppet] - 10https://gerrit.wikimedia.org/r/284082 (owner: 10Dzahn) [07:03:35] RECOVERY - puppet last run on db1047 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [07:32:19] 06Operations, 10ops-codfw, 06Labs: labtestneutron2001.codfw.wmnet does not appear to be reachable - https://phabricator.wikimedia.org/T132302#2216978 (10MoritzMuehlenhoff) I'd say let's either reimage it or drop it from site.pp until reimaged. [07:34:14] (03CR) 10Giuseppe Lavagetto: [C: 031] Install pt-heartbeat-wikimedia on all relevant servers [puppet] - 10https://gerrit.wikimedia.org/r/283979 (owner: 10Jcrespo) [07:43:56] (03PS2) 10Muehlenhoff: Setup meitnerium as the jessie-based archiva host [puppet] - 10https://gerrit.wikimedia.org/r/283956 [08:03:01] (03PS2) 10Elukey: Add delaycompress to ganglia-web's logrotate to avoid daily cronspam. 
[puppet] - 10https://gerrit.wikimedia.org/r/284133 (https://phabricator.wikimedia.org/T132324) [08:16:30] 06Operations, 06Labs, 10Tool-Labs, 10Traffic, and 2 others: Detect tools.wmflabs.org tools which are HTTP-only - https://phabricator.wikimedia.org/T128409#2073537 (10valhallasw) > * make a list of tools.wmflabs.org URLs and test them all for unsecure resources with a simple URL fetching script; > * some sm... [08:24:06] 06Operations, 10ops-eqiad, 10hardware-requests: connect an external harddisk with >2TB space to stat1001 - https://phabricator.wikimedia.org/T132476#2217091 (10elukey) stat1004 is a new server just created with tons of space: ``` elukey@stat1004:~$ df -h Filesystem Size Used Avail Use% Mounted on udev... [08:31:33] 06Operations, 06Labs, 10Tool-Labs, 10Traffic, and 2 others: Detect tools.wmflabs.org tools which are HTTP-only - https://phabricator.wikimedia.org/T128409#2217117 (10Magnus) As a side note, I set up a VM for my PetScan tool: http://petscan.wmflabs.org/ This does not require http, but works for either, as... 
[08:36:11] 06Operations, 10Beta-Cluster-Infrastructure, 06Labs, 10Labs-Infrastructure: On deployment-prep, add warning text + labs Term of Uses link to the motd files - https://phabricator.wikimedia.org/T100837#2217130 (10hashar) p:05High>03Low [08:40:37] (03PS1) 10Muehlenhoff: Blacklist usbip kernel modules [puppet] - 10https://gerrit.wikimedia.org/r/284138 [08:41:27] (03PS7) 10Jcrespo: Install pt-heartbeat-wikimedia on all relevant servers [puppet] - 10https://gerrit.wikimedia.org/r/283979 [08:43:31] (03CR) 10Jcrespo: [C: 032] Install pt-heartbeat-wikimedia on all relevant servers [puppet] - 10https://gerrit.wikimedia.org/r/283979 (owner: 10Jcrespo) [08:44:48] 06Operations, 10Beta-Cluster-Infrastructure: HHVM core dumps in Beta Cluster - https://phabricator.wikimedia.org/T1259#2217138 (10hashar) 05Open>03Resolved I have deleted `/data/project/core` and created `/data/project/cores` which is where /proc/sys/kernel/core_pattern points to. Should be fine now. [08:46:36] (03PS2) 10Volans: MariaDB: configurations for codfw as primary [puppet] - 10https://gerrit.wikimedia.org/r/283771 (https://phabricator.wikimedia.org/T111654) [08:50:03] (03CR) 10Filippo Giunchedi: [C: 031] shinken: Allow undefined data in graphite for disk space checks [puppet] - 10https://gerrit.wikimedia.org/r/283779 (https://phabricator.wikimedia.org/T111540) (owner: 10Alex Monk) [08:50:04] PROBLEM - puppet last run on db1034 is CRITICAL: CRITICAL: puppet fail [08:50:25] PROBLEM - puppet last run on db1051 is CRITICAL: CRITICAL: puppet fail [08:50:30] mmm [08:50:51] jynus: I can take a look ^^ [08:51:05] PROBLEM - puppet last run on db2029 is CRITICAL: CRITICAL: puppet fail [08:51:07] no, it is my change [08:51:26] yes, Exec[pt-heartbeat-kill] [08:51:34] PROBLEM - puppet last run on db2047 is CRITICAL: CRITICAL: puppet fail [08:51:34] 'kill -TERM $(cat /var/run/pt-heartbeat.pid)' is not qualified and no path was specified. Please qualify the command or specify a path. 
[08:51:36] PROBLEM - puppet last run on db1027 is CRITICAL: CRITICAL: puppet fail [08:51:45] PROBLEM - puppet last run on db1068 is CRITICAL: CRITICAL: puppet fail [08:51:45] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: puppet fail [08:51:55] PROBLEM - puppet last run on db1042 is CRITICAL: CRITICAL: puppet fail [08:52:15] PROBLEM - puppet last run on db2038 is CRITICAL: CRITICAL: puppet fail [08:52:15] PROBLEM - puppet last run on db2065 is CRITICAL: CRITICAL: puppet fail [08:52:34] PROBLEM - puppet last run on db1016 is CRITICAL: CRITICAL: puppet fail [08:53:25] PROBLEM - puppet last run on db2041 is CRITICAL: CRITICAL: puppet fail [08:53:25] PROBLEM - puppet last run on db1024 is CRITICAL: CRITICAL: puppet fail [08:53:47] PROBLEM - puppet last run on db1026 is CRITICAL: CRITICAL: puppet fail [08:53:55] PROBLEM - puppet last run on db1071 is CRITICAL: CRITICAL: puppet fail [08:54:05] PROBLEM - puppet last run on db2037 is CRITICAL: CRITICAL: puppet fail [08:54:05] PROBLEM - puppet last run on db2057 is CRITICAL: CRITICAL: puppet fail [08:54:07] PROBLEM - puppet last run on db1076 is CRITICAL: CRITICAL: puppet fail [08:54:15] PROBLEM - puppet last run on db2061 is CRITICAL: CRITICAL: puppet fail [08:54:15] PROBLEM - puppet last run on es1016 is CRITICAL: CRITICAL: puppet fail [08:54:15] PROBLEM - puppet last run on db1055 is CRITICAL: CRITICAL: puppet fail [08:54:24] PROBLEM - puppet last run on pc2005 is CRITICAL: CRITICAL: puppet fail [08:54:34] PROBLEM - puppet last run on db1062 is CRITICAL: CRITICAL: puppet fail [08:55:04] PROBLEM - puppet last run on db2023 is CRITICAL: CRITICAL: puppet fail [08:56:06] PROBLEM - puppet last run on db1063 is CRITICAL: CRITICAL: puppet fail [08:56:15] PROBLEM - puppet last run on es2012 is CRITICAL: CRITICAL: puppet fail [08:56:55] PROBLEM - puppet last run on db1074 is CRITICAL: CRITICAL: puppet fail [08:56:57] PROBLEM - puppet last run on db2050 is CRITICAL: CRITICAL: puppet fail [08:56:57] PROBLEM - puppet 
last run on pc2006 is CRITICAL: CRITICAL: puppet fail [08:57:14] PROBLEM - puppet last run on db1070 is CRITICAL: CRITICAL: puppet fail [08:57:15] PROBLEM - puppet last run on es2016 is CRITICAL: CRITICAL: puppet fail [08:57:21] (03PS1) 10Jcrespo: Add full path for kill command [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/284139 [08:57:35] PROBLEM - puppet last run on db2011 is CRITICAL: CRITICAL: puppet fail [08:57:45] PROBLEM - puppet last run on pc1004 is CRITICAL: CRITICAL: Puppet has 1 failures [08:57:45] PROBLEM - puppet last run on db1053 is CRITICAL: CRITICAL: puppet fail [08:57:45] PROBLEM - puppet last run on db2067 is CRITICAL: CRITICAL: puppet fail [08:58:00] (03CR) 10Volans: [C: 031] Add full path for kill command [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/284139 (owner: 10Jcrespo) [08:58:05] PROBLEM - puppet last run on es2014 is CRITICAL: CRITICAL: puppet fail [08:58:16] PROBLEM - puppet last run on db1073 is CRITICAL: CRITICAL: puppet fail [08:58:40] (03CR) 10Jcrespo: [C: 032] Add full path for kill command [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/284139 (owner: 10Jcrespo) [08:58:46] PROBLEM - puppet last run on db2034 is CRITICAL: CRITICAL: puppet fail [08:58:55] PROBLEM - puppet last run on db1077 is CRITICAL: CRITICAL: puppet fail [08:58:59] (03PS1) 10Gehel: WIP - Use unicast instead of multicast for Elasticsearch node communication [puppet] - 10https://gerrit.wikimedia.org/r/284140 (https://phabricator.wikimedia.org/T110236) [08:59:24] PROBLEM - puppet last run on db1045 is CRITICAL: CRITICAL: puppet fail [08:59:25] PROBLEM - puppet last run on db1072 is CRITICAL: CRITICAL: puppet fail [08:59:35] PROBLEM - puppet last run on db2060 is CRITICAL: CRITICAL: puppet fail [08:59:54] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: puppet fail [09:00:15] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: puppet fail [09:00:34] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: puppet fail [09:00:44] 
PROBLEM - puppet last run on es2013 is CRITICAL: CRITICAL: puppet fail [09:00:57] (03PS1) 10Jcrespo: Correct mariadb error due to missing full patch [puppet] - 10https://gerrit.wikimedia.org/r/284141 [09:01:04] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: puppet fail [09:01:05] PROBLEM - puppet last run on db2064 is CRITICAL: CRITICAL: puppet fail [09:01:08] (03PS2) 10Jcrespo: Correct mariadb error due to missing full patch [puppet] - 10https://gerrit.wikimedia.org/r/284141 [09:01:14] PROBLEM - puppet last run on db1056 is CRITICAL: CRITICAL: puppet fail [09:01:25] PROBLEM - puppet last run on db2058 is CRITICAL: CRITICAL: puppet fail [09:01:36] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: puppet fail [09:01:36] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: puppet fail [09:01:50] (03PS2) 10Filippo Giunchedi: graphite: add graphite1003 [puppet] - 10https://gerrit.wikimedia.org/r/283989 [09:01:58] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite: add graphite1003 [puppet] - 10https://gerrit.wikimedia.org/r/283989 (owner: 10Filippo Giunchedi) [09:02:02] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: puppet fail [09:02:04] PROBLEM - puppet last run on es2018 is CRITICAL: CRITICAL: puppet fail [09:02:12] PROBLEM - puppet last run on db1039 is CRITICAL: CRITICAL: puppet fail [09:02:22] PROBLEM - puppet last run on db1049 is CRITICAL: CRITICAL: puppet fail [09:02:22] PROBLEM - puppet last run on db1060 is CRITICAL: CRITICAL: puppet fail [09:02:24] (03PS3) 10Jcrespo: Correct mariadb error due to missing full patch [puppet] - 10https://gerrit.wikimedia.org/r/284141 [09:03:03] PROBLEM - puppet last run on db1041 is CRITICAL: CRITICAL: puppet fail [09:03:13] PROBLEM - puppet last run on db2062 is CRITICAL: CRITICAL: puppet fail [09:03:33] PROBLEM - puppet last run on db1054 is CRITICAL: CRITICAL: puppet fail [09:03:33] PROBLEM - puppet last run on db2066 is CRITICAL: CRITICAL: puppet fail [09:03:34] PROBLEM - puppet 
last run on pc1006 is CRITICAL: CRITICAL: Puppet has 1 failures [09:05:03] PROBLEM - puppet last run on db2063 is CRITICAL: CRITICAL: puppet fail [09:05:22] PROBLEM - puppet last run on db1030 is CRITICAL: CRITICAL: puppet fail [09:05:23] PROBLEM - puppet last run on es2017 is CRITICAL: CRITICAL: puppet fail [09:05:23] PROBLEM - puppet last run on db2049 is CRITICAL: CRITICAL: puppet fail [09:05:32] PROBLEM - puppet last run on db2068 is CRITICAL: CRITICAL: puppet fail [09:06:42] !log stop compactions on restbase1014 [09:06:43] PROBLEM - puppet last run on db1065 is CRITICAL: CRITICAL: puppet fail [09:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:06:52] RECOVERY - Disk space on restbase1014 is OK: DISK OK [09:07:12] PROBLEM - puppet last run on db2030 is CRITICAL: CRITICAL: puppet fail [09:07:12] PROBLEM - puppet last run on es2015 is CRITICAL: CRITICAL: puppet fail [09:07:13] PROBLEM - puppet last run on db2016 is CRITICAL: CRITICAL: puppet fail [09:07:32] PROBLEM - puppet last run on db1061 is CRITICAL: CRITICAL: puppet fail [09:07:32] PROBLEM - puppet last run on db2053 is CRITICAL: CRITICAL: puppet fail [09:07:33] PROBLEM - puppet last run on db2035 is CRITICAL: CRITICAL: puppet fail [09:07:33] PROBLEM - puppet last run on db2051 is CRITICAL: CRITICAL: puppet fail [09:08:03] PROBLEM - puppet last run on es2011 is CRITICAL: CRITICAL: puppet fail [09:08:03] PROBLEM - puppet last run on es2019 is CRITICAL: CRITICAL: puppet fail [09:08:23] PROBLEM - puppet last run on db2012 is CRITICAL: CRITICAL: puppet fail [09:08:23] PROBLEM - puppet last run on db1047 is CRITICAL: CRITICAL: puppet fail [09:08:42] PROBLEM - puppet last run on pc1005 is CRITICAL: CRITICAL: Puppet has 1 failures [09:09:02] PROBLEM - puppet last run on db2028 is CRITICAL: CRITICAL: puppet fail [09:09:03] PROBLEM - puppet last run on db2019 is CRITICAL: CRITICAL: puppet fail [09:09:22] PROBLEM - puppet last run on db1057 is CRITICAL: CRITICAL: puppet 
fail [09:09:32] PROBLEM - puppet last run on db1078 is CRITICAL: CRITICAL: puppet fail [09:09:33] PROBLEM - puppet last run on db1035 is CRITICAL: CRITICAL: puppet fail [09:10:12] PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: puppet fail [09:10:23] PROBLEM - puppet last run on db2040 is CRITICAL: CRITICAL: puppet fail [09:10:32] PROBLEM - puppet last run on db2036 is CRITICAL: CRITICAL: puppet fail [09:10:52] PROBLEM - puppet last run on db2017 is CRITICAL: CRITICAL: puppet fail [09:10:52] PROBLEM - puppet last run on db2052 is CRITICAL: CRITICAL: puppet fail [09:10:53] PROBLEM - puppet last run on db2010 is CRITICAL: CRITICAL: puppet fail [09:11:02] PROBLEM - puppet last run on db1037 is CRITICAL: CRITICAL: puppet fail [09:11:52] (03CR) 10Muehlenhoff: WIP - Use unicast instead of multicast for Elasticsearch node communication (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/284140 (https://phabricator.wikimedia.org/T110236) (owner: 10Gehel) [09:11:53] PROBLEM - puppet last run on es1011 is CRITICAL: CRITICAL: puppet fail [09:11:55] !log stop cassandra and restbase on restbase1006 [09:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:12:02] PROBLEM - puppet last run on db1019 is CRITICAL: CRITICAL: puppet fail [09:12:12] PROBLEM - puppet last run on db2042 is CRITICAL: CRITICAL: puppet fail [09:12:44] PROBLEM - puppet last run on db1048 is CRITICAL: CRITICAL: puppet fail [09:12:52] PROBLEM - puppet last run on db1075 is CRITICAL: CRITICAL: puppet fail [09:12:52] PROBLEM - puppet last run on db1036 is CRITICAL: CRITICAL: puppet fail [09:13:13] PROBLEM - puppet last run on es1018 is CRITICAL: CRITICAL: puppet fail [09:13:22] PROBLEM - puppet last run on es1012 is CRITICAL: CRITICAL: puppet fail [09:13:42] PROBLEM - puppet last run on db2043 is CRITICAL: CRITICAL: puppet fail [09:14:12] PROBLEM - puppet last run on es1017 is CRITICAL: CRITICAL: puppet fail [09:14:33] PROBLEM - puppet last run on es1013 
is CRITICAL: CRITICAL: puppet fail [09:14:53] PROBLEM - puppet last run on db2069 is CRITICAL: CRITICAL: puppet fail [09:15:04] (03CR) 10Jcrespo: [C: 032] Correct mariadb error due to missing full patch [puppet] - 10https://gerrit.wikimedia.org/r/284141 (owner: 10Jcrespo) [09:15:23] PROBLEM - puppet last run on db2048 is CRITICAL: CRITICAL: puppet fail [09:16:03] PROBLEM - puppet last run on db1064 is CRITICAL: CRITICAL: puppet fail [09:16:23] PROBLEM - puppet last run on db2046 is CRITICAL: CRITICAL: puppet fail [09:16:32] !log shutdown restbase100[56] [09:16:33] PROBLEM - puppet last run on pc2004 is CRITICAL: CRITICAL: puppet fail [09:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:17:23] RECOVERY - puppet last run on db1042 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [09:17:23] PROBLEM - puppet last run on db1031 is CRITICAL: CRITICAL: puppet fail [09:17:43] RECOVERY - puppet last run on db1068 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [09:17:53] PROBLEM - puppet last run on db2039 is CRITICAL: CRITICAL: puppet fail [09:17:53] RECOVERY - puppet last run on db2047 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [09:17:53] PROBLEM - puppet last run on db2059 is CRITICAL: CRITICAL: puppet fail [09:18:02] RECOVERY - puppet last run on es1016 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [09:18:02] PROBLEM - puppet last run on db2009 is CRITICAL: CRITICAL: puppet fail [09:18:14] PROBLEM - puppet last run on db1044 is CRITICAL: CRITICAL: puppet fail [09:18:34] PROBLEM - puppet last run on db2070 is CRITICAL: CRITICAL: puppet fail [09:18:34] RECOVERY - puppet last run on db2057 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [09:19:04] RECOVERY - puppet last run on db1027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures 
[09:19:04] PROBLEM - puppet last run on db1050 is CRITICAL: CRITICAL: puppet fail [09:19:13] RECOVERY - puppet last run on db2038 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:19:13] RECOVERY - puppet last run on db2029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:19:24] RECOVERY - puppet last run on db1071 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [09:19:33] RECOVERY - puppet last run on db1051 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:19:43] RECOVERY - puppet last run on db1026 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [09:19:52] RECOVERY - puppet last run on db1016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:20:02] RECOVERY - puppet last run on db2037 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [09:20:02] RECOVERY - puppet last run on db2061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:20:02] PROBLEM - puppet last run on db2008 is CRITICAL: CRITICAL: puppet fail [09:20:22] PROBLEM - puppet last run on es1014 is CRITICAL: CRITICAL: puppet fail [09:20:23] PROBLEM - puppet last run on db1022 is CRITICAL: CRITICAL: puppet fail [09:20:32] PROBLEM - puppet last run on db1066 is CRITICAL: CRITICAL: puppet fail [09:20:32] RECOVERY - puppet last run on db1076 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [09:20:32] RECOVERY - puppet last run on db2041 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [09:20:32] PROBLEM - puppet last run on db2045 is CRITICAL: CRITICAL: puppet fail [09:20:43] PROBLEM - puppet last run on db2054 is CRITICAL: CRITICAL: puppet fail [09:20:53] RECOVERY - puppet last run on db1024 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:21:02] RECOVERY - puppet 
last run on db2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:21:33] RECOVERY - puppet last run on pc2005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:21:44] RECOVERY - puppet last run on db1062 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:21:59] 06Operations, 10ops-codfw: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2217272 (10fgiunchedi) thanks @papaul, rows should be B/C/D, one in each, if possible not located in the same rack as existing restbase systems [09:22:12] RECOVERY - puppet last run on db1055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:22:12] RECOVERY - puppet last run on es2012 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [09:22:24] RECOVERY - puppet last run on db1074 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [09:22:43] RECOVERY - puppet last run on db2050 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [09:22:52] RECOVERY - puppet last run on pc2006 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [09:23:13] RECOVERY - puppet last run on db1070 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:23:22] RECOVERY - puppet last run on db1073 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [09:23:23] RECOVERY - puppet last run on es2016 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [09:23:33] RECOVERY - puppet last run on pc1004 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [09:23:53] RECOVERY - puppet last run on db1063 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:23:53] RECOVERY - puppet last run on db2011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 
0 failures [09:24:02] RECOVERY - puppet last run on db2034 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [09:24:13] RECOVERY - puppet last run on db1053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:24:32] RECOVERY - puppet last run on db1077 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [09:24:54] RECOVERY - puppet last run on es2014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:25:12] RECOVERY - puppet last run on db2067 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:25:34] RECOVERY - puppet last run on db1045 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [09:25:42] RECOVERY - puppet last run on db2060 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [09:25:42] RECOVERY - puppet last run on db1072 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:26:03] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [09:26:03] RECOVERY - puppet last run on db2064 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:26:18] (03PS2) 10Gehel: Use unicast instead of multicast for Elasticsearch node communication [puppet] - 10https://gerrit.wikimedia.org/r/284140 (https://phabricator.wikimedia.org/T110236) [09:26:33] RECOVERY - puppet last run on db1056 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [09:26:33] RECOVERY - puppet last run on pc1006 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [09:27:03] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:27:03] RECOVERY - puppet last run on db2058 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:27:33] RECOVERY 
- puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:27:34] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 03Discovery-Search-Sprint, and 2 others: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236#2217283 (10Gehel) Seems that 2 cluster restart are required to enable this change. Let's wait until the datacen... [09:27:42] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:27:43] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [09:28:02] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:28:02] RECOVERY - puppet last run on es2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:28:02] RECOVERY - puppet last run on db2062 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [09:28:03] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:28:04] RECOVERY - puppet last run on es2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:28:20] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 03Discovery-Search-Sprint, and 2 others: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236#2217303 (10Gehel) a:03Gehel [09:28:42] RECOVERY - puppet last run on db2068 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [09:28:43] RECOVERY - puppet last run on db1060 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [09:28:43] RECOVERY - puppet last run on db1049 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [09:29:54] RECOVERY - puppet last run on db1041 is OK: OK: Puppet 
is currently enabled, last run 15 seconds ago with 0 failures [09:30:32] RECOVERY - puppet last run on es2015 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [09:30:42] RECOVERY - puppet last run on db2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:30:43] RECOVERY - puppet last run on db1039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:30:53] RECOVERY - puppet last run on db2049 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:30:53] RECOVERY - puppet last run on db2066 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:30:53] RECOVERY - puppet last run on db2035 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [09:30:53] RECOVERY - puppet last run on db2051 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [09:31:33] RECOVERY - puppet last run on es2019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:32:51] (03PS1) 10Volans: MariaDB: set codfw local masters as masters (s1-s7) [puppet] - 10https://gerrit.wikimedia.org/r/284144 (https://phabricator.wikimedia.org/T124699) [09:32:57] RECOVERY - puppet last run on db1054 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:33:16] RECOVERY - puppet last run on db2030 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [09:33:16] RECOVERY - puppet last run on es2017 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:33:28] RECOVERY - puppet last run on db1065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:33:47] RECOVERY - puppet last run on db2028 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [09:34:06] RECOVERY - puppet last run on db2012 is OK: OK: Puppet is currently enabled, last run 59 seconds ago 
with 0 failures [09:34:06] RECOVERY - puppet last run on db1047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:34:36] RECOVERY - puppet last run on db1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:34:46] RECOVERY - puppet last run on pc1005 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [09:35:38] RECOVERY - puppet last run on db2019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:35:46] RECOVERY - puppet last run on db1057 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:35:46] RECOVERY - puppet last run on db1078 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [09:35:47] RECOVERY - puppet last run on db2063 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:35:58] RECOVERY - puppet last run on db1030 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:36:00] (03CR) 10Volans: MariaDB: set codfw local masters as masters (s1-s7) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/284144 (https://phabricator.wikimedia.org/T124699) (owner: 10Volans) [09:36:17] RECOVERY - puppet last run on db2053 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:36:17] RECOVERY - puppet last run on es2011 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:36:47] (03PS1) 10Filippo Giunchedi: cassandra: remove restbase100[56] [puppet] - 10https://gerrit.wikimedia.org/r/284145 [09:36:56] RECOVERY - puppet last run on db2052 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:36:56] RECOVERY - puppet last run on db2017 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [09:37:16] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures 
[09:37:17] RECOVERY - puppet last run on db1037 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:37:27] RECOVERY - puppet last run on db1035 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:37:56] RECOVERY - puppet last run on db2040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:38:06] RECOVERY - puppet last run on db2042 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [09:38:06] RECOVERY - puppet last run on db2036 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [09:38:16] RECOVERY - puppet last run on es1011 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:38:27] RECOVERY - puppet last run on db1048 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [09:38:35] (03PS2) 10Volans: MariaDB: set codfw local masters as masters (s1-s7) [puppet] - 10https://gerrit.wikimedia.org/r/284144 (https://phabricator.wikimedia.org/T124699) [09:38:38] RECOVERY - puppet last run on es1012 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [09:38:47] RECOVERY - puppet last run on db1075 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:38:48] RECOVERY - puppet last run on db1036 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [09:38:52] (03PS2) 10Filippo Giunchedi: cassandra: remove restbase100[56] [puppet] - 10https://gerrit.wikimedia.org/r/284145 (https://phabricator.wikimedia.org/T125842) [09:38:57] RECOVERY - puppet last run on db2010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:39:08] RECOVERY - puppet last run on db2043 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [09:39:27] RECOVERY - puppet last run on db2048 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures 
[09:39:37] RECOVERY - puppet last run on es1018 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [09:39:38] RECOVERY - puppet last run on es1017 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [09:39:46] RECOVERY - puppet last run on db1019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:39:57] RECOVERY - puppet last run on es1013 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [09:40:02] (03PS1) 10Filippo Giunchedi: remove restbase100[56] [dns] - 10https://gerrit.wikimedia.org/r/284146 (https://phabricator.wikimedia.org/T125842) [09:41:17] RECOVERY - puppet last run on pc2004 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [09:43:17] RECOVERY - puppet last run on db2069 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:43:47] RECOVERY - puppet last run on db2046 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:44:06] RECOVERY - puppet last run on db2009 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [09:44:08] RECOVERY - puppet last run on db1064 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:44:16] RECOVERY - puppet last run on db2070 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:44:27] RECOVERY - puppet last run on db1044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:44:58] RECOVERY - puppet last run on db2045 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [09:45:06] RECOVERY - puppet last run on db1050 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [09:45:17] RECOVERY - puppet last run on db1022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:45:17] RECOVERY - puppet last run on db2054 is OK: OK: Puppet 
is currently enabled, last run 15 seconds ago with 0 failures [09:45:27] RECOVERY - puppet last run on db2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:45:27] RECOVERY - puppet last run on db2059 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:45:36] RECOVERY - puppet last run on db1066 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:45:37] RECOVERY - puppet last run on db2039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:45:57] RECOVERY - puppet last run on db1031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:45:57] RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [09:46:37] RECOVERY - puppet last run on es1014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:47:16] RECOVERY - puppet last run on db2065 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [09:47:56] RECOVERY - puppet last run on db1034 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:48:03] It's exploding [09:48:07] :D [09:48:20] (the channel I mean) [09:48:27] (03PS1) 10Jcrespo: Update dns records to get the current state [dns] - 10https://gerrit.wikimedia.org/r/284148 [09:52:29] (03CR) 10Volans: "Results from puppet compiler: https://puppet-compiler.wmflabs.org/2501/" [puppet] - 10https://gerrit.wikimedia.org/r/283771 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [09:55:49] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: remove restbase100[56] [puppet] - 10https://gerrit.wikimedia.org/r/284145 (https://phabricator.wikimedia.org/T125842) (owner: 10Filippo Giunchedi) [09:59:09] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] remove restbase100[56] [dns] - 10https://gerrit.wikimedia.org/r/284146 (https://phabricator.wikimedia.org/T125842) (owner: 
10Filippo Giunchedi) [10:01:03] 06Operations, 10RESTBase, 13Patch-For-Review: install restbase1010-restbase1015 - https://phabricator.wikimedia.org/T128107#2217422 (10fgiunchedi) @Cmjohnson I've deprovisioned restbase1005 and restbase1006 and both are shutdown, should be enough disks to get restbase1015 going now, thanks! [10:03:29] (03PS2) 10Jcrespo: Update dns records to get the current state [dns] - 10https://gerrit.wikimedia.org/r/284148 [10:05:23] (03CR) 10Jcrespo: [C: 032] Update dns records to get the current state [dns] - 10https://gerrit.wikimedia.org/r/284148 (owner: 10Jcrespo) [10:07:05] ACKNOWLEDGEMENT - puppet last run on restbase2004 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi cassandra-b masked [10:07:54] !log updated dns entries about mysql masters [10:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:10:47] (03CR) 10Volans: "Puppet changes here: https://puppet-compiler.wmflabs.org/2502/" [puppet] - 10https://gerrit.wikimedia.org/r/284144 (https://phabricator.wikimedia.org/T124699) (owner: 10Volans) [10:13:44] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [10:15:44] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5010135 keys - replication_delay is 0 [10:17:39] (03PS3) 10Muehlenhoff: Setup meitnerium as the jessie-based archiva host [puppet] - 10https://gerrit.wikimedia.org/r/283956 [10:20:02] (03CR) 10Muehlenhoff: [C: 032 V: 032] Setup meitnerium as the jessie-based archiva host [puppet] - 10https://gerrit.wikimedia.org/r/283956 (owner: 10Muehlenhoff) [10:27:27] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review: setup stat1004/WMF4721 for hadoop client usage - https://phabricator.wikimedia.org/T131877#2217463 (10elukey) 05Open>03Resolved [10:27:29] 06Operations, 10hardware-requests: +1 'stat' type box for hadoop client 
usage - https://phabricator.wikimedia.org/T128808#2217464 (10elukey) [10:34:58] (03CR) 10Gehel: Use unicast instead of multicast for Elasticsearch node communication (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/284140 (https://phabricator.wikimedia.org/T110236) (owner: 10Gehel) [10:43:25] (03PS1) 10Ori.livneh: Configure ganglia-web to cache data in a location it can actually write to [puppet] - 10https://gerrit.wikimedia.org/r/284155 [10:43:28] ^ paravoid [10:43:56] lol [10:44:58] (03CR) 10Faidon Liambotis: [C: 032] Configure ganglia-web to cache data in a location it can actually write to [puppet] - 10https://gerrit.wikimedia.org/r/284155 (owner: 10Ori.livneh) [10:47:06] (03CR) 10QChris: [C: 032] "Yup, that deb looks good." [debs/gerrit] - 10https://gerrit.wikimedia.org/r/263631 (owner: 10Chad) [10:55:26] (03PS1) 10Jcrespo: Set codfw databases in read-write [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284157 [10:56:39] (03CR) 10Ori.livneh: [C: 031] Set codfw databases in read-write [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284157 (owner: 10Jcrespo) [11:20:34] (03PS1) 10Muehlenhoff: rcstream: Update source range [puppet] - 10https://gerrit.wikimedia.org/r/284161 [11:21:50] (03CR) 10jenkins-bot: [V: 04-1] rcstream: Update source range [puppet] - 10https://gerrit.wikimedia.org/r/284161 (owner: 10Muehlenhoff) [11:22:28] 06Operations, 10Analytics, 10Traffic: cronspam from cpXXXX hosts related to varnishkafka non existent processes - https://phabricator.wikimedia.org/T132346#2217535 (10BBlack) I've killed the rest of them, I think. 
I'll let you confirm->close this time :) [11:29:06] (03PS2) 10Muehlenhoff: rcstream: Update source range [puppet] - 10https://gerrit.wikimedia.org/r/284161 [11:33:04] !log changing binlog_format to STATEMENT for codfw masters for shards s1-s7 T124699 [11:33:05] T124699: Change configuration to make codfw db masters as the masters of all datacenters - https://phabricator.wikimedia.org/T124699 [11:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:51:46] (03PS3) 10Gehel: Use unicast instead of multicast for Elasticsearch node communication [puppet] - 10https://gerrit.wikimedia.org/r/284140 (https://phabricator.wikimedia.org/T110236) [11:58:31] (03CR) 10Muehlenhoff: Use unicast instead of multicast for Elasticsearch node communication (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/284140 (https://phabricator.wikimedia.org/T110236) (owner: 10Gehel) [12:03:28] (03CR) 10Gehel: Use unicast instead of multicast for Elasticsearch node communication (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/284140 (https://phabricator.wikimedia.org/T110236) (owner: 10Gehel) [12:04:01] 06Operations, 10Mail, 10OTRS, 10WMDE-Fundraising-Software: add WMDE mx's to SpamAssassin trusted hosts to fix SPF softfails - https://phabricator.wikimedia.org/T83499#2217676 (10JanZerebecki) 05Open>03Resolved a:03JanZerebecki As shown above it might now be fixed. If my memory serves me right @silke... [12:07:00] (03PS4) 10Gehel: Use unicast instead of multicast for Elasticsearch node communication [puppet] - 10https://gerrit.wikimedia.org/r/284140 (https://phabricator.wikimedia.org/T110236) [12:09:42] (03CR) 10Muehlenhoff: "@QChris: Which version of bouncycastle is now required?" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/263631 (owner: 10Chad) [12:12:31] (03CR) 10QChris: "> @QChris: Which version of bouncycastle is now required?" 
[debs/gerrit] - 10https://gerrit.wikimedia.org/r/263631 (owner: 10Chad) [12:13:50] (03CR) 10Paladox: "@QChris needs v+2 please." [debs/gerrit] - 10https://gerrit.wikimedia.org/r/263631 (owner: 10Chad) [12:15:38] (03CR) 10QChris: "> @QChris needs v+2 please." [debs/gerrit] - 10https://gerrit.wikimedia.org/r/263631 (owner: 10Chad) [12:16:56] 06Operations, 10hardware-requests: rack and set up graphite1003 - https://phabricator.wikimedia.org/T132717#2217707 (10fgiunchedi) a:05Cmjohnson>03fgiunchedi setting this up with jessie now [12:18:17] (03CR) 10Paladox: "Ok ok but I thought Jenkins would need to be re run for c+2 to take affect meaning Jenkins runs in gate and submit." [debs/gerrit] - 10https://gerrit.wikimedia.org/r/263631 (owner: 10Chad) [12:25:13] (03CR) 10Addshore: "@paladox not if there is no gate and submit on this repo" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/263631 (owner: 10Chad) [12:28:37] PROBLEM - MariaDB Slave SQL: s6 on db2046 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1449, Errmsg: Error The user specified as a definer (root@208.80.154.151) does not exist on query. Default database: ruwiki. Query: DELETE /* GeoData\Hooks::doLinksUpdate 127.0.0.1 */ FROM geo_tags WHERE gt_id = 81263668 [12:28:46] on it^^^ [12:29:17] PROBLEM - MariaDB Slave Lag: s6 on db2046 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 321.34 seconds [13:03:48] RECOVERY - MariaDB Slave Lag: s6 on db2046 is OK: OK slave_sql_lag Replication lag: 0.47 seconds [13:04:08] RECOVERY - MariaDB Slave SQL: s6 on db2046 is OK: OK slave_sql_state Slave_SQL_Running: Yes [13:17:49] Wikimedia Platform operations, serious stuff | Status: Up; codfw switchover / read-only at 14:00 UTC | Log: https://bit.ly/wikitech | Channel logs: http://ur1.ca/edq22 | Ops Clinic Duty: akosiaris [13:20:57] ori: how long is read-only? 
btw [13:21:42] <_joe_> c: the least amount of time possible [13:21:55] <_joe_> we hope to do it within 30 minutes [13:22:15] (03PS5) 10Addshore: WIP DRAFT WMDE_Analytics module [puppet] - 10https://gerrit.wikimedia.org/r/269467 [13:22:38] addshore: WIP DRAFT CALM DOWN J/K WMDE_Analytics module [13:22:45] :P [13:22:49] <_joe_> lol [13:22:56] should we do read-only at 14? or whenever it hits? [13:23:08] heh, accidently published that instead of actually keeping it a draft ;) [13:23:13] WIP DNM NOT FINISHED YET DRAFT WMDE ANALYTICS DNM!!! [13:23:29] <_joe_> jynus: we should do read-only when we're ready, it would be better if we're ready by 14:00 [13:24:10] <_joe_> so I guess the first items on the list should be done a few minutes earlier [13:24:12] the later of 14:00 and being ready :) [13:24:14] yes [13:24:15] so, the actual question was, should we start-not-so user impactin changes before that time? [13:24:17] <_joe_> that's me and ori [13:24:19] yes [13:24:33] ah, you answered before I ended up my question :-) [13:25:18] _joe_: I'll stop the jobrunners in eqiad at 13:55 [13:25:25] <_joe_> cool [13:25:52] (03PS1) 10Giuseppe Lavagetto: switchover: stop jobrunners in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/284181 [13:26:28] (03PS1) 10Giuseppe Lavagetto: switchover: block maintenance scripts from running in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/284182 [13:27:30] (03CR) 10Hoo man: [C: 04-1] "I still don't think we should have inline queries, but that's a PM decision in the end." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284091 (https://phabricator.wikimedia.org/T126741) (owner: 10Yurik) [13:31:44] _joe_, if you remove your -2 on https://gerrit.wikimedia.org/r/#/c/282904/ I'll +2 that. [13:32:06] <_joe_> subbu: we should deploy it when we've switched over mediawiki though [13:32:12] yes. 
[13:32:17] 06Operations, 10ops-codfw: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959#2217917 (10Ottomata) a:05Ottomata>03elukey [13:32:20] i'll wait for a go before deploying. [13:32:52] 06Operations, 10ops-codfw, 06DC-Ops, 10EventBus, and 3 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#2217919 (10Ottomata) a:05Ottomata>03elukey [13:33:32] 06Operations, 10ops-codfw: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959#2184249 (10Joe) @elukey don't install etcd on these machines for now, we need to come up with a good plan for that. [13:40:31] so, we'll follow https://wikitech.wikimedia.org/wiki/Switch_Datacenter#MediaWiki-related [13:40:32] (03PS1) 10Jcrespo: Depool one db server from each shard as a backup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284183 [13:40:53] with 0. being what jynus is doing now :P [13:41:30] sanity check here^, I need [13:41:47] I'll follow the script and announce every step here [13:41:52] <_joe_> paravoid: ok [13:42:11] <_joe_> paravoid: some steps can be done in parallel, like 2 and 3 [13:43:02] yup [13:43:03] <_joe_> and 7,8,9 and 10 as well [13:43:08] <_joe_> and then 11,12 [13:43:30] (7) is lacking the execution step btw [13:43:33] * MarkTraceur shuffles into his seat with a big foam finger [13:43:39] <_joe_> sorry, 11 and 12 [13:43:48] we're not going to warm up memcached (step 1b) only to wipe it in step 6, right? 
I think step 1 should exclude memcached [13:43:55] yes [13:43:59] that is obsolete now [13:44:02] <_joe_> paravoid: 7 has instructions [13:44:03] with replication [13:44:03] editing [13:44:13] I mean, we could do it [13:44:25] but I am not going to do it because it will have 0 impact [13:44:32] and actually, will probably fail [13:44:34] <_joe_> paravoid: 6 hasn't, but me and ori covered it, I can add the command there [13:45:06] I meant 8 [13:45:07] so monitoring may be out of sync with actual commands [13:45:15] (someone added 6 in the meantime :) [13:45:25] I meant the parsoid deploy, to be more clear [13:45:29] so do not freak out if we start to see replication lag issues [13:45:32] I'm adding salt commands for 9 (imagescalers) [13:45:49] as nagios may be updated asnyncly [13:46:02] <_joe_> paravoid: subbu is going to deploy parsoid [13:46:14] (03CR) 10Jcrespo: [C: 032] Depool one db server from each shard as a backup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284183 (owner: 10Jcrespo) [13:47:23] deploying the change now [13:47:27] wait for the log: [13:47:29] 06Operations, 05Continuous-Integration-Scaling: Review Jenkins isolation architecture with Antoine - https://phabricator.wikimedia.org/T92324#2217938 (10hashar) 05Open>03Resolved a:03hashar Got solved/agreed etc and we eventually have Nodepool installed on a machine in the labs support network. [13:47:37] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool one db per shard as a backup (duration: 00m 27s) [13:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:48:08] PROBLEM - puppet last run on mw2080 is CRITICAL: CRITICAL: puppet fail [13:49:28] ^ catalog fail [13:49:46] <_joe_> ? [13:49:53] <_joe_> so puppetmaster issue? [13:50:02] I'm not 100% sure, when that happens. 
probably [13:50:16] I'm in readonly mode right now though, I don't want to run the agent there and step on anyone [13:50:43] <_joe_> so, ori, I'm merging the puppet changes in the repo now [13:50:54] which puppet changes? [13:51:00] disabling jobrunners in eqiad [13:51:14] <_joe_> paravoid: points 2 and 3 [13:51:17] and block maintenance scripts from running in eqiad [13:51:24] sounds good to me [13:51:45] (03PS7) 10BBlack: acme-setup script + acme::init [puppet] - 10https://gerrit.wikimedia.org/r/283988 (https://phabricator.wikimedia.org/T132812) [13:52:01] (03CR) 10Giuseppe Lavagetto: [C: 032] switchover: stop jobrunners in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/284181 (owner: 10Giuseppe Lavagetto) [13:52:20] (03PS2) 10Giuseppe Lavagetto: switchover: block maintenance scripts from running in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/284182 [13:52:58] i'll wait until :55 exactly to salt the service stop command [13:53:04] yes please [13:53:13] <_joe_> yes that was the idea [13:53:20] I'll !log to signal the commence of the rollout [13:53:22] <_joe_> I'll puppet-merge at that moment exactly [13:53:29] then let's !log each step [13:53:33] <_joe_> yes [13:53:55] (03CR) 10Giuseppe Lavagetto: [C: 032] switchover: block maintenance scripts from running in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/284182 (owner: 10Giuseppe Lavagetto) [13:55:22] <_joe_> should we start paravoid ? [13:55:28] Am I the one deploying to tin step 4, when ready? [13:55:41] you were gonna start 39s ago :P [13:55:49] !log [switchover #1]: disabling eqiad jobrunners via "salt -C 'G@cluster:jobrunner and G@site:eqiad' cmd.run 'service jobrunner stop; service jobchron stop;'". 
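The jobrunner-stop step quoted in the !log just above can be sketched as a small wrapper. This is a hedged sketch, not the actual tooling used that day: the compound targeting expression and service names come from the logged command, while `SALT` and `stop_jobrunners` are illustrative names, with `SALT` defaulting to a dry-run `echo` so the command can be inspected without a salt master.

```shell
# Hedged sketch of "[switchover #1]" above. SALT defaults to a dry-run
# echo; point it at the real salt binary to actually execute.
SALT="${SALT:-echo salt}"

stop_jobrunners() {
    cluster="$1"
    site="$2"
    # Compound grain match, as in the logged command:
    # salt -C 'G@cluster:jobrunner and G@site:eqiad' cmd.run '...'
    $SALT -C "G@cluster:${cluster} and G@site:${site}" \
        cmd.run 'service jobrunner stop; service jobchron stop;'
}

# Dry run: prints the salt command that would be executed.
stop_jobrunners jobrunner eqiad
```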
[13:55:57] yes, go ahead [13:56:14] I like the [switchover #N] notation too, let's use that consistently [13:56:25] yep [13:56:26] <_joe_> !log [switchover #3] disabling cronjobs on terbium [13:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:56:37] all deployments happening from tin, right? [13:56:45] subbu: yes [13:56:47] k [13:57:00] #1 completed and verified [13:57:08] <_joe_> 1? [13:57:13] <_joe_> it was 2 I thought :P [13:57:16] #1 is warmup databases :P [13:57:19] <_joe_> I am waiting on puppet [13:57:40] #2, right. sorry. [13:57:44] that's going to be a recurring theme today [13:57:49] (waiting on puppet) [13:58:01] we should have call them names [13:58:54] _joe_: signal when done [13:59:00] <_joe_> ok, tendril crons still active, but we can go on [13:59:02] * mark adds to learning: puppet too slow [13:59:16] we knew that already :P [13:59:21] ssssh [13:59:24] yes, tendril is no blocker/doesn't affect mediawiki [13:59:28] (03PS2) 10Faidon Liambotis: Put eqiad in read-only mode for datacenter switchover to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283953 (https://phabricator.wikimedia.org/T124699) (owner: 10Jcrespo) [13:59:29] <_joe_> ok [13:59:38] let's move with #4 then? [13:59:41] <_joe_> yes [13:59:44] yes [13:59:57] <_joe_> I'll work on tendril in the meanwhile [14:00:06] about to merge #4 [14:00:14] thank you jynus [14:00:21] (03CR) 10Jcrespo: [C: 032] Put eqiad in read-only mode for datacenter switchover to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283953 (https://phabricator.wikimedia.org/T124699) (owner: 10Jcrespo) [14:01:21] !log [switchover #4] Set mediawiki-eqiad in read-only mode for datacenter switchover to codfw [14:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:01:51] wohoo! hour H is here! 
[14:02:01] wait for deployment confirmation [14:02:22] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Set mediawiki-eqiad in read-only mode for datacenter switchover to codfw (duration: 00m 35s) [14:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:02:45] ^that is the confirmation, secondary confirmation on edit save would be welcome [14:03:04] !log sites in planned readonly-mode, cf. http://blog.wikimedia.org/2016/04/18/wikimedia-server-switch/ [14:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:03:15] jynus: yes, confirmed [14:03:19] awesome [14:03:22] confirmed [14:03:24] if I go on edit I got the box on the right of the read-only [14:03:32] same here [14:03:41] LTR chauvinists [14:03:50] _joe_: will you do #5 (wipe memcached)? [14:03:51] :) [14:03:57] <_joe_> paravoid: yes, on it [14:04:00] ok to move ahead [14:04:01] thank you [14:04:11] that's #6, no? [14:04:21] confirmed anonymous edit has readonly block at top [14:04:29] yes that's #6 [14:04:42] QPS on the masters dropping too [14:04:57] (03CR) 10Faidon Liambotis: [C: 032] switchover: set mediawiki master datacenter to codfw [puppet] (switchover) - 10https://gerrit.wikimedia.org/r/282898 (owner: 10Giuseppe Lavagetto) [14:05:19] <_joe_> paravoid: you need to cherry-pick to production [14:05:22] ori: can you handle https://gerrit.wikimedia.org/r/#/c/282897/ next? (do not deploy just yet, just heads-up) [14:05:38] _joe_: that's the cherry-picked one, I believe [14:05:50] <_joe_> (switchover) [14:05:50] i'll rebase it but not merge yet [14:05:54] (03PS2) 10Ori.livneh: Switch wmfMasterDatacenter to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282897 (owner: 10Giuseppe Lavagetto) [14:06:05] I'm dpeloying that puppet change, ack? 
[14:06:11] this puppet change: https://gerrit.wikimedia.org/r/282898 [14:06:12] <_joe_> paravoid: we need to wait for #5 [14:06:25] yeah [14:06:27] Set $app_routes['mediawiki'] = 'codfw' in puppet (cherry-pick https://gerrit.wikimedia.org/r/282898) [14:06:42] er, right [14:07:16] also, that doesn't list a puppet run after merge, but should include one on eqiad+codfw text caches [14:07:17] jynus: are you doing "5. set eqiad databases (masters) in read-only mode."? [14:07:32] yes, it is not a blocker for the others [14:07:40] but doing it now [14:07:41] <_joe_> bblack: does it? [14:07:42] we just said it IS a blocker [14:07:52] I was about to say tht [14:08:03] oh sorry, ignore me [14:08:15] <_joe_> akosiaris: when paravoid merges that change, are you/mobrovac onto services and restbase? [14:08:33] _joe_: yeah [14:08:35] still, app_routes 282898 is listed before wmfMasterDatacenter 282897 [14:08:38] jynus: please confirm [14:08:41] wait [14:08:47] ack [14:08:55] jynus: on s1 master only heartbeat on tail -f of the binlog [14:08:56] not yet donoe [14:09:01] _joe_: akosiaris: only rb and scb nodes need the puppet run [14:09:10] akosiaris: you take scb, i'll take restbase [14:09:17] <_joe_> !log [switchover #6] disabled puppet on all redis hosts as a safety measure before inverting replication after the puppet change [14:09:18] ok [14:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:10:04] !log [switchover #5] DB Masters on eqiad set as read-only, and confirmed it [14:10:08] _joe_: can you confirm you wiped memcached? 
[14:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:10:23] spike on DB errors, looking [14:10:31] <_joe_> !log [switchover #6] wiped memcached [14:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:10:37] thank you [14:10:44] waiting for volans now [14:10:50] next step will be #7 [14:11:02] db errors should be normal, parsercache updates and other errors [14:11:14] <_joe_> paravoid: that change was not cherry-picked [14:11:33] (03PS1) 10Giuseppe Lavagetto: switchover: set mediawiki master datacenter to codfw [puppet] - 10https://gerrit.wikimedia.org/r/284189 [14:11:34] it is parsercache, volans [14:11:35] we're now 10 minutes into our 30 min read-only window [14:11:36] subbu: parsoid should be switched too as soon as _joe_ applies $app_route changes in puppet [14:11:41] most of them are REPLACE INTO `pc152` (keyname,value,exptime) VALUES jynus [14:11:44] yes [14:11:56] <_joe_> paravoid: when ready merge https://gerrit.wikimedia.org/r/284189 [14:11:59] _joe_: it's on top of current production and applies cleanly [14:12:00] everything seems ok on db side, we can continue [14:12:01] <_joe_> and tell us [14:12:11] (03CR) 10Yurik: [C: 031] "Hoo, could you elaborate? This is exactly the same approach as used by all the other api calls, such as pageviews api, MW api, etc. How i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284091 (https://phabricator.wikimedia.org/T126741) (owner: 10Yurik) [14:12:16] mobrovac, ok. let me know when. [14:12:20] I'm not sure what the difference that you see is [14:12:34] <_joe_> branch: production vs branch: switchover [14:12:36] i'll do https://gerrit.wikimedia.org/r/#/c/282897/ after paravoid [14:12:40] oh, ugh [14:12:53] <_joe_> heh [14:12:57] volans, jynus: ack to proceed? [14:13:03] I don't see 284189 in our directions at all, is that a replacement for something else? 
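Step #5 above ("DB Masters on eqiad set as read-only, and confirmed it") comes down to checking `@@global.read_only` on each master. A hedged sketch of the confirmation half only: `MYSQL_QUERY` is a stand-in for a real client invocation (something like `mysql -BN -h HOST -e SQL`) and defaults to an offline stub, and any host names used with it are illustrative, not the actual eqiad master list.

```shell
# Hedged sketch: verify that a list of masters really are read-only.
# MYSQL_QUERY stands in for a real client invocation; the default stub
# answers "1" for every host so the loop itself is runnable offline.
stub_query() { echo 1; }
MYSQL_QUERY="${MYSQL_QUERY:-stub_query}"

assert_read_only() {
    failed=0
    for host in "$@"; do
        val=$($MYSQL_QUERY "$host" "SELECT @@global.read_only")
        if [ "$val" = "1" ]; then
            echo "$host: read-only OK"
        else
            echo "$host: NOT read-only (read_only=$val)" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

For example, `assert_read_only db1052 db1038` (illustrative host names) would print one OK line per host and return non-zero if any master still accepted writes.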
[14:13:14] paravoid, jynus> everything seems ok on db side, we can continue [14:13:19] ack [14:13:21] <_joe_> bblack: it's the cherry-pick of the change in the directions [14:13:23] yeah, that cherry-picking thing needs to go in our learnings :) [14:13:29] don't do that branch thing again :P [14:13:31] ok [14:13:32] ok [14:13:33] proceeding with #7 [14:13:33] haha [14:13:45] (03CR) 10Faidon Liambotis: [C: 032] switchover: set mediawiki master datacenter to codfw [puppet] - 10https://gerrit.wikimedia.org/r/284189 (owner: 10Giuseppe Lavagetto) [14:13:45] is in it already [14:14:01] (03CR) 10Faidon Liambotis: [V: 032] switchover: set mediawiki master datacenter to codfw [puppet] - 10https://gerrit.wikimedia.org/r/284189 (owner: 10Giuseppe Lavagetto) [14:14:07] <_joe_> paravoid: tell us when puppet-merged [14:14:16] lemme know when 7a 7b && 7c are done [14:14:29] <_joe_> mobrovac: they can go in parallel [14:14:30] !log [switchover #7] setging mediawiki master datacenter to codfw in puppet [14:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:14:42] it's palladium-merged [14:14:50] ok [14:14:51] ori: go ahead [14:14:51] <_joe_> !log [switchover #7] running puppet on mc* hosts in codfw [14:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:15:04] mobrovac: go ahead [14:15:05] did we merge the mw-config one yet? isn't that before all the puppet runs? [14:15:07] kk [14:15:10] !log [switchover #7] Switch wmfMasterDatacenter to codfw (https://gerrit.wikimedia.org/r/#/c/282897/) [14:15:13] subbu: go, you too [14:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:15:20] (03CR) 10Ori.livneh: [C: 032 V: 032] Switch wmfMasterDatacenter to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282897 (owner: 10Giuseppe Lavagetto) [14:15:20] ok. 
starting parsoid sync [14:15:26] bblack: ori just doing that [14:15:36] !log [switchover #7] puppet agent -tv && restbase restart [14:15:38] doesn't interact with the puppet runs that started first? [14:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:16:17] mobrovac/akosiaris: are you doing sca/scb? [14:16:22] !log [switchover #7] puppet agent -t -v on SCA, SCB cluster [14:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:16:28] awesome, thanks [14:16:30] synced code. restarting parsoid [14:16:33] we're now halfway our 30 min read-only window [14:16:36] !log ori@tin Synchronized wmf-config/CommonSettings.php: Idbfb0184d: Switch wmfMasterDatacenter to codfw (duration: 00m 30s) [14:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:16:49] let's proceed with #9, yes? [14:16:58] deploy the varnish change, that is [14:17:00] any objections? [14:17:07] <_joe_> paravoid: nope [14:17:19] <_joe_> as long as dbs are read-only in both dcs [14:17:25] the directions should've documented which steps are serial dependencies and which can be parallel [14:17:35] bblack: i already put that in the learning pad [14:17:35] dbs are read-only on both datacenters [14:17:36] in the learning [14:17:37] <_joe_> yes [14:17:38] ok [14:17:42] I'm assuming 7c (puppet runs) did *not* depende on 7b (mw-config) [14:17:52] i am getting this error with restart .. [14:17:52] subbu@earth:~$ for wtp in `ssh ssastry@bast1001.wikimedia.org cat /etc/dsh/group/parsoid` ; do echo $wtp ; ssh ssastry@$wtp sudo service parsoid restart ; done [14:17:52] cat: /etc/dsh/group/parsoid: No such file or directory [14:17:56] in spite of the "consequences" language? [14:17:56] what bast node should i use? [14:18:09] <_joe_> puppet is taking forever to run [14:18:11] <_joe_> of course [14:18:14] urgh [14:18:17] subbu: i do it [14:18:20] ok, thanks. 
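subbu's one-liner above died because `/etc/dsh/group/parsoid` was missing on the bastion. A hedged sketch of the same rolling restart with that failure mode handled: the group-file path comes from the log, but `RESTART` defaults to a dry-run `echo` rather than the real `ssh HOST sudo service parsoid restart`, and `rolling_restart` is an illustrative name, not an existing tool.

```shell
# Hedged sketch of the rolling parsoid restart above, with an explicit
# check for the missing dsh group file instead of a silent empty loop.
# RESTART defaults to a dry-run echo; the real loop would ssh per host.
GROUP_FILE="${GROUP_FILE:-/etc/dsh/group/parsoid}"
RESTART="${RESTART:-echo would-restart}"

rolling_restart() {
    if [ ! -r "$GROUP_FILE" ]; then
        echo "host list $GROUP_FILE missing or unreadable" >&2
        return 1
    fi
    while IFS= read -r host; do
        # Skip blank lines and comments in the group file.
        case "$host" in ''|'#'*) continue ;; esac
        $RESTART "$host"
    done < "$GROUP_FILE"
}
```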
[14:18:23] I was about to ask you akosiaris :) [14:18:32] (still seeing most of the traffic going to eqiad backends at this moment) [14:18:34] (this is probably an artifact of the bast1001 reinstall, I'm guessing) [14:18:44] * bblack prepping/rebasing for #9 (Deploy Varnish to switch backend to appserver.svc.codfw.wmnet/api.svc.codfw.wmnet) [14:18:47] RECOVERY - puppet last run on mw2080 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [14:18:49] thank you bblack [14:18:57] (03PS3) 10BBlack: switchover: switch api/appservers/rendering varnish routing from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/282910 [14:19:00] ori: any objections from your side with moving forward with #9? [14:19:09] !log manually restarted parsoid on wtp1001 and confirmed html identical before/after switchover on enwiki:Hospet [14:19:11] no objections, paravoid [14:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:19:27] bblack: go ahead with #9 [14:19:38] !log [switchover #8] restarting parsoid on all wtp nodes [14:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:19:45] <_joe_> !log [switchover #7] memcached redises are now masters in codfw, running puppet on eqiad to start replicating [14:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:19:55] (03CR) 10BBlack: [C: 032 V: 032] switchover: switch api/appservers/rendering varnish routing from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/282910 (owner: 10BBlack) [14:20:12] poor puppetmasters ... 
100% constantly [14:20:12] <_joe_> puppet runs in codfw are way slower than in eqiad btw [14:20:12] godog: prepare for #10 [14:20:26] paravoid: yup [14:20:41] (03PS4) 10Filippo Giunchedi: swift: switch to codfw imagescalers [puppet] - 10https://gerrit.wikimedia.org/r/268080 (https://phabricator.wikimedia.org/T91869) [14:20:45] it can happen in parallel too, but since #9 is one of the most risky parts of the migration, let's wait for that to be done first [14:20:55] !log [switchover #9] varnish - change merged, puppet runs starting [14:20:57] PROBLEM - MySQL Replication Heartbeat on db1040 is CRITICAL: NRPE: Unable to read output [14:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:21:02] yeah no problem in waiting for #10 [14:21:15] akosiaris, i am not seeing restart log messages on wtp2001.codfw.net in /var/log/parsoid/parsoid.log [14:21:21] did you restart codfw nodes as well? [14:21:26] ^checking PROBLEM [14:21:41] probably not an issue [14:21:42] subbu: yup [14:21:53] it just reported having done it successfully [14:21:59] PROBLEM - Redis status tcp_6379 on mc2004 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.0.37:6379 has 1 databases (db0) with 132831 keys [14:22:07] _joe_: ^^ [14:22:07] PROBLEM - Redis status tcp_6379 on mc2009 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.16.39:6379 has 1 databases (db0) with 125065 keys [14:22:08] PROBLEM - Redis status tcp_6379 on mc2012 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.16.42:6379 has 1 databases (db0) with 131907 keys [14:22:10] <_joe_> this is expected [14:22:21] subbu: is it ok now ? can you ack ? [14:22:30] <_joe_> paravoid: discard redis messages [14:22:30] i see it now. 
[14:22:33] ack [14:22:37] PROBLEM - Redis status tcp_6379 on mc2014 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.32.21:6379 has 1 databases (db0) with 154029 keys [14:22:37] PROBLEM - Redis status tcp_6379 on mc2007 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.16.37:6379 has 1 databases (db0) with 147886 keys [14:22:37] PROBLEM - Redis status tcp_6380 on mc2001 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.0.34:6380 has 1 databases (db0) with 142851 keys [14:22:38] ok, thanks [14:22:39] PROBLEM - Redis status tcp_6379 on mc2015 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.32.22:6379 has 1 databases (db0) with 146559 keys [14:22:39] PROBLEM - Redis status tcp_6379 on mc2002 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.0.35:6379 has 1 databases (db0) with 153457 keys [14:22:40] akosiaris, must have been a rolling retart. [14:22:46] subbu: yup [14:22:48] PROBLEM - Redis status tcp_6379 on mc2006 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.0.39:6379 has 1 databases (db0) with 148754 keys [14:22:49] PROBLEM - Redis status tcp_6379 on mc2001 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.0.34:6379 has 1 databases (db0) with 140053 keys [14:22:49] PROBLEM - Redis status tcp_6380 on mc2016 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.32.23:6380 has 1 databases (db0) with 152470 keys [14:22:57] expect also replication lag alerts regarding mysql, they are not blockers (they are a consequence of being read-only) [14:22:57] akosiaris, i confirmed on wtp2001.codfw and wtp1003.eqiad [14:23:02] why nobody sent an email saying "this is happening now, RO phase" ? 
[14:23:08] PROBLEM - Redis status tcp_6379 on mc2016 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.32.23:6379 has 1 databases (db0) with 175083 keys [14:23:08] PROBLEM - Redis status tcp_6379 on mc2008 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.16.38:6379 has 1 databases (db0) with 129407 keys [14:23:09] subbu: great! [14:23:27] PROBLEM - Redis status tcp_6379 on mc2010 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.16.40:6379 has 1 databases (db0) with 153045 keys [14:23:27] PROBLEM - Redis status tcp_6379 on mc2011 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.16.41:6379 has 1 databases (db0) with 153620 keys [14:23:28] PROBLEM - Redis status tcp_6379 on mc2005 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.0.38:6379 has 1 databases (db0) with 154362 keys [14:23:30] bblack: to be clear -- waiting for you to confirm [14:23:38] PROBLEM - Redis status tcp_6379 on mc2003 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.0.36:6379 has 1 databases (db0) with 155775 keys [14:23:38] PROBLEM - Redis status tcp_6379 on mc2013 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.32.20:6379 has 1 databases (db0) with 152217 keys [14:23:41] <_joe_> ori: are you doing the rdb hosts? [14:23:48] mark: mobrovac's comment in the learning pad please [14:23:51] paravoid: ack, still puppeting [14:23:54] application servers in codfw surging in traffic [14:23:57] _joe_: they just finished [14:24:02] <_joe_> cool [14:24:03] mobrovac: mail where? [14:24:03] :-) [14:24:14] <_joe_> can someone take a look at the error logs maybe? [14:24:17] mark: to wikitech, wikimedia, somewhere [14:24:28] it was announced... 
anyway, we'll discuss later [14:24:30] mark, confirm dbs in codfw are increasing in traffic, all nominal for now [14:24:37] PROBLEM - MySQL Replication Heartbeat on db1058 is CRITICAL: NRPE: Unable to read output [14:24:50] <_joe_> traffic coming to the api servers too [14:25:05] !log [switchover #9] varnish - puppet runs complete - done [14:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:25:10] awesome [14:25:15] <_joe_> cool [14:25:21] the site works for me [14:25:21] !log [switchover #7] restbase now uses MW from codfw [14:25:25] <_joe_> godog: you're up now :P [14:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:25:37] everyone confirm everything looks okay and we can move forward? [14:25:39] as a reminder, ignore alerts related to heartbeat/replication for now [14:25:46] godog: please go ahead [14:25:49] <_joe_> aye [14:25:49] +1 [14:25:52] (03CR) 10Lydia Pintscher: [C: 04-1] "We should not do this without a clearer understanding of how it is going to play together with the queries we have planned for on-wiki. I'" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284091 (https://phabricator.wikimedia.org/T126741) (owner: 10Yurik) [14:26:00] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: switch to codfw imagescalers [puppet] - 10https://gerrit.wikimedia.org/r/268080 (https://phabricator.wikimedia.org/T91869) (owner: 10Filippo Giunchedi) [14:26:02] jynus: prepare for #11 :) [14:26:03] confirm anonymous cache-miss on enwiki works, and edit -> readonly block still [14:26:04] I am seeing a lot of 500s for GETs btw... 
[14:26:09] preparing for #11 [14:26:20] yeah, 5xxs are spiking [14:26:27] jynus: I think you can already kill pt-heartbeat in the eqiad masters [14:26:29] <_joe_> akosiaris: yes [14:26:35] it's very slow [14:26:40] volans, please help me with that [14:26:47] jynus: do not proceed with #11 yet, let's investigate the 500s first [14:26:48] PROBLEM - Redis status tcp_6380 on rdb2005 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.32.133:6380 has 1 databases (db0) with 5026135 keys [14:26:50] well first cache fill anyways, if I pick an unlikely page [14:26:54] as of 2 mins back, I see parsoid requests now coming to codfw .. stopped on eqiad. [14:27:00] paravoid: I've already puppet-merged, rollback #10 ? [14:27:02] jynus: sure, if I can do that, paravoid ok [14:27:04] godog: no [14:27:08] PROBLEM - Redis status tcp_6381 on rdb2005 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.32.133:6381 has 1 databases (db0) with 5026851 keys [14:27:08] PROBLEM - Redis status tcp_6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.32.133:6480 has 1 databases (db0) with 5020764 keys [14:27:08] PROBLEM - Redis status tcp_6378 on rdb2001 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.0.119:6378 has 1 databases (db0) with 14 keys [14:27:15] <_joe_> jynus, volans can it be the parsercache? 
[14:27:15] lots of: Caused by: [Exception DBConnectionError] (/srv/mediawiki/php-1.27.0-wmf.21/includes/db/Database.php:743) DB connection error: Can't connect to MySQL server on '10.192.16.172' (4) (10.192.16.172) [14:27:27] that's es2018 [14:27:29] PROBLEM - Redis status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.32.133:6479 has 1 databases (db0) with 5010242 keys [14:27:37] PROBLEM - Redis status tcp_6379 on rdb2001 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.0.119:6379 has 1 databases (db0) with 9710825 keys [14:27:37] PROBLEM - Redis status tcp_6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.32.133:6481 has 1 databases (db0) with 5031574 keys [14:27:38] it could be a grant problem [14:27:51] checking it [14:27:52] some also: Caused by: [Exception DBConnectionError] (/srv/mediawiki/php-1.27.0-wmf.21/includes/db/Database.php:743) DB connection error: Too many connections (10.192.0.142) [14:28:04] PROBLEM - Redis status tcp_6380 on rdb2001 is CRITICAL: CRITICAL: replication_delay data is missing - REDIS on 10.192.0.119:6380 has 1 databases (db0) with 5008314 keys [14:28:04] then, no, it is a saturation problem [14:28:06] RECOVERY - Redis status tcp_6379 on mc2014 is OK: OK: REDIS on 10.192.32.21:6379 has 1 databases (db0) with 155388 keys [14:28:06] RECOVERY - Redis status tcp_6379 on mc2007 is OK: OK: REDIS on 10.192.16.37:6379 has 1 databases (db0) with 149110 keys [14:28:06] RECOVERY - Redis status tcp_6380 on mc2001 is OK: OK: REDIS on 10.192.0.34:6380 has 1 databases (db0) with 144121 keys [14:28:08] I see 5xx in my varnish-fe graphs spiking around :22->:25, but seems to be coming back to normal now [14:28:13] (for cache_text) [14:28:14] RECOVERY - Redis status tcp_6381 on rdb2005 is OK: OK: REDIS on 10.192.32.133:6381 has 1 databases (db0) with 5026823 keys [14:28:14] Threadpool could not create additional thread to handle queries, 
because the number of allowed threads was reached. [14:28:14] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [14:28:15] RECOVERY - Redis status tcp_6378 on rdb2001 is OK: OK: REDIS on 10.192.0.119:6378 has 1 databases (db0) with 14 keys [14:28:15] RECOVERY - Redis status tcp_6480 on rdb2005 is OK: OK: REDIS on 10.192.32.133:6480 has 1 databases (db0) with 5020753 keys [14:28:18] on es2018 [14:28:24] RECOVERY - Redis status tcp_6379 on mc2002 is OK: OK: REDIS on 10.192.0.35:6379 has 1 databases (db0) with 154783 keys [14:28:24] RECOVERY - Redis status tcp_6379 on mc2015 is OK: OK: REDIS on 10.192.32.22:6379 has 1 databases (db0) with 147761 keys [14:28:26] wfLogDBError.log is a mess [14:28:35] RECOVERY - Redis status tcp_6380 on mc2016 is OK: OK: REDIS on 10.192.32.23:6380 has 1 databases (db0) with 153809 keys [14:28:35] RECOVERY - Redis status tcp_6379 on mc2006 is OK: OK: REDIS on 10.192.0.39:6379 has 1 databases (db0) with 149998 keys [14:28:35] RECOVERY - Redis status tcp_6379 on mc2001 is OK: OK: REDIS on 10.192.0.34:6379 has 1 databases (db0) with 141181 keys [14:28:35] RECOVERY - Redis status tcp_6379 on mc2011 is OK: OK: REDIS on 10.192.16.41:6379 has 1 databases (db0) with 154663 keys [14:28:38] yup [14:28:41] there is high load on es2* servers [14:28:43] because of read-only mode though [14:28:44] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [14:28:45] RECOVERY - Redis status tcp_6479 on rdb2005 is OK: OK: REDIS on 10.192.32.133:6479 has 1 databases (db0) with 5010234 keys [14:28:45] RECOVERY - Redis status tcp_6481 on rdb2005 is OK: OK: REDIS on 10.192.32.133:6481 has 1 databases (db0) with 5031520 keys [14:28:45] RECOVERY - Redis status tcp_6379 on rdb2001 is OK: OK: REDIS on 10.192.0.119:6379 has 1 databases (db0) with 9710806 keys [14:28:46] RECOVERY - Redis status tcp_6379 on mc2008 is OK: OK: 
REDIS on 10.192.16.38:6379 has 1 databases (db0) with 130286 keys [14:29:04] RECOVERY - Redis status tcp_6379 on mc2005 is OK: OK: REDIS on 10.192.0.38:6379 has 1 databases (db0) with 155418 keys [14:29:05] RECOVERY - Redis status tcp_6380 on rdb2001 is OK: OK: REDIS on 10.192.0.119:6380 has 1 databases (db0) with 5008279 keys [14:29:06] RECOVERY - Redis status tcp_6379 on mc2016 is OK: OK: REDIS on 10.192.32.23:6379 has 1 databases (db0) with 176670 keys [14:29:23] is it read only or is it max connections? [14:29:24] RECOVERY - Redis status tcp_6379 on mc2010 is OK: OK: REDIS on 10.192.16.40:6379 has 1 databases (db0) with 154184 keys [14:29:35] RECOVERY - Redis status tcp_6379 on mc2013 is OK: OK: REDIS on 10.192.32.20:6379 has 1 databases (db0) with 153311 keys [14:29:37] 1425 open connections on es2018 [14:29:38] some are because of read-only mode and are thus noise [14:29:42] <_joe_> jynus: the 5xx logs seem to go down [14:29:49] <_joe_> so maybe it's getting better? [14:29:51] 2000 connections to es2* [14:29:56] out of threads [14:30:06] reached thread_pool_max_threads [14:30:13] that was within the reasonable, problems on first spike [14:30:14] RECOVERY - Redis status tcp_6379 on mc2004 is OK: OK: REDIS on 10.192.0.37:6379 has 1 databases (db0) with 134145 keys [14:30:19] <_joe_> yes [14:30:31] so far data from 14:29 onwards looks like 5xx is dropped back off mostly [14:30:35] should I go out of script and enable read-write on parsercaches? [14:30:36] 500s seem to be down [14:30:37] yes [14:30:38] <_joe_> was anyone logged in and is still logged in? [14:30:49] _joe_: i am, yes [14:30:55] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1168 bytes in 0.121 second response time [14:30:55] <_joe_> oh cool [14:30:55] _joe_: me too [14:30:59] waiting to see if it is getting better [14:31:00] _joe_: logged in on wiki?
yes me [14:31:05] RECOVERY - Redis status tcp_6379 on mc2012 is OK: OK: REDIS on 10.192.16.42:6379 has 1 databases (db0) with 133165 keys [14:31:05] dispatch lag also expected [14:31:08] _joe_: yes [14:31:09] <_joe_> that is expected (the wikidata lag) [14:31:13] 500s are getting back to normal [14:31:14] RECOVERY - Redis status tcp_6380 on rdb2005 is OK: OK: REDIS on 10.192.32.133:6380 has 1 databases (db0) with 5026014 keys [14:31:21] <_joe_> cool so sessions migrated correctly [14:31:24] RECOVERY - Redis status tcp_6379 on mc2009 is OK: OK: REDIS on 10.192.16.39:6379 has 1 databases (db0) with 126216 keys [14:31:26] we're now at the :30 mark of the read-only window [14:31:26] held #10, ready to resume btw [14:31:32] <_joe_> paravoid: should we go on? [14:31:37] volans, jynus: waiting for you to confirm whether the db load is back to normal again [14:31:40] _joe_: ^ [14:31:42] loadavg on es2018 is half now, seems recovering [14:31:44] RECOVERY - Redis status tcp_6379 on mc2003 is OK: OK: REDIS on 10.192.0.36:6379 has 1 databases (db0) with 157194 keys [14:31:49] yes [14:31:52] recovered [14:31:54] 168 connections now [14:31:54] ok [14:31:55] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [14:31:55] <_joe_> mark: yeah we're a bit late, but not dramatically I'd say [14:31:58] I think we should go on [14:32:09] applying #11, wait for log [14:32:09] please proceed then [14:32:14] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [14:32:18] ok I'll finish up #10 [14:32:22] <_joe_> godog: go on yes :) [14:32:22] jynus: do you want me to kill pt-heartbeat?
[14:32:30] <_joe_> volans: he said so, yes [14:32:33] volans, please, go on [14:32:34] !log [switchover #10] running puppet on ms-fe and reload swift [14:32:35] at :30, data still shows a small increase in 500 (not 503), but it's fairly small in the overall [14:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:32:41] (03PS3) 10Jcrespo: MariaDB: set codfw local masters as masters (s1-s7) [puppet] - 10https://gerrit.wikimedia.org/r/284144 (https://phabricator.wikimedia.org/T124699) (owner: 10Volans) [14:32:45] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [14:32:56] (03CR) 10Jcrespo: [C: 032 V: 032] MariaDB: set codfw local masters as masters (s1-s7) [puppet] - 10https://gerrit.wikimedia.org/r/284144 (https://phabricator.wikimedia.org/T124699) (owner: 10Volans) [14:33:00] ~ 5/sec 500s, vs usually near-zero [14:33:24] <_joe_> bblack: well let's see once the migration is done [14:34:04] !log [switchover #11] applying $master change for codfw masters [14:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:34:10] still only on palladium [14:34:16] it's dropping back to near-zero in :31-:32 data now, will know for sure in a couple more minutes [14:34:17] running puppet now [14:34:49] <_joe_> jynus: should you also set the codfw master to rw now, right? [14:34:57] <_joe_> or should we wait for this step to finish? [14:35:17] _joe_: the eqiad one you mean? [14:35:25] I hope not :P [14:35:30] <_joe_> paravoid: nope :P [14:35:40] I need heartbeat working first [14:35:45] then, set read-write [14:35:56] fail [14:36:02] did you mean #12, _joe_? 
[14:36:11] if that's the case, then no [14:36:43] puppet is very slow [14:37:05] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:37:21] <_joe_> yeah I meant #12 [14:37:25] PROBLEM - MySQL Replication Heartbeat on db1038 is CRITICAL: NRPE: Unable to read output [14:37:28] _joe_: you're so impatient :) [14:37:36] !log [switchover #10] puppet and swift reload finished [14:37:40] <_joe_> I wanted to help :P [14:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:37:52] ah good I was wondering about swift.. ok [14:38:21] <_joe_> gehel: is search all right? [14:38:24] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:38:28] just tried resizing an image, it worked [14:38:36] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:38:50] jynus: status? [14:39:11] puppet being run on the masters [14:39:30] <_joe_> we need to upgrade thos puppetmasters to ruby 2.1 [14:39:36] heartbeat killed on all eqiad masters [14:39:41] I think we can go on the next step [14:39:42] <_joe_> that would run way faster [14:39:45] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:39:48] what next step? [14:39:52] talk explicitly :) [14:39:54] with the only risk of getting replication alerts [14:40:01] let me see [14:40:03] #12 [14:40:09] #12 is the next step, read-write [14:40:14] but let's wait for #11 to be done first [14:40:15] no [14:40:18] search seems alright... 
[14:40:21] no go for 12 yet [14:40:39] I need to do subtask #11, which is: set db masters in read-write [14:40:41] doing it now [14:40:45] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:40:55] so next sub step [14:40:56] right we need 11a+b+c before 12, I believe [14:41:04] mark, yes, sorry [14:41:10] CirrusSearch sees a decrease in response time, just as expected... [14:41:10] no go yet for #12 [14:41:17] doing #11-2 [14:41:34] PROBLEM - MySQL Replication Heartbeat on db1023 is CRITICAL: NRPE: Unable to read output [14:41:43] ^ignore [14:41:48] what's the ETA for being ready for #12? [14:41:52] PROBLEM - MariaDB Slave Lag: s4 on db1056 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 322.03 seconds [14:41:53] 1 minute [14:42:05] PROBLEM - High lag on wdqs1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1800.0] [14:42:22] PROBLEM - MariaDB Slave Lag: s5 on db2023 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 333.22 seconds [14:42:22] PROBLEM - MariaDB Slave Lag: s5 on db2066 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 334.62 seconds [14:42:32] gehel: can you investigate wdqs in the meantime?
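pt-heartbeat, which the team is juggling above (killed on the old eqiad masters, needed on the codfw ones before going read-write), measures replication lag by writing a timestamp row on the master every second and comparing it with the clock on each replica, rather than trusting Seconds_Behind_Master. A minimal sketch of that comparison; the timestamps and the 300 s paging threshold here are illustrative, not the production values:

```python
from datetime import datetime, timedelta

def heartbeat_lag(master_ts, replica_now):
    """pt-heartbeat's core idea: lag is the replica's wall clock minus
    the newest heartbeat timestamp that replicated from the master."""
    return (replica_now - master_ts).total_seconds()

# Invented example: the heartbeat row last replicated 322 s ago,
# matching alerts like "Replication lag: 322.03 seconds".
now = datetime(2016, 4, 19, 14, 41, 52)
lag = heartbeat_lag(now - timedelta(seconds=322), now)
assert lag == 322.0 and lag > 300  # over the (illustrative) paging threshold
```

This is also why the lag checks spew false positives mid-switchover: until pt-heartbeat runs on the new masters, the heartbeat row simply stops advancing.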
[14:42:42] PROBLEM - MariaDB Slave Lag: s4 on db1064 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 368.46 seconds [14:42:49] PROBLEM - MariaDB Slave Lag: s4 on db1059 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 368.94 seconds [14:42:49] PROBLEM - MariaDB Slave Lag: s4 on db2044 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 369.05 seconds [14:43:00] PROBLEM - MariaDB Slave Lag: s6 on db2053 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 360.21 seconds [14:43:09] !log [swithchover #11-2] Set and confirmed codfw master dbs in read-write [14:43:09] PROBLEM - MariaDB Slave Lag: s5 on db1026 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 384.94 seconds [14:43:09] paravoid: sure [14:43:09] PROBLEM - MariaDB Slave Lag: s6 on db2039 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 366.10 seconds [14:43:09] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 402.12 seconds [14:43:09] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 402.15 seconds [14:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:43:24] the errors come from the not finished puppet [14:43:27] PROBLEM - MariaDB Slave Lag: s4 on db1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 418.29 seconds [14:43:27] PROBLEM - MariaDB Slave Lag: s7 on db1062 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 365.43 seconds [14:43:30] _joe_: as for puppet, 2.1 is not going to be miraculous; we just need to rely on it far less for runtime configuratons [14:43:34] they should not affect users [14:43:36] PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 374.97 seconds [14:43:37] jynus: ack [14:43:41] <_joe_> paravoid: and that too, yes [14:43:42] we can go with #12 [14:43:46] PROBLEM - MariaDB Slave Lag: s4 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 437.65 seconds [14:43:46] jynus: 
can you confirm which #11 substeps are done now? [14:43:55] PROBLEM - MariaDB Slave Lag: s6 on db2067 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 410.75 seconds [14:43:58] 11-1 (in progress) [14:44:04] 11-2 (done) [14:44:07] PROBLEM - MariaDB Slave Lag: s5 on db1045 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 440.53 seconds [14:44:14] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 458.24 seconds [14:44:14] PROBLEM - MariaDB Slave Lag: s7 on db2040 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 405.28 seconds [14:44:15] 11-3 (aborted) [14:44:21] PROBLEM - MariaDB Slave Lag: es2 on es2014 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 383.18 seconds [14:44:23] aborted? [14:44:25] ? [14:44:25] aborted? [14:44:33] not a blocker [14:44:34] PROBLEM - MariaDB Slave Lag: s7 on db1034 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 432.61 seconds [14:44:34] PROBLEM - MySQL Replication Heartbeat on db1052 is CRITICAL: NRPE: Unable to read output [14:44:34] PROBLEM - MariaDB Slave Lag: s5 on db1071 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 469.40 seconds [14:44:36] jynus: 11-3 looks done for me, I see read_only = OFF on codfw [14:44:41] masters [14:44:42] PROBLEM - MariaDB Slave Lag: s5 on db1049 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 469.55 seconds [14:44:42] PROBLEM - MariaDB Slave Lag: s5 on db2045 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 470.33 seconds [14:44:44] jynus: 11-3 is "Set codfw masters mysql as read-write" [14:44:48] how is 11-3 aborted not a blocker? 
[14:44:49] PROBLEM - MariaDB Slave Lag: s6 on db1061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 451.00 seconds [14:44:49] PROBLEM - MariaDB Slave Lag: s5 on db1070 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 473.13 seconds [14:44:50] PROBLEM - MariaDB Slave Lag: s6 on db2046 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 455.40 seconds [14:44:53] sorry [14:44:57] I meant 11-4 [14:44:57] maybe he means 11-4 [14:44:58] PROBLEM - MariaDB Slave Lag: s6 on db1037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 457.60 seconds [14:45:00] ah ok [14:45:01] ok [14:45:03] sorry about the confusion [14:45:04] ok [14:45:06] everyone ok to proceed with #12 then? [14:45:10] yes [14:45:11] PROBLEM - MariaDB Slave Lag: s4 on db1042 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 525.26 seconds [14:45:12] PROBLEM - MariaDB Slave Lag: es3 on es1014 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 433.70 seconds [14:45:12] PROBLEM - MariaDB Slave Lag: s4 on db1068 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 525.99 seconds [14:45:12] PROBLEM - MariaDB Slave Lag: s5 on db2052 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 510.02 seconds [14:45:12] PROBLEM - MariaDB Slave Lag: s5 on db2059 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 510.43 seconds [14:45:17] yes [14:45:19] yes [14:45:20] PROBLEM - MariaDB Slave Lag: s7 on db2054 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 474.24 seconds [14:45:20] PROBLEM - MariaDB Slave Lag: s6 on db2028 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 491.41 seconds [14:45:20] +1 [14:45:21] please someone else apply #12 [14:45:24] ori: can you?
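Step 11-3 above ("Set codfw masters mysql as read-write") flips MariaDB's global `read_only` flag on each codfw master, and the confirmation volans describes ("I see read_only = OFF on codfw masters") amounts to polling `SELECT @@global.read_only` on every master. A hypothetical sketch of that gate; the host names and flag values here are invented, and the real fetch would go through the mysql client:

```python
def masters_writable(read_only_flags):
    """Given {host: value of @@global.read_only} polled from each codfw
    master, return the hosts still read-only (empty list == safe to
    proceed to step #12)."""
    return sorted(host for host, flag in read_only_flags.items() if flag != 0)

# Hypothetical poll results: one master not flipped yet.
assert masters_writable({"db2016": 0, "db2017": 0, "db2018": 1}) == ["db2018"]
# All writable: the check that unblocked "#12 is the next step, read-write".
assert masters_writable({"db2016": 0, "db2017": 0, "db2018": 0}) == []
```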
[14:45:27] PROBLEM - MariaDB Slave Lag: es2 on es2016 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 453.58 seconds [14:45:29] while I fix 11-1 [14:45:30] ok [14:45:33] thanks [14:45:38] (the cause of the alerts) [14:45:45] PROBLEM - MariaDB Slave Lag: s6 on db1030 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 514.52 seconds [14:45:45] PROBLEM - MariaDB Slave Lag: s4 on db2019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 551.45 seconds [14:45:50] (03PS2) 10Ori.livneh: Set codfw databases in read-write [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284157 (owner: 10Jcrespo) [14:46:00] (03CR) 10Ori.livneh: [C: 032] Set codfw databases in read-write [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284157 (owner: 10Jcrespo) [14:46:05] _joe_: can you prepare for #13/#14? [14:46:06] PROBLEM - MariaDB Slave Lag: es3 on es1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 479.09 seconds [14:46:13] PROBLEM - MariaDB Slave Lag: s5 on db2038 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 554.66 seconds [14:46:18] (03CR) 10Ori.livneh: [V: 032] Set codfw databases in read-write [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284157 (owner: 10Jcrespo) [14:46:20] PROBLEM - MariaDB Slave Lag: s6 on db1050 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 536.55 seconds [14:46:20] PROBLEM - MariaDB Slave Lag: s6 on db2060 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 537.65 seconds [14:46:20] <_joe_> paravoid: yes, but we have an issue there [14:46:29] <_joe_> O [14:46:29] PROBLEM - MariaDB Slave Lag: es3 on es1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 484.57 seconds [14:46:39] do tell [14:46:40] PROBLEM - MariaDB Slave Lag: es3 on es2017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 518.46 seconds [14:46:41] ACKNOWLEDGEMENT - High lag on wdqs1001 is CRITICAL: CRITICAL: 46.67% of data above the critical threshold [1800.0] Gehel investigating [14:46:41] ACKNOWLEDGEMENT - High lag on wdqs1002 is 
CRITICAL: CRITICAL: 36.67% of data above the critical threshold [1800.0] Gehel investigating [14:46:48] PROBLEM - MariaDB Slave Lag: s7 on db1028 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 562.45 seconds [14:46:55] jynus: ok to sync? [14:46:56] PROBLEM - MariaDB Slave Lag: es2 on es1011 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 542.55 seconds [14:47:02] so here is the thing - replication is running, it is the alerts that have an issue with the change [14:47:05] wfLogDBError back to low values [14:47:06] PROBLEM - MariaDB Slave Lag: es2 on es2015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 561.83 seconds [14:47:11] ori, please deploy [14:47:13] PROBLEM - MariaDB Slave Lag: x1 on db1031 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 523.56 seconds [14:47:20] PROBLEM - MariaDB Slave Lag: s7 on db1039 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 585.15 seconds [14:47:28] PROBLEM - MariaDB Slave Lag: s6 on db1022 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 606.52 seconds [14:47:37] PROBLEM - MariaDB Slave Lag: s7 on db2029 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 591.81 seconds [14:47:37] PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 592.16 seconds [14:47:59] PROBLEM - MariaDB Slave Lag: es3 on es2019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 604.25 seconds [14:47:59] all replicas are healthy and with 0 lag right now [14:48:05] I double checked [14:48:08] PROBLEM - MariaDB Slave Lag: es3 on es2018 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 612.53 seconds [14:48:11] ok [14:48:14] I put it down in learnings [14:48:17] !log ori@tin Synchronized wmf-config/db-codfw.php: [switchover #12] I5e9635b8f4: Set codfw databases in read-write (duration: 00m 35s) [14:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:48:25] PROBLEM - MariaDB Slave Lag: es2 on es1015 is CRITICAL: CRITICAL slave_sql_lag
Replication lag: 635.97 seconds [14:48:28] so, all those pages are false positives ? [14:48:32] PROBLEM - MariaDB Slave Lag: es2 on es1013 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 637.24 seconds [14:48:32] RECOVERY - MariaDB Slave Lag: s7 on db1062 is OK: OK slave_sql_lag Replication lag: 0.19 seconds [14:48:33] confirmed anon edit -> no readonly box [14:48:40] RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 0.03 seconds [14:48:43] ori: confirmed no more alert on edit [14:48:48] !log sites are read-write again [14:48:50] PROBLEM - MariaDB Slave Lag: x1 on db2008 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 626.99 seconds [14:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:48:54] actually, no, db1029 replica is broken (not a blocker) [14:49:00] RECOVERY - MariaDB Slave Lag: s7 on db2040 is OK: OK slave_sql_lag Replication lag: 0.34 seconds [14:49:00] RECOVERY - MariaDB Slave Lag: s7 on db1034 is OK: OK slave_sql_lag Replication lag: 0.19 seconds [14:49:01] PROBLEM - MariaDB Slave Lag: x1 on db2009 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 641.42 seconds [14:49:15] perf save times are coming in, so users are saving [14:49:18] seeing rc changes on irc again for enwiki [14:49:20] so ~45 mins readonly [14:49:28] RECOVERY - MariaDB Slave Lag: s7 on db2054 is OK: OK slave_sql_lag Replication lag: 0.35 seconds [14:49:33] I can indeed save! [14:49:34] 48 :) [14:49:43] _joe_: what is the issue you mentioned? 
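The graphite1001 alerts quoted throughout this log ("CRITICAL: 40.00% of data above the critical threshold [1000.0]") fire on the share of recent datapoints above a fixed threshold, not on any single sample, which is why short 5xx spikes flap between PROBLEM and RECOVERY. A rough reimplementation of that classification; the datapoints and percentages below are invented and the production plugin differs in detail:

```python
def classify(datapoints, threshold, crit_pct, warn_pct):
    """Return CRITICAL/WARNING/OK based on the percentage of datapoints
    above `threshold`, mirroring messages like
    'CRITICAL: 40.00% of data above the critical threshold [1000.0]'."""
    if not datapoints:
        return "OK"  # real checks typically treat missing data specially
    above = sum(1 for v in datapoints if v > threshold)
    pct = 100.0 * above / len(datapoints)
    if pct >= crit_pct:
        return "CRITICAL"
    if pct >= warn_pct:
        return "WARNING"
    return "OK"

# Hypothetical 5xx-per-minute samples against the [1000.0] threshold:
assert classify([1200, 1500, 300, 200, 900], 1000.0, 40.0, 20.0) == "CRITICAL"
assert classify([100, 200, 150, 120, 90], 1000.0, 40.0, 20.0) == "OK"
```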
[14:49:49] 5xx is normal-ish so far [14:49:50] 47 [14:49:50] so, ongoing issues I have: x1- broken replica to eqiad [14:49:51] PROBLEM - MySQL Replication Heartbeat on db1029 is CRITICAL: NRPE: Unable to read output [14:49:51] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: NRPE: Unable to read output [14:49:58] RECOVERY - MariaDB Slave Lag: s5 on db2038 is OK: OK slave_sql_lag Replication lag: 0.45 seconds [14:49:58] PROBLEM - MariaDB Slave Lag: s4 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 805.70 seconds [14:50:08] <_joe_> paravoid: a hand-written config on rdb2003 [14:50:09] some potential improvements on weights [14:50:13] <_joe_> which broke replica [14:50:19] RECOVERY - MariaDB Slave Lag: s7 on db1028 is OK: OK slave_sql_lag Replication lag: 0.16 seconds [14:50:24] argh [14:50:38] RECOVERY - MariaDB Slave Lag: s7 on db1039 is OK: OK slave_sql_lag Replication lag: 0.23 seconds [14:50:43] verified a VE edit. [14:50:45] RECOVERY - MariaDB Slave Lag: s5 on db2023 is OK: OK slave_sql_lag Replication lag: 0.48 seconds [14:50:45] (03CR) 10Chad: "@qchris: Yeah I knew it was gonna be a lengthy one. Probably should do it late on a Friday." [debs/gerrit] - 10https://gerrit.wikimedia.org/r/263631 (owner: 10Chad) [14:50:55] RECOVERY - MariaDB Slave Lag: s5 on db2066 is OK: OK slave_sql_lag Replication lag: 0.29 seconds [14:51:03] RECOVERY - MariaDB Slave Lag: s7 on db2029 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [14:51:03] RECOVERY - MariaDB Slave Lag: s7 on db2068 is OK: OK slave_sql_lag Replication lag: 0.13 seconds [14:51:05] puppet execution was underestimated [14:51:16] I didn't underestimate it :P [14:51:26] RECOVERY - MariaDB Slave Lag: s5 on db1026 is OK: OK slave_sql_lag Replication lag: 0.15 seconds [14:51:36] _joe_: are you dealing with it? do you need any help?
[14:51:45] <_joe_> paravoid: me and ori are [14:51:50] mark, with orchestration out of puppet I would not have had this issue [14:51:56] jynus: I'm taking a closer look at all DBs with loadavg > 10, just a few [14:52:02] RECOVERY - MariaDB Slave Lag: s5 on db1045 is OK: OK slave_sql_lag Replication lag: 0.24 seconds [14:52:03] RECOVERY - MariaDB Slave Lag: s5 on db1071 is OK: OK slave_sql_lag Replication lag: 0.06 seconds [14:52:08] yes, there are some overloaded [14:52:10] RECOVERY - MariaDB Slave Lag: s5 on db1049 is OK: OK slave_sql_lag Replication lag: 0.32 seconds [14:52:10] RECOVERY - MariaDB Slave Lag: s5 on db2045 is OK: OK slave_sql_lag Replication lag: 0.35 seconds [14:52:20] RECOVERY - MariaDB Slave Lag: s5 on db1070 is OK: OK slave_sql_lag Replication lag: 0.17 seconds [14:52:29] sorry for the spam [14:52:29] <_joe_> so, going on [14:52:30] PROBLEM - MySQL Replication Heartbeat on db1001 is CRITICAL: NRPE: Unable to read output [14:52:31] RECOVERY - MariaDB Slave Lag: s5 on db2052 is OK: OK slave_sql_lag Replication lag: 0.13 seconds [14:52:31] RECOVERY - MariaDB Slave Lag: s5 on db2059 is OK: OK slave_sql_lag Replication lag: 0.29 seconds [14:52:33] but it should not have paged [14:52:33] _joe_: please use this channel to coordinate :) [14:52:46] <_joe_> paravoid: yeah we were just checking errors in query [14:52:53] <_joe_> I am ready to go on [14:53:04] is that issue dealt with? [14:53:18] (03PS1) 10Giuseppe Lavagetto: switchover: enable maintenance scripts in codfw [puppet] - 10https://gerrit.wikimedia.org/r/284195 [14:53:19] <_joe_> yes [14:53:20] PROBLEM - MariaDB Slave Lag: x1 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 899.61 seconds [14:53:22] ack [14:53:26] should i start the job queue in codfw? [14:53:28] proceed with #13/#14 [14:53:28] are maintenance scripts running? [14:53:34] <_joe_> jynus: in a minute [14:53:41] seems we have an issue with wdqs-updater not being able to update latest edits from wikidata. Not blocking...
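The broken replicas being tracked above (db1029, the x1 breakages) surface through `SHOW SLAVE STATUS`: `Slave_IO_Running`, `Slave_SQL_Running`, and `Last_Error`. A sketch of the classification an Icinga-style check performs on those fields; this is an illustration, not the actual production plugin:

```python
def slave_health(io_running, sql_running, last_error=""):
    """Classify a replica from SHOW SLAVE STATUS fields, in the spirit of
    alerts like 'CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No'."""
    if io_running == "Yes" and sql_running == "Yes":
        return "OK"
    detail = " Last_Error: " + last_error if last_error else ""
    return ("CRIT replication Slave_IO_Running: %s Slave_SQL_Running: %s%s"
            % (io_running, sql_running, detail))

assert slave_health("Yes", "Yes") == "OK"
# SQL thread stopped on a row event, like the db1029 breakage:
assert slave_health("Yes", "No", "Error executing row event").startswith("CRIT")
```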
[14:53:48] _joe_: i'll do 13, you do 14, yes? [14:53:48] _joe_, can you hold them for a second? [14:53:54] <_joe_> yes [14:54:01] let's hold this off per jynus [14:54:03] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] switchover: enable maintenance scripts in codfw [puppet] - 10https://gerrit.wikimedia.org/r/284195 (owner: 10Giuseppe Lavagetto) [14:54:07] volans and I need to check weights [14:54:08] <_joe_> oh [14:54:12] PROBLEM - MySQL Slave Running on db1029 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error executing row event: Cannot execute statement: impossible to w [14:54:14] 16:53:17 Special:RecentChanges is not refreshing (feedback from en.wp and es.wp [14:54:14] 16:53:18 ) [14:54:15] (03PS1) 10Ori.livneh: switchover: make jobrunners in codfw start up [puppet] - 10https://gerrit.wikimedia.org/r/284196 [14:54:17] just don't puppet-merge [14:54:18] I may need some adjustments [14:54:22] ori: ^^ [14:54:22] *it [14:54:27] holding off, ack [14:54:39] <_joe_> mark: that's because scripts are not running I think [14:55:07] _joe_: indeed [14:55:10] PROBLEM - Redis status tcp_6380 on rdb1004 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.64.16.183 on port 6380 [14:55:18] <_joe_> uhm [14:55:19] huh? [14:55:20] <_joe_> checking [14:55:35] #en.wikipedia stream is only showing abusefilter logs, not edits [14:55:50] Krenair: that should be the same issue as Special:RC not populating [14:55:54] more feedback: file deletion is apparently throwing exceptions? (i recall some bug about this filed yesterday, so might not be new) [14:55:57] probably [14:56:17] <_joe_> Krenair: that could be the jobqueue [14:56:26] <_joe_> Krenair: rcstream works correctly? [14:56:29] <_joe_> can you test it [14:56:34] <_joe_> ? [14:56:46] godog: can you look at file deletions?
[14:56:58] probably also the jobqueue [14:57:01] uploads get an exception too [14:57:01] [VxZHKQrAIE0AAIq6MWkAAABJ] /wiki/Special:Upload JobQueueError from line 200 of /srv/mediawiki/php-1.27.0-wmf.21/includes/jobqueue/JobQueueFederated.php: Could not insert job(s), 5 partitions tried. [14:57:10] mark: yup, checking [14:57:11] RECOVERY - Redis status tcp_6380 on rdb1004 is OK: OK: REDIS on 10.64.16.183:6380 has 1 databases (db0) with 9703267 keys - replication_delay is 8 [14:57:13] _joe_, no, I see only logs [14:57:16] might be jobqueue indeed [14:57:23] <_joe_> ori: ^^ [14:57:28] <_joe_> could not insert jobs [14:57:35] the aggregators aren't running [14:57:41] we should start those [14:57:43] <_joe_> ok [14:57:49] jynus: status? [14:57:52] #13/#14 for maint/jobqueue still on hold pending jynus [14:57:54] <_joe_> what is the aggregator? jobchron? [14:58:05] (03PS1) 10Jcrespo: Tweak db weights after switchover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284197 [14:58:13] I need to apply this^ [14:58:20] do it [14:58:30] ack [14:58:48] (03CR) 10Jcrespo: [C: 032] Tweak db weights after switchover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284197 (owner: 10Jcrespo) [14:58:54] there are some databases floping [14:59:01] due to excessive traffic [14:59:15] (03CR) 10Giuseppe Lavagetto: [C: 031] switchover: make jobrunners in codfw start up [puppet] - 10https://gerrit.wikimedia.org/r/284196 (owner: 10Ori.livneh) [14:59:43] MatmaRex: yeah sth like this I think? https://phabricator.wikimedia.org/T131769 [14:59:47] godog: same problem as https://phabricator.wikimedia.org/T132921 ? that was filed before the switchover [14:59:59] huh. 
more dupes [15:00:07] :( [15:00:10] 06Operations, 06Commons, 10MediaWiki-Page-deletion: Unable to delete file pages on commons: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2218083 (10matmarex) [15:00:12] !log applying database weight changes [15:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:00:27] !log jynus@tin Synchronized wmf-config/db-codfw.php: db weight tweaking to better process the load (duration: 00m 28s) [15:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:00:32] jynus: can I go ahead? [15:00:37] jynus: shall we move on with #13/#14 or are you still investigating? [15:00:41] ori, paravoid please go ahead [15:00:43] <_joe_> ok [15:00:50] (03CR) 10Ori.livneh: [C: 032 V: 032] switchover: make jobrunners in codfw start up [puppet] - 10https://gerrit.wikimedia.org/r/284196 (owner: 10Ori.livneh) [15:00:56] jynus: s4, s6, es2 and es3 still have alerts on icinga for replication lag for codfw [15:00:59] <_joe_> ori: I'll puppet-merge [15:01:11] volans, checking [15:01:15] _joe_: ack [15:01:16] wikis are still user-visibly broken [15:01:23] Krenair: how so?
[15:01:27] paravoid, no RC [15:01:33] yeah ok [15:01:47] <_joe_> ori: merged [15:02:19] PURGE ramping in as expected from jobrunners running already [15:02:46] <_joe_> !log [switchover #13] starting maintenace jobs [15:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:03:03] RECOVERY - MariaDB Slave Lag: es3 on es1019 is OK: OK slave_sql_lag Replication lag: 0.16 seconds [15:03:21] RECOVERY - MariaDB Slave Lag: es3 on es2018 is OK: OK slave_sql_lag Replication lag: 0.29 seconds [15:03:22] (03PS1) 10Chad: Use legacy key exchanges on yurud, like antimony [puppet] - 10https://gerrit.wikimedia.org/r/284200 (https://phabricator.wikimedia.org/T123718) [15:03:32] (not at full usual volume yet, but moving the right direction) [15:03:39] RECOVERY - MariaDB Slave Lag: s4 on db2037 is OK: OK slave_sql_lag Replication lag: 2.08 seconds [15:04:01] RECOVERY - MariaDB Slave Lag: s6 on db1037 is OK: OK slave_sql_lag Replication lag: 1.13 seconds [15:04:03] volans, jynus I see a bunch of "Deadlock found when trying to get lock; try restarting transaction (10.192.0.12)" from ContentTranslation on the DB error log [15:04:12] RECOVERY - MariaDB Slave Lag: s4 on db1042 is OK: OK slave_sql_lag Replication lag: 0.15 seconds [15:04:12] RECOVERY - MariaDB Slave Lag: es3 on es1017 is OK: OK slave_sql_lag Replication lag: 0.13 seconds [15:04:15] MatmaRex: those takss you linked re deleting files are different to what is happening now [15:04:21] RECOVERY - MariaDB Slave Lag: s4 on db1068 is OK: OK slave_sql_lag Replication lag: 0.17 seconds [15:04:21] RECOVERY - MariaDB Slave Lag: es3 on es1014 is OK: OK slave_sql_lag Replication lag: 0.38 seconds [15:04:21] RECOVERY - MariaDB Slave Lag: s6 on db1022 is OK: OK slave_sql_lag Replication lag: 0.19 seconds [15:04:30] RECOVERY - MariaDB Slave Lag: s6 on db2028 is OK: OK slave_sql_lag Replication lag: 0.29 seconds [15:04:33] 06Operations, 06Commons, 10MediaWiki-Page-deletion: Unable to delete file pages 
on commons: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2214186 (10Bawolff) >"Unable to delete file pages on commons: "Could not acquire lock"". Locks use redis, probably related to switchover. [15:04:33] paravoid, that is "normal" [15:04:37] alright [15:04:39] I still don't see edits in RC, and I am not sure I understand why. [15:04:42] RECOVERY - MariaDB Slave Lag: s6 on db1030 is OK: OK slave_sql_lag Replication lag: 0.37 seconds [15:04:50] RECOVERY - MariaDB Slave Lag: s4 on db2019 is OK: OK slave_sql_lag Replication lag: 0.47 seconds [15:04:53] indeed [15:05:00] ok, puppet was stuck, probably my fault [15:05:09] RECOVERY - MariaDB Slave Lag: s6 on db2039 is OK: OK slave_sql_lag Replication lag: 0.09 seconds [15:05:18] RECOVERY - MariaDB Slave Lag: es3 on es2017 is OK: OK slave_sql_lag Replication lag: 0.63 seconds [15:05:19] does the same hold for stream.wikimedia.org etc? [15:05:19] ori, it could be one of those dbs that handle recent changes [15:05:23] yes mark [15:05:28] which wiki, ori? [15:05:33] Krinkle, FYI, RC is down but editing is allowed [15:05:35] en (https://en.wikipedia.org/wiki/Special:RecentChanges) [15:05:38] RECOVERY - MariaDB Slave Lag: s4 on db1059 is OK: OK slave_sql_lag Replication lag: 0.04 seconds [15:05:40] <_joe_> ori: we're not enqueueing jobs it seems https://grafana.wikimedia.org/dashboard/db/job-queue-health [15:05:46] 06Operations, 06Commons, 10MediaWiki-Page-deletion: Unable to delete file pages on commons: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2214186 (10Addshore) >>! In T132921#2218096, @Bawolff wrote: >>"Unable to delete file pages on commons: "Could not acquire lock"". > > Locks use re...
[15:05:47] RECOVERY - MariaDB Slave Lag: s4 on db2044 is OK: OK slave_sql_lag Replication lag: 0.17 seconds [15:05:55] RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 0.45 seconds [15:06:03] RECOVERY - MariaDB Slave Lag: s4 on db2065 is OK: OK slave_sql_lag Replication lag: 0.49 seconds [15:06:03] RECOVERY - MariaDB Slave Lag: s4 on db1056 is OK: OK slave_sql_lag Replication lag: 0.24 seconds [15:06:03] RECOVERY - MariaDB Slave Lag: s6 on db1050 is OK: OK slave_sql_lag Replication lag: 0.33 seconds [15:06:03] RECOVERY - MariaDB Slave Lag: s4 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 0.28 seconds [15:06:18] 06Operations, 06Commons, 10MediaWiki-Page-deletion: Unable to delete file pages on commons: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2218103 (10matmarex) Switchover is today, these go as far back as April 4 (T131769, should be duped). [15:06:22] RECOVERY - MariaDB Slave Lag: s6 on db2060 is OK: OK slave_sql_lag Replication lag: 0.21 seconds [15:06:23] RECOVERY - MariaDB Slave Lag: s4 on db1019 is OK: OK slave_sql_lag Replication lag: 0.46 seconds [15:06:26] <_joe_> but well, that graph is clearly broken [15:06:31] RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 0.45 seconds [15:06:42] <_joe_> so there is some problem with locks on redis for images? [15:06:42] AaronSchulz: are you around? [15:06:48] <_joe_> yeah we need aaron [15:06:52] RECOVERY - MariaDB Slave Lag: s6 on db1061 is OK: OK slave_sql_lag Replication lag: 0.47 seconds [15:06:52] RECOVERY - MariaDB Slave Lag: s4 on db1064 is OK: OK slave_sql_lag Replication lag: 0.40 seconds [15:06:53] _joe_: yes, but unrelated to switchover [15:06:55] well, my PURGE volume still isn't up to speed either, which is also jobq-driven mostly.
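The upload failures quoted earlier ("JobQueueError ... Could not insert job(s), 5 partitions tried") come from JobQueueFederated, which spreads jobs over several redis partitions and falls back to the remaining partitions when an insert fails; with the codfw redis hostnames mistyped, every partition was unreachable at once. A loose sketch of that fallback behaviour, not MediaWiki's actual code; the partition names and push callback here are invented:

```python
import random

class JobQueueError(Exception):
    pass

def push_federated(job, partitions, try_push):
    """Pick a partition and, on failure, fall back to the others before
    giving up with 'Could not insert job(s), N partitions tried.'"""
    order = list(partitions)
    random.shuffle(order)  # the real code weights partitions; shuffle is a stand-in
    for partition in order:
        try:
            try_push(partition, job)
            return partition
        except ConnectionError:
            continue  # partition down (e.g. a misspelled redis hostname)
    raise JobQueueError("Could not insert job(s), %d partitions tried." % len(order))

def fake_push(partition, job):
    # Toy setup: only one partition is reachable.
    if partition != "rdb2005":
        raise ConnectionError(partition)

assert push_federated({"type": "refreshLinks"},
                      ["rdb2001", "rdb2003", "rdb2005"], fake_push) == "rdb2005"
```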
it started to ramp up, but then didn't really [15:07:00] _joe_: looks like predating the switchover [15:07:00] RECOVERY - MariaDB Slave Lag: s6 on db2067 is OK: OK slave_sql_lag Replication lag: 0.33 seconds [15:07:05] hhvm logs are full of redis connection errors -- https://logstash.wikimedia.org/#/dashboard/elasticsearch/hhvm [15:07:08] RECOVERY - MariaDB Slave Lag: es3 on es2019 is OK: OK slave_sql_lag Replication lag: 0.13 seconds [15:07:09] _joe_: but the JobQueueErrors are related to switchover [15:07:12] separate issues [15:07:29] <_joe_> rdb2005.eqiad? [15:07:30] RECOVERY - MariaDB Slave Lag: s6 on db2053 is OK: OK slave_sql_lag Replication lag: 0.13 seconds [15:07:32] <_joe_> who did that [15:07:35] <_joe_> me probably [15:07:36] <_joe_> idiot [15:07:39] RECOVERY - MariaDB Slave Lag: s6 on db2046 is OK: OK slave_sql_lag Replication lag: 0.40 seconds [15:07:40] haha [15:07:51] yyup [15:08:00] $wmfAllServices['codfw']['jobqueue_redis'] = array( [15:08:02] is all broken [15:08:10] <_joe_> paravoid: fixing now [15:08:13] alright [15:08:20] RECOVERY - MariaDB Slave Lag: es2 on es2016 is OK: OK slave_sql_lag Replication lag: 0.49 seconds [15:08:31] RECOVERY - MariaDB Slave Lag: es2 on es1011 is OK: OK slave_sql_lag Replication lag: 0.14 seconds [15:08:42] _joe_: $wmfAllServices['codfw']['jobqueue_aggregator'] = array( [15:08:44] too [15:09:07] volans, what priorities do you see regarding dbs? [15:09:11] RECOVERY - MariaDB Slave Lag: es2 on es1015 is OK: OK slave_sql_lag Replication lag: 0.02 seconds [15:09:29] RECOVERY - MariaDB Slave Lag: es2 on es2015 is OK: OK slave_sql_lag Replication lag: 0.15 seconds [15:09:34] (03PS1) 10Giuseppe Lavagetto: Fix codfw redis hostnames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284201 [15:09:35] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: WDQS stopped updating during datacenter switch - https://phabricator.wikimedia.org/T133046#2218112 (10Gehel) [15:09:43] <_joe_> paravoid: ^^ a quick look? 
[15:09:49] RECOVERY - MariaDB Slave Lag: es2 on es2014 is OK: OK slave_sql_lag Replication lag: 0.36 seconds [15:09:50] RECOVERY - MariaDB Slave Lag: es2 on es1013 is OK: OK slave_sql_lag Replication lag: 0.41 seconds [15:09:59] looks good [15:10:03] (03CR) 10BBlack: [C: 031] Fix codfw redis hostnames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284201 (owner: 10Giuseppe Lavagetto) [15:10:07] already syncing live-hacked fix [15:10:14] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix codfw redis hostnames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284201 (owner: 10Giuseppe Lavagetto) [15:10:16] You fixed RC, ori? [15:10:16] identical to change [15:10:17] <_joe_> ori: aha [15:10:18] !log ori@tin Synchronized wmf-config/ProductionServices.php: live-hack fix for rdb2*.eqiad (duration: 00m 34s) [15:10:22] <_joe_> Krenair: yes [15:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:10:27] jynus: a tail of all error logs for anomalies, checking tendril for performances and loads on the hosts, keeping an eye on icinga [15:10:30] rc is back [15:10:32] purges should be too [15:10:36] gehel: the wqds breakage is probably related to the RC breakage [15:10:38] confirmed [15:10:46] volans, I think x1 is a hot spot, there are multiple replicaion breakages [15:10:53] <_joe_> I had serveral reviewers here.... [15:10:56] <_joe_> bd808: thanks [15:11:13] oooh, got this interesting one, may be related to the switchover, https://www.wikidata.org/wiki/Q23889824 Exception encountered, of type "BadMethodCallException" [15:11:15] FWIW: git grep -E '[a-z]+2[0-9]+\.eqiad' on mediawiki-config says no such other errant hostnames [15:11:25] we should get a CI check for "hostname2xx.eqiad" etc [15:11:30] I was looking at it just now from icinga [15:11:36] volans, I think most of the current alerts are due to the old masters [15:11:38] <_joe_> bd808: what is Warning: Cannot modify header information - headers already sent in /srv/me... 
[15:11:44] <_joe_> there are a ton of those too [15:11:48] <_joe_> they sound dangerous [15:11:49] argon is also back broadcasting rc changes [15:11:52] mark: Should be easy, filing a task. [15:11:54] !log testing the log by logging a test [15:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:12:04] <_joe_> like someone leaving a blank space somewhere [15:12:08] paravoid: thanks, I'm still trying to understand where the wdqs updates come from... but it seems that wdqs-updater does polling, not RCStream... [15:12:11] _joe_: something trying to set a cookie late I would guess, but let me look [15:12:13] ori, so is RC going to be repopulated? [15:12:17] trying flow [15:12:26] jynus: apart x1 yes, all the alarms are on eqiad for DBs [15:12:32] Krenair: probably not [15:12:47] I am aware that that is an issue [15:12:52] ori, so what are we going to do about any vandalism occurring between editing being reallowed and RC being fixed? [15:12:54] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: WDQS stopped updating during datacenter switch - https://phabricator.wikimedia.org/T133046#2218144 (10Gehel) Might be related to RCStream breaking during the switch. [15:13:11] <_joe_> Krenair: we're still looking at things now [15:13:17] come across it and fix it [15:13:22] volans, due to puppet delay, pt-heartbeat was not executed properly just after the kill, but that did not cause user problems this time [15:13:25] it seems RB started getting some reqs from the job queue [15:13:29] <_joe_> can we leave discussions for later, please? [15:13:32] yeah, jq is back [15:13:33] there was always going to be such a window in the procedure, it just went a little longer than expected [15:13:38] <_joe_> mobrovac: yeah the problem was the misconfiguration [15:13:44] kk [15:13:47] <_joe_> bblack: not longer than I expected [15:13:51] I have successfully deleted spam [15:13:52] * AaronSchulz reads backscroll [15:13:55] Great work!!!!! 
[15:14:07] so, done with MW? [15:14:11] shall we move on to traffic? [15:14:16] <_joe_> paravoid: I think so yes [15:14:17] 06Operations, 10Continuous-Integration-Config, 06Release-Engineering-Team: Write a test to check for clearly bogus hostnames - https://phabricator.wikimedia.org/T133047#2218145 (10demon) [15:14:21] _joe_: the function that is erroring is WebResponse::header(). Will need to find stacktrace to figure out what header is trying to go out. [15:14:24] do we want to further confirm no need to rollback on any MW before moving traffic? [15:14:33] no need to rush into traffic I think? [15:14:37] <_joe_> bd808: I think it stopped [15:14:44] AaronSchulz: tl;dr: codfw aggregators were "rdb200x.eqiad" instead of "rdb200x.codfw", just a typo. but if there's a way to generate RC events for revisions created during that time that would be good. [15:15:00] there is rebuildrecentchanges.php [15:15:09] I would like to test Recent changes on several wikis [15:15:15] it takes several hours [15:15:18] <_joe_> AaronSchulz: and you did review that change :P [15:15:23] it is usually a cause of problems for my service [15:15:24] <_joe_> so it's on me and on you [15:15:59] _joe_: *nod* it could have been a result of the redis problem. If MW was trying to send an error response after it had already spit out the page that error would happen [15:16:10] <_joe_> bd808: I think it is, yes [15:16:12] it's not on anyone, blame isn't useful or nice. we just need more safeguards for that next time [15:16:13] to be clear, the discussed traffic steps are in https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Specifics_for_Switchover_Test_Week - this moves user routing so the last-mile isn't eqiad-varnish -> codfw-apps, and gets users off of eqiad frontends too. 
[15:16:27] <_joe_> ori: I'm just joking, I thought it was clear [15:16:32] ok :) [15:16:47] I saw the pile of errors and "eqiad" didn't jump out at me either [15:16:56] bblack: I think that's good to go from MW's perspective, I can't imagine we'd roll back now [15:16:57] <_joe_> the :P was just to avoid that kind of confusion :) [15:17:11] Krenair: ori: rebuildrecentchanges starts with a DELETE * FROM, though. [15:17:13] <_joe_> ori: let's wait for everyone to confirm we're ok [15:17:22] volans, I think you may have killed heartbeat on db1001, which is misc and not part of the failover, can you confirm (if not, it is a bug) [15:17:29] and it's kind of lame. loses patrolling information, for example [15:17:35] jynus: double checking [15:17:37] (not an issue, either) [15:17:38] so just running it might do a disservice [15:17:40] yeah, I wouldn't risk it [15:17:50] edit.success in graphite now close to pre-switchover levels [15:17:52] deletes won't make jynus happy either [15:17:53] i agree it would be good to repopulate if possible. but we'd need to tweak it [15:17:58] volans, if you killed it, it is good news [15:17:58] <_joe_> btw, we've dropped the jobqueue repeateldy in the past [15:18:04] godog: great idea to check that, thanks [15:18:09] (i guess to just rebuild in the affected time range, rather than all) [15:18:21] jynus: no I didn't [15:18:23] <_joe_> as in, it crashed and the redis db was corrupted [15:18:25] parsoid codfw cluster is slowing getting back to "normal" from 0% .. courtesy jobrunners. grafana oldid wt2html rates going back up to previous levels. verified ve edit rc stream on enwiki and itwiki .. so, all good from the parsoid side. [15:18:26] mmm [15:18:26] <_joe_> before ori fixed it [15:18:27] sorry [15:18:34] then puppet killed it :-/ [15:18:39] subbu: thanks subbu [15:18:46] <_joe_> subbu: great! 
[15:19:04] ori: np, also added a learning/question on why it never dropped to zero [15:19:20] ori, bblack: the only reason that I can think of for holding traffic a little while longer would be performance metrics [15:19:27] statsd buffers [15:19:28] as in, if we want to measure the two events independently [15:20:19] I just want to be sure there's no lingering issues with the final bits (maint/jq) that are going to cause us to want to undo the switch [15:20:22] <_joe_> the queue became way larger, me might need to check it if it doesn't reduce in reasonable times [15:20:24] if we're confident on that, I'm ok [15:20:28] i went into this with the mindframe that performance takes a back seat to correctness / availability, so my preference would be to stick with the process and run separate controlled experiments if we want to find out more about the impact of these routing changes [15:20:30] I doubt we'd undo at this point [15:20:37] <_joe_> bblack: as far as maint is concerned, there is no issue [15:20:47] ok [15:21:03] i'll be back in <5m [15:21:06] <_joe_> jobqueue seems ok to me, but I'd like confirmation on changes [15:21:14] I am currently happy, given some issue, the only reason I would rollback now is if there were issues with redis/swift [15:21:16] <_joe_> honestly, I'd wait for wikidata sync to recover [15:21:52] wikidata dispatch lag is heading down now :) [15:22:20] analytics1052 has lots of issues in icinga, may be failing, probably unrelated? [15:22:23] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1677 bytes in 0.194 second response time [15:22:36] <_joe_> oh shit [15:22:46] and mira still unmerged changes in mediawiki_config [15:22:49] <_joe_> did we merge wait [15:22:52] PROBLEM - configured eth on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:23:00] _joe_: elaborate? 
[15:23:07] <_joe_> sorry false alarm [15:23:11] PROBLEM - Check size of conntrack table on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:23:11] PROBLEM - Disk space on Hadoop worker on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:23:11] PROBLEM - puppet last run on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:23:12] i merged on tin but did not sync because it's a no-op, identical to what i deployed before [15:23:19] i'll sync it anyway to quiet mira [15:23:22] PROBLEM - DPKG on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:23:35] <_joe_> I could not find trace of $wmfMasterDatacenter = 'codfw'; in grepping mediawiki-config [15:23:38] PROBLEM - Hadoop JournalNode on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:23:41] PROBLEM - Hadoop NodeManager on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:23:43] <_joe_> and I was just on the wrong branch [15:23:53] PROBLEM - salt-minion processes on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:24:01] PROBLEM - Hadoop DataNode on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:24:01] PROBLEM - RAID on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:24:10] while I'm running through icingas: pc2005 has a disk space warning since ~2h ago [15:24:25] I wonder if analytics1052 is related to the wake of the switchover [15:24:43] PROBLEM - Disk space on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:24:51] PROBLEM - dhclient process on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:24:53] PROBLEM - YARN NodeManager Node-State on analytics1052 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:25:03] (03PS1) 10Elukey: Add the possibility to set an external database for Hue. 
[puppet/cdh] - 10https://gerrit.wikimedia.org/r/284204 (https://phabricator.wikimedia.org/T127990) [15:25:06] !log ori@tin Synchronized wmf-config/ProductionServices.php: Iee2e08df5: Fix codfw redis hostnames [no-op, already synced as live hack] (duration: 00m 36s) [15:25:09] bblack: known, no problem there, thanks (in the sense that we'll clean up some stuff later) [15:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:25:31] checking analytics1052 [15:26:15] paravoid: ok so, move on traffic? [15:26:43] RECOVERY - Disk space on analytics1052 is OK: DISK OK [15:26:52] RECOVERY - dhclient process on analytics1052 is OK: PROCS OK: 0 processes with command name dhclient [15:27:01] RECOVERY - YARN NodeManager Node-State on analytics1052 is OK: OK: YARN NodeManager analytics1052.eqiad.wmnet:8041 Node-State: RUNNING [15:27:03] didn't do anything, wasn't able to ssh, temp glitch? [15:27:12] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 05codfw-rollout: WDQS stopped updating during datacenter switch - https://phabricator.wikimedia.org/T133046#2218221 (10Addshore) [15:27:13] RECOVERY - configured eth on analytics1052 is OK: OK - interfaces up [15:27:30] bblack: yes please [15:27:31] <_joe_> paravoid, bblack I'd just love to see the wikidata alert come back [15:27:32] RECOVERY - Disk space on Hadoop worker on analytics1052 is OK: DISK OK [15:27:32] RECOVERY - Check size of conntrack table on analytics1052 is OK: OK: nf_conntrack is 0 % full [15:27:32] RECOVERY - puppet last run on analytics1052 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [15:27:42] <_joe_> but nothing to interfere with traffic [15:27:42] RECOVERY - DPKG on analytics1052 is OK: All packages OK [15:27:55] weird [15:28:02] RECOVERY - Hadoop JournalNode on analytics1052 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode [15:28:13] RECOVERY - Hadoop NodeManager 
on analytics1052 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:28:14] _joe_: wikidata lag alert did go away [15:28:21] <_joe_> bblack: yeah just saw [15:28:22] RECOVERY - salt-minion processes on analytics1052 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:28:23] RECOVERY - Hadoop DataNode on analytics1052 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [15:28:23] RECOVERY - RAID on analytics1052 is OK: OK: optimal, 13 logical, 14 physical [15:28:32] <_joe_> gehel: wdqs still not updating? [15:28:36] elukey: analytics1052 had very high iowait in ganglia while it was dead in icinga [15:28:38] _joe_, bblack: I acked the WDQS alert [15:28:39] <_joe_> I guess it should be ok now [15:28:41] looks like icinga glitch for an52? uptime 34 days [15:28:53] hacker news is downish, silicon valley in crisis [15:28:57] but yes, it is back to normal [15:29:20] ottomata: nope. look at https://ganglia.wikimedia.org/latest/?c=Analytics%20cluster%20eqiad&h=analytics1052.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [15:29:22] <_joe_> ori: we'll discover they were hosting their website on the rc feeds [15:29:22] (03PS3) 10BBlack: codfw switch: codfw text caches -> direct [puppet] - 10https://gerrit.wikimedia.org/r/283430 [15:29:30] there was clearly something going on [15:30:08] 06Operations, 10ops-codfw: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2218234 (10Papaul) |**server name **|**rack location **| |restbase2001|B5| |restbase2002|B8| |restbase2003|C1| |restbase2004|C5| |restbase2005|D1| |restbase2006|D5| layout option |**server name **|**rack... [15:30:20] jynus: for x1 how you want to proceed? 
we have ROW binlog format on db2009 [15:30:22] (03CR) 10BBlack: [C: 032 V: 032] codfw switch: codfw text caches -> direct [puppet] - 10https://gerrit.wikimedia.org/r/283430 (owner: 10BBlack) [15:30:27] !log [traffic codfw switch #1] - puppet merging text caches -> direct [15:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:30:42] is the within-codfw working, right? [15:30:51] yes [15:31:04] !log [traffic codfw switch #1] - salting puppet change [15:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:31:20] volans, because at this point, I would upgrade the old master (aka failover to a 10 slave) [15:31:53] let's make a plan of things we have to do in these 48 hours [15:32:03] but this is oftopic from here [15:32:32] yes we can continue on -databases [15:32:34] it should be part of the eqiad maintenance [15:32:48] let me first give a general check to codfw [15:33:08] and maybe taking a break wehn things stabilize [15:33:23] sure [15:34:05] (03PS3) 10BBlack: codfw switch: geodns depool text services from eqiad [dns] - 10https://gerrit.wikimedia.org/r/283433 [15:34:14] !log [traffic codfw switch #1] - puppet change complete - done [15:34:18] sorry I was a bit disconnected, what is the current status, mediawiki ok, pending traffic? 
[15:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:34:37] jynus: correct [15:34:52] RECOVERY - High lag on wdqs1001 is OK: OK: Less than 30.00% above the threshold [600.0] [15:35:10] (03CR) 10BBlack: [C: 032] codfw switch: geodns depool text services from eqiad [dns] - 10https://gerrit.wikimedia.org/r/283433 (owner: 10BBlack) [15:35:11] <_joe_> gehel: :) [15:35:54] !log [traffic codfw switch #2] - authdns-update complete, user traffic to eqiad frontends should start dropping off now [15:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:36:00] _joe_: I have to admit I did not understand everything... [15:36:15] (03PS3) 10BBlack: codfw switch: esams text caches -> codfw [puppet] - 10https://gerrit.wikimedia.org/r/283431 [15:36:19] <_joe_> gehel: it's probably fed by the jobqueue or a maintenance script [15:36:27] <_joe_> which we stopped during the switchover [15:36:41] gehel: _joe_ yeh I think that is the case (if your talking about WDQS) [15:36:45] _joe_: looking at the code, it seems to call api.php ... [15:36:46] 06Operations, 06Commons, 10MediaWiki-Page-deletion: Unable to delete file pages on commons: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2218251 (10matmarex) [15:36:53] bblack: shoot if you need more eyes and/or hands :) [15:37:03] <_joe_> yeah that ^^ [15:37:05] <_joe_> :) [15:37:09] gehel: oohh, what in api.php? 
;) [15:37:30] so far all smooth - keep in mind all of this is pre-tested for other clusters/scenarios :) [15:37:54] addshore: something similar to curl -v -s https://www.wikidata.org/w/api.php?format=json\&action=query\&list=recentchanges\&rcdir=newer\&rcprop=title\|ids\|timestamp\&rclimit=10\&rcstart=20160404000000 [15:38:07] (03CR) 10BBlack: [C: 032 V: 032] codfw switch: esams text caches -> codfw [puppet] - 10https://gerrit.wikimedia.org/r/283431 (owner: 10BBlack) [15:38:25] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage: Unable to delete file pages on commons: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2214186 (10matmarex) [15:38:32] oooh, _joe_ recentchanges wouldn't actually be populated for the period it was broken right? [15:38:35] !log [traffic codfw switch #3] - puppet merging esams text -> codfw [15:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:39:36] !log [traffic codfw switch #3] - salting puppet change [15:39:37] volans, I am going to update the masters on tendril to get a better picture (it is a rown on db1011 database) [15:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:40:00] infact, per once of my test edits on test.wikipedia.org not appearing in recentchanges i'll say no, which means the WDQS is going to be missing a handfull of edits no gehel! [15:40:02] jynus: ok [15:40:31] if it actually gets the changes from recentchanges and the api in that way [15:41:32] addshore: not sure I read the code correctly, I'll check with SMalyshev when he arrives [15:41:38] !log [traffic codfw switch #3] - puppet change complete - done [15:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:42:00] gehel: cool, yeh I wouldn't remember how it works without diving through the code again [15:42:13] bblack: all done? 
[15:42:23] still seeing some traffic on eqiad imagescalers, possibly related to swift proxy processes still running after reload [15:42:31] (03PS3) 10BBlack: codfw switch: eqiad text caches -> codfw [puppet] - 10https://gerrit.wikimedia.org/r/283432 [15:43:01] mark: in practice yes, but still waiting for (a) eqiad frontend users to finish draining out from DNS TTL [15:43:06] ok [15:43:17] + (b) reconfirming #1 is definitely done before #4, so we don't cause loops [15:43:20] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage: Unable to delete file pages on commons: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2218261 (10matmarex) I can't get the full backtraces from logstash (they're truncated there), but all of these exceptions are "Co... [15:43:35] but #4 is just to catch users stuck in eqiad with bad DNS, doesn't affect much load/traffic in practice [15:43:54] i'll start preparing an update mail to be sent out [15:44:15] <_joe_> /win 17 [15:45:15] #1 confirmed [15:45:32] !log [traffic codfw switch #4] - puppet merging eqiad text -> codfw [15:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:45:41] (03CR) 10BBlack: [C: 032 V: 032] codfw switch: eqiad text caches -> codfw [puppet] - 10https://gerrit.wikimedia.org/r/283432 (owner: 10BBlack) [15:46:39] !log [traffic codfw switch #4] - salting puppet change [15:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:49:00] !log [traffic codfw switch #2] - confirmed bulk of traffic moved after ~10min for DNS TTL, rates levelling out on eqiad+codfw front network stats [15:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:49:21] !log [traffic codfw switch #4] - puppet change complete - done [15:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:49:42] that's it for the traffic changes [15:49:51] \o/ [15:50:34] 
<_joe_> \o/ [15:51:35] \o/ [15:51:57] <_joe_> we're not good at dancing, are we [15:52:13] * urandom looks at his feet [15:52:18] ,o/ ,o/ ,o/ [15:52:43] 06Operations, 10Wikidata, 05codfw-rollout: BadMethodCallException when viewing Items created around the time of the eqiad -> codfw switch - https://phabricator.wikimedia.org/T133048#2218157 (10Addshore) [15:52:45] ahahahahah [15:52:51] travolta ftw [15:52:55] godog: good one! [15:53:48] hahah thanks akosiaris \o. [15:53:50] I was going to /o/ (hey) \o\ (ho) /o/ (hey) \o\ (ho) but yours is better [15:55:13] <_joe_> akosiaris: we know you can dance... [15:55:13] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 05codfw-rollout: WDQS stopped updating during datacenter switch - https://phabricator.wikimedia.org/T133046#2218336 (10Gehel) Looking at the code, it seems that updates are fetched by using the MW API. Doing the call manually from wdqs1001... [15:55:16] 06Operations, 10MediaWiki-Recent-changes, 07Security-General: No patrolling 23 minutes after Dallas - https://phabricator.wikimedia.org/T133053#2218337 (10matmarex) [15:56:11] <_joe_> gehel: which url for the api is used by wdqs? [15:56:15] congratulations everyone on the successful dc switch. :) [15:56:17] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 05codfw-rollout: WDQS stopped updating during datacenter switch - https://phabricator.wikimedia.org/T133046#2218353 (10Gehel) This is related to T133053. 
[15:56:35] 06Operations, 10MediaWiki-General-or-Unknown, 10Traffic, 10Wikimedia-General-or-Unknown, 07HTTPS: securecookies - https://phabricator.wikimedia.org/T119570#2218358 (10BBlack) [15:56:38] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review, 07Varnish: Mark cookies from varnish as secure - https://phabricator.wikimedia.org/T119576#2218356 (10BBlack) 05Open>03Resolved All Set-Cookie: emitted by varnish have the secure flag [15:56:58] _joe_: I reconstructed it from looking at the code (so I might be wrong) but it looks like ; curl -v -s https://www.wikidata.org/w/api.php?format=json\&action=query\&list=recentchanges\&rcdir=newer\&rcprop=title\|ids\|timestamp\&rclimit=100 [15:57:13] _joe_: gehel the query service is updating now https://grafana.wikimedia.org/dashboard/db/wikidata-query-service but if things were missing from RC then it will be missing data [15:57:38] <_joe_> addshore: ack [15:58:06] <_joe_> and thanks for looking :) [15:58:22] addshore: we need to reinstall one of the wdqs1001 server and do a full data load, so problem will be solved at that point for this server. For wdqs1002, we'll have to find a solution... [15:58:36] gehel: cool cool! [15:58:47] * addshore runs away to keep looking at https://phabricator.wikimedia.org/T133048 [15:59:31] Is there a tracking task / project for bugs related to the switch? 
[16:00:01] we should perhaps make one [16:00:14] for this particular date too, since there will be future switch tests [16:00:57] i mailed one last week [16:01:00] and again just now [16:01:07] #codfw-rollout [16:01:18] ok [16:01:22] it was deemed overkill to create a separate one [16:01:32] 06Operations, 10MediaWiki-Recent-changes, 07Security-General, 05codfw-rollout: No patrolling 23 minutes after Dallas - https://phabricator.wikimedia.org/T133053#2218390 (10csteipp) [16:03:22] (03CR) 10MGChecker: [C: 04-1] [WIP] Let Wikidata editors edit at a higher rate than on other wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280003 (owner: 10Jforrester) [16:06:52] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 05codfw-rollout: WDQS stopped updating during datacenter switch - https://phabricator.wikimedia.org/T133046#2218396 (10IKhitron) Well, Quarry ignores queries on the lost time in recentchanges table, but has all the data in revision table. [16:06:55] 06Operations, 10MediaWiki-Recent-changes, 07Security-General, 05codfw-rollout: No patrolling 23 minutes after Dallas - https://phabricator.wikimedia.org/T133053#2218397 (10IKhitron) Well, Quarry ignores queries on the lost time in recentchanges table, but has all the data in revision table. [16:08:46] 06Operations, 10MediaWiki-Recent-changes, 07Security-General, 05codfw-rollout: No patrolling 23 minutes after Dallas - https://phabricator.wikimedia.org/T133053#2218323 (10matmarex) There is a script called rebuildrecentchanges.php, but it would need some adjustments to work on a time range (right now it c... [16:09:42] 06Operations, 10MediaWiki-Recent-changes, 07Security-General, 05codfw-rollout: No patrolling 23 minutes after Dallas - https://phabricator.wikimedia.org/T133053#2218405 (10IKhitron) So, it will be OK! 
[16:14:02] (03CR) 10Dzahn: [C: 04-1] "it's called furud instead of yurud" [puppet] - 10https://gerrit.wikimedia.org/r/284200 (https://phabricator.wikimedia.org/T123718) (owner: 10Chad) [16:14:17] (03PS2) 10Chad: Use legacy key exchanges on furud, like antimony [puppet] - 10https://gerrit.wikimedia.org/r/284200 (https://phabricator.wikimedia.org/T123718) [16:14:34] (03PS3) 10Dzahn: Use legacy key exchanges on furud, like antimony [puppet] - 10https://gerrit.wikimedia.org/r/284200 (https://phabricator.wikimedia.org/T123718) (owner: 10Chad) [16:14:40] (03Draft1) 10Addshore: Add ganglia link to codfw too [software/tendril] - 10https://gerrit.wikimedia.org/r/284184 [16:15:07] (03CR) 10Dzahn: [C: 032] Use legacy key exchanges on furud, like antimony [puppet] - 10https://gerrit.wikimedia.org/r/284200 (https://phabricator.wikimedia.org/T123718) (owner: 10Chad) [16:16:03] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 05codfw-rollout: WDQS stopped updating during datacenter switch - https://phabricator.wikimedia.org/T133046#2218422 (10Addshore) >>! In T133046#2218396, @IKhitron wrote: > Well, Quarry ignores queries on the lost time in recentchanges table... [16:21:47] 06Operations, 10MediaWiki-Recent-changes, 07Security-General, 05codfw-rollout: Special:RecentChanges contains no entries from 14:48 - 15:10 UTC on 2015-04-19 due to Dallas data centre migration - https://phabricator.wikimedia.org/T133053#2218464 (10Deskana) [16:22:37] 06Operations, 10MediaWiki-Recent-changes, 07Security-General, 05codfw-rollout: Special:RecentChanges contains no entries from 14:48 - 15:10 UTC on 2015-04-19 due to Dallas data centre migration - https://phabricator.wikimedia.org/T133053#2218323 (10Deskana) I've updated this task with some of the informati... [16:27:18] is labs replication still working? 
[16:28:20] Krenair: yes [16:28:40] max(rc_timestamp) from enwiki_p.recentchanges is 20160419083717 [16:29:23] on labsdb1001 [16:29:50] on labsdb1003: 20160419162942 [16:30:28] yes labsdb1001 is delayed [16:30:37] ok [16:32:07] I was away long time... Datacenter switch happend without big problems? [16:32:14] ACKNOWLEDGEMENT - MySQL Replication Heartbeat on db1023 is CRITICAL: NRPE: Unable to read output Volans T133057 broken check, will go away in the next 48h [16:32:14] ACKNOWLEDGEMENT - MySQL Replication Heartbeat on db1033 is CRITICAL: NRPE: Unable to read output Volans T133057 broken check, will go away in the next 48h [16:32:14] ACKNOWLEDGEMENT - MySQL Replication Heartbeat on db1038 is CRITICAL: NRPE: Unable to read output Volans T133057 broken check, will go away in the next 48h [16:32:14] ACKNOWLEDGEMENT - MySQL Replication Heartbeat on db1040 is CRITICAL: NRPE: Unable to read output Volans T133057 broken check, will go away in the next 48h [16:32:14] ACKNOWLEDGEMENT - MySQL Replication Heartbeat on db1052 is CRITICAL: NRPE: Unable to read output Volans T133057 broken check, will go away in the next 48h [16:32:14] ACKNOWLEDGEMENT - MySQL Replication Heartbeat on db1058 is CRITICAL: NRPE: Unable to read output Volans T133057 broken check, will go away in the next 48h [16:33:08] Luke081515, well we lost a load of edits from RC [16:33:52] Krenair: :-/ I already saw that at the steward channel. Apart from that all ok? [16:34:18] Luke081515: mainly, so far :) [16:34:40] great :) [16:35:45] <_joe_> Luke081515: overall it went quite well, I would say, yes [16:36:21] _joe_: Good. IIRC the next switch is tomorrow, or I'm wrong? [16:36:29] thursday [16:36:33] <_joe_> no, thursday [16:36:36] <_joe_> in 46 hours [16:36:49] <_joe_> I didn't realize it was so near :( [16:37:05] <_joe_> no time to rest on laurels it seems [16:37:22] gah, I meant thrusday, my fault ;) [16:38:41] gehel: what's the remaining traffic on elasticsearch in eqiad? 
[16:38:51] ACKNOWLEDGEMENT - MySQL Replication Heartbeat on db1001 is CRITICAL: NRPE: Unable to read output Volans T133057 broken check [16:38:53] mark: indexing, and maybe ttmserver [16:39:03] ok [16:40:59] ttmserver is in eqiad per earlier discussions [16:42:02] <_joe_> definitely ttmserver [16:42:03] mark: as ebernhardson said. I did a check before the switch, I could see almost only indexing traffic [16:42:09] <_joe_> Nikerabbit: indeed [16:42:38] ttmserver is still in eqiad but is mostly insignificant compared to indexing traffic [16:43:58] <_joe_> of course [16:45:22] 06Operations, 10ops-codfw: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2218634 (10fgiunchedi) ok, thanks @papaul ! if we have to co-locate in the same rack in row C that's fine too, I'll leave it to you whether C1 or C5 [16:45:34] does anyone in here feel comfortable running a main script for wikidata to cleanup from the switchover? :P [16:46:18] <_joe_> addshore: honestly, no :P but someone with more dev knowledge maybe [16:46:37] I probably would, but I dont have access ;) [16:46:39] <_joe_> ori, AaronSchulz maybe? [16:46:46] <_joe_> addshore: which script btw? [16:46:51] <_joe_> I can take a look [16:47:02] see https://phabricator.wikimedia.org/T133048 rebuildEntityPerPage.php [16:47:31] PROBLEM - puppet last run on bast1001 is CRITICAL: CRITICAL: Puppet last ran 10 hours ago [16:48:57] <_joe_> addshore: that's a wikidata specific script, I have no idea how to use it [16:48:57] (03PS1) 10Dzahn: statistics: rsync on stat1004 for stat1001 migration [puppet] - 10https://gerrit.wikimedia.org/r/284225 (https://phabricator.wikimedia.org/T76348) [16:49:04] <_joe_> but if it runs on single pages [16:49:11] <_joe_> we can test one for sure [16:49:17] * AaronSchulz knows nothing about that script [16:49:19] hmm.. what's the puppet issue with bast1001? 
.looking [16:49:36] <_joe_> AaronSchulz: heh it's wikidata-specific [16:49:42] RECOVERY - puppet last run on bast1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:50:10] eh... ok [16:50:34] <_joe_> addshore: how is that script run? [16:52:03] _joe_: should be run as any other exttension main script is run [16:52:26] <_joe_> I see it repairs all of the table [16:52:34] (03PS2) 10Dzahn: statistics: rsync on stat1004 for stat1001 migration [puppet] - 10https://gerrit.wikimedia.org/r/284225 (https://phabricator.wikimedia.org/T76348) [16:52:37] yup, but will only fill in the gaps [16:52:50] (03PS3) 10Dzahn: statistics: rsync on stat1004 for stat1001 migration [puppet] - 10https://gerrit.wikimedia.org/r/284225 (https://phabricator.wikimedia.org/T76348) [16:53:11] (03CR) 10Dzahn: [C: 032] statistics: rsync on stat1004 for stat1001 migration [puppet] - 10https://gerrit.wikimedia.org/r/284225 (https://phabricator.wikimedia.org/T76348) (owner: 10Dzahn) [16:53:25] (03PS3) 10Elukey: Add delaycompress to ganglia-web's logrotate to avoid daily cronspam. [puppet] - 10https://gerrit.wikimedia.org/r/284133 (https://phabricator.wikimedia.org/T132324) [16:53:32] <_joe_> addshore: I'm honestly not that confident [16:53:49] thats fine, It can wait for someone else :) [16:54:28] * addshore may have to go look at what access he needs to run wikidata maint scripts... [16:54:47] (03PS2) 10BBlack: varnish redir: wmfusercontent.org -> www.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/284112 (https://phabricator.wikimedia.org/T132452) [16:55:13] <_joe_> addshore: do you have an account in prod? 
[16:55:17] yup [16:55:54] !log restarting gerrit to pick up furud's rsa key [16:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:56:24] addshore, to run maintenance scripts it's either restricted or deployment [16:56:31] 06Operations, 10ops-codfw: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2218678 (10Papaul) Thanks @fgiunchedi here is the final layout |**server name **|**rack location**| |restbase2007|B1| |restbase2008|C1| |restbase2009|D1| [16:56:33] *goes to look at those 2 groups* [16:56:39] normally maint scripts are run by cron, not people [16:56:42] I'm getting 503s on gerrit [16:56:44] they are on terbium [16:56:47] bblack: me too [16:56:49] <_joe_> mutante: no! [16:56:53] <_joe_> they are on wasat [16:56:55] nevermind, fixed itself [16:56:55] <_joe_> :) [16:57:01] oops, heh, of course :) [16:57:10] I logged it ;-) [16:57:56] grrrit-wm is not reporting in either [16:58:07] It always dies after a gerrit kick. [16:58:22] Krenair: in that case maybe I will put a request in for restricted [16:58:29] Krenair: Holp? grrrit-wm ^ [16:58:30] mutante: ok to merge? [16:58:51] <_joe_> addshore: request the ability to run mw maintenance scripts [16:58:53] bblack: yes please, the reason i didn't is that i got 503 from gerrit during puppet-merge because of the restart [16:58:53] btw, we're enqueing job queue jobs faster than processing them atm [16:58:55] ostriches, can't help due to https://phabricator.wikimedia.org/T132828 [16:58:56] <_joe_> it's a better request [16:59:11] queue size is increasing [16:59:14] <_joe_> mark: I suspect that has to do with some loop like the last time [16:59:14] YuviPanda: Plz ^ [16:59:24] unless you want me to elevate my tools access to projectadmin :) [17:00:21] but ops might not like that so much [17:00:24] <_joe_> ori, AaronSchulz the queue size is unsurprisingly increasing; can either of you take a look? I'd say AaronSchulz given ori is awake since... 
I lost count [17:01:41] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 05codfw-rollout: WDQS stopped updating during datacenter switch - https://phabricator.wikimedia.org/T133046#2218697 (10IKhitron) Well, I made the list of missing edits, and they are not unmarked! [17:01:43] !log ytterbium: stopped puppet for a bit, testing host key mess. [17:01:47] 06Operations, 10MediaWiki-Recent-changes, 07Security-General, 05codfw-rollout: Special:RecentChanges contains no entries from 14:48 - 15:10 UTC on 2015-04-19 due to Dallas data centre migration - https://phabricator.wikimedia.org/T133053#2218698 (10IKhitron) Well, I made the list of missing edits, and they... [17:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:01:50] 06Operations, 10Ops-Access-Requests: Requesting access to run mw maintenance scripts - https://phabricator.wikimedia.org/T133066#2218699 (10Addshore) [17:02:05] _joe_: Krenair ^^ [17:02:53] I had a reindex job in progress during the switchover. It now looks like those jobs are stuck in a job queue. 
[17:03:00] <_joe_> addshore: it won't happen now btw [17:03:11] yeh I know :) X days or something ;) [17:03:29] <_joe_> heh I feared I was crashing your expectations :P [17:03:54] it's not my first rodeo ;) [17:03:55] 06Operations, 10Traffic, 07HTTPS: Preload HSTS - https://phabricator.wikimedia.org/T104244#2218726 (10BBlack) [17:03:57] 06Operations, 10Traffic, 13Patch-For-Review: HSTS preload for wmfusercontent.org - https://phabricator.wikimedia.org/T132452#2218723 (10BBlack) 05Open>03Resolved a:03BBlack The redirect above worked, this is submitted to the preload list now (will take the usual lag time to make it into browsers) [17:04:11] and this is yet another behaviour of the jobqueue, went from 0 to 1.6M in 1 minute when started and then is going up [17:04:16] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 05codfw-rollout: WDQS stopped updating during datacenter switch - https://phabricator.wikimedia.org/T133046#2218728 (10Smalyshev) Yes, WDQS uses recent changes API, so if that one is broken then updates are broken too. [17:04:40] seems like ori and AaronSchulz went missing [17:05:03] maybe Krinkle? 
[17:05:32] I'm guessing that team hasn't renamed into "availability" just yet ;) [17:06:33] mark: My 100% uptime is gonna suffer this month :( [17:06:46] haha [17:07:32] urandom: the SSTables alert is tripping again [17:08:21] 06Operations, 10MediaWiki-Recent-changes, 07Security-General, 05codfw-rollout: Special:RecentChanges contains no entries from 14:48 - 15:10 UTC on 2015-04-19 due to Dallas data centre migration - https://phabricator.wikimedia.org/T133053#2218736 (10Deskana) [17:08:23] stat1004 actually has Petabyte storage, wow [17:08:23] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 05codfw-rollout: WDQS stopped updating during datacenter switch - https://phabricator.wikimedia.org/T133046#2218737 (10Deskana) [17:08:39] 06Operations, 10MediaWiki-Recent-changes, 07Security-General, 05codfw-rollout: Special:RecentChanges contains no entries from 14:48 - 15:10 UTC on 2015-04-19 due to Dallas data centre migration - https://phabricator.wikimedia.org/T133053#2218323 (10Deskana) [17:08:40] ostriches: speaking of your uptime :P -- I saw that mutante is moving gitblit to a new host; why aren't we just killing it? [17:08:41] ostriches: I can do it yeah. 
[17:08:41] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 05codfw-rollout: WDQS stopped updating during datacenter switch - https://phabricator.wikimedia.org/T133046#2218112 (10Deskana) [17:08:42] !log Deleting pc1002* old binlog from pc2005 to make some space [17:08:45] i'm not sure i saw a "P" in df -h before [17:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:08:56] 06Operations, 10MediaWiki-Recent-changes, 07Security-General, 05codfw-rollout: Special:RecentChanges contains no entries from 14:48 - 15:10 UTC on 2015-04-19 due to Dallas data centre migration - https://phabricator.wikimedia.org/T133053#2218323 (10Deskana) [17:08:58] volans: pc2006 is warning too [17:08:58] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 05codfw-rollout: WDQS stopped updating during datacenter switch - https://phabricator.wikimedia.org/T133046#2218112 (10Deskana) [17:09:10] paravoid: yes will be next ;) thanks [17:09:46] paravoid: Phabricator doesn't index non-committed revisions yet (things in gerrit but not yet merged). That's still used for repo browsing from Gerrit. [17:09:54] pc1002, shouldn't it be pc1005? [17:10:04] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 10hardware-requests, 10netops: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2218749 (10Ottomata) a:03Ottomata [17:10:05] We're working on a patch, but haven't figured it out 100% yet. [17:10:14] jynus: there are both [17:10:18] ha [17:10:19] those are from Jan [17:10:26] yes, when hardware upgrade [17:11:10] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 10hardware-requests, 10netops: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2218757 (10Ottomata) Will work on this today/tomorrow. 
[17:11:40] jynus: I'm deleting pc1005* too, other 166GB [17:12:00] +1 [17:12:18] !log Deleting pc1005* binlog from pc2005 to make some space [17:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:12:24] I also moved sqldata-cache.bak to /home [17:12:46] that is the old, local, parsed articles [17:12:59] only 4-5 GB [17:13:00] ok thanks, that was just 4.5GB [17:13:11] 78% now [17:13:25] 06Operations, 10MediaWiki-Recent-changes, 07Availability, 07Security-General, 05codfw-rollout: Special:RecentChanges contains no entries from 14:48 - 15:10 UTC on 2015-04-19 due to Dallas data centre migration - https://phabricator.wikimedia.org/T133053#2218798 (10mark) [17:15:55] 06Operations, 10Wikimedia-General-or-Unknown, 05codfw-rollout: Wrong sidebar cached? on fr.wikisource - https://phabricator.wikimedia.org/T133069#2218817 (10matmarex) [17:16:22] meh, jenkins-bot says Verified +2 but when you want to merge "needs Verified" ...lies [17:17:27] !log Deleting pc1003* and pc1006* binlog from pc2006 to make some space [17:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:20:42] 06Operations, 10MediaWiki-Recent-changes, 07Availability, 07Security-General, 05codfw-rollout: Special:RecentChanges contains no entries from 14:48 - 15:10 UTC on 2016-04-19 due to Dallas data centre migration - https://phabricator.wikimedia.org/T133053#2218891 (10mobrovac) [17:21:09] 06Operations, 10Wikimedia-General-or-Unknown, 05codfw-rollout: Wrong sidebar cached? on fr.wikisource - https://phabricator.wikimedia.org/T133069#2218892 (10Yann) I purge https://fr.wikisource.org/wiki/MediaWiki:Sidebar and it looks OK now. [17:21:19] mutante: in those instances, refreshing the page helps [17:22:26] 06Operations, 10Wikimedia-General-or-Unknown, 05codfw-rollout: Wrong sidebar cached? 
on fr.wikisource - https://phabricator.wikimedia.org/T133069#2218905 (10matmarex) I still see the bad sidebar viewing https://fr.wikisource.org/wiki/Page:Tolstoï_-_Le_salut_est_en_vous.djvu/55 when not logged in. Curiously,... [17:22:50] 06Operations, 10Wikimedia-General-or-Unknown, 05codfw-rollout: Wrong sidebar cached? on fr.wikisource - https://phabricator.wikimedia.org/T133069#2218817 (10bd808) I'm seeing similar issues on https://wikimediafoundation.org/wiki/Home. The sidebar was the default version for both anon and authed. A page purg... [17:23:22] mutante: /srv on stat1004 is [17:23:29] 6.8T , just saw [17:23:30] 6.8T avail [17:23:33] ja [17:23:34] :) [17:23:43] :) ok, cool, then the setup is done [17:23:52] to copy from 1001 to 1004 that is [17:24:04] also, for rsync, i see you made a new ::migration class? [17:24:34] we might want to just add stat1004 to lsit of statistics servers, and include statistics::rsync class [17:24:34] all statistics_servers are configured to be able to write to each other's /srv [17:24:35] but, whatev! 
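Both space checks just above, whether /srv on stat1004 is large enough for the stat1001 copy and the "78% now" reading after purging binlogs on pc2005, are df-style percentage checks. A stdlib sketch of the same measurement; the path and threshold are illustrative, not the hosts' actual layout:

```python
import shutil

def percent_used(path: str) -> float:
    """Percent of the filesystem holding `path` that is in use, like df's Use%."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

# Illustrative threshold check; "/" stands in for the real data partition.
if percent_used("/") > 90.0:
    print("partition is low on space; time to purge old binlogs")
```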
[17:24:35] temp ::migration class works too [17:25:30] !log demon@tin Synchronized php-1.27.0-wmf.21/extensions/CentralAuth: forgot something (duration: 00m 42s) [17:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:25:58] ok, then [17:25:59] elukey: ^ [17:27:50] (03Abandoned) 10Dzahn: stats: adjust rsyncd pathes to use petabyte mount [puppet] - 10https://gerrit.wikimedia.org/r/284235 (https://phabricator.wikimedia.org/T76348) (owner: 10Dzahn) [17:28:07] (03CR) 10Dzahn: "nevermind, /srv _is_ large enough, so this is done" [puppet] - 10https://gerrit.wikimedia.org/r/284235 (https://phabricator.wikimedia.org/T76348) (owner: 10Dzahn) [17:28:23] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 10hardware-requests, 10netops: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2218956 (10Ottomata) [17:33:58] 06Operations, 10Ops-Access-Requests: Requesting access to run mw maintenance scripts - https://phabricator.wikimedia.org/T133066#2218998 (10Krenair) So restricted access, basically? [17:34:39] thanks mutante, didn't know about the rsync sorry :( [17:34:41] 06Operations, 10Wikimedia-General-or-Unknown, 05codfw-rollout: Wrong sidebar cached? on fr.wikisource - https://phabricator.wikimedia.org/T133069#2219003 (10bd808) >>! In T133069#2218908, @bd808 wrote: > Now I have a strange reproduction case for anons: > * Hit https://wikimediafoundation.org/wiki/Home and s... [17:35:10] (03PS2) 10BBlack: Common VCL: remove wikimedia.org subdomain HTTPS redirect exception [puppet] - 10https://gerrit.wikimedia.org/r/284106 (https://phabricator.wikimedia.org/T102826) [17:36:11] enqueue: 1141607 queued; 2250154 claimed (304734 active, 1945420 abandoned); 0 delayed [enwiki] [17:36:25] (03PS4) 10Elukey: Add delaycompress to ganglia-web's logrotate to avoid daily cronspam. 
[puppet] - 10https://gerrit.wikimedia.org/r/284133 (https://phabricator.wikimedia.org/T132324) [17:36:27] <_joe_> so it's mostly enwiki? [17:36:29] paravoid: pong [17:36:40] next wiki is only 6k [17:38:13] _joe_: is there a generic maintenance host name (e.g. not terbium) [17:38:20] <_joe_> nope [17:38:30] <_joe_> wasat is the new terbium [17:38:52] (03CR) 10Elukey: [C: 032] Add delaycompress to ganglia-web's logrotate to avoid daily cronspam. [puppet] - 10https://gerrit.wikimedia.org/r/284133 (https://phabricator.wikimedia.org/T132324) (owner: 10Elukey) [17:39:54] <_joe_> ostriches: I was thinking, you should use mira for scap now [17:41:21] I thought about that the second after I sync'd. [17:41:26] Muscle memory [17:41:38] <_joe_> yeah it's not properly switched over, though [17:41:52] <_joe_> I can switchover if we think it's needed, but I don't think so [17:41:59] <_joe_> since we froze changes this week [17:44:54] <_joe_> so outstanding problems are: 1) Missing RC changes 2) Some wikidata articles failing to render (addshore has a solution, see https://phabricator.wikimedia.org/T133048, I just don't feel confident running that script) 3) Sidebar not properly updated 4) 1 M enwiki jobs [17:46:13] do we have tickets for each? [17:46:17] I know there's one for RC [17:46:29] I know addshore has an access request for maint scripts [17:46:52] aha, sidebar was filed: https://phabricator.wikimedia.org/T133069 [17:47:09] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2219074 (10Dzahn) We now have an rsyncd running on stat1004, ready to accept data from stat1001, it will be in /srv/stat1001/ , there are 3 modules, one for home, one for srv and... [17:47:17] not sure about 1m jobs [17:49:08] 06Operations, 10Traffic, 06WMF-Legal, 10domains: Register nlwikipedia.org to prevent squatting - https://phabricator.wikimedia.org/T128968#2219084 (10Dzahn) >>! 
In T128968#2216663, @Peachey88 wrote: > That is offtopic for this task, Please file a seperate task This has been opened before as T93523 . It wa... [17:49:52] 06Operations, 10MediaWiki-Recent-changes, 07Availability, 07Security-General, 05codfw-rollout: Special:RecentChanges contains no entries from 14:48 - 15:10 UTC on 2016-04-19 due to Dallas data centre migration - https://phabricator.wikimedia.org/T133053#2219089 (10matmarex) a:03matmarex I'm going to wo... [17:49:57] !log setting binlog_format=ROW on old x1-master at eqiad (db1029) to reenable replication [17:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:50:37] 06Operations, 10Traffic, 06WMF-Legal, 10domains: Register nlwikipedia.org to prevent squatting - https://phabricator.wikimedia.org/T128968#2219092 (10Sjoerddebruin) Note: the domain seems to be registered already. But all I get is MacKeeper shit... [17:50:51] (03PS1) 10Volans: MariaDB: Fix pt-heartbeat for x1 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/284243 (https://phabricator.wikimedia.org/T124699) [17:51:02] <_joe_> addshore: still around? [17:51:06] yup [17:51:26] <_joe_> addshore: that script should run just on wikidata, right? [17:51:41] <_joe_> so --wiki wikidatawiki if I'm not wrong [17:52:02] yup [17:52:03] (03PS1) 10Rush: kubernetes to v1.2.2wmf1 [puppet] - 10https://gerrit.wikimedia.org/r/284244 [17:52:05] and that should be it [17:52:39] <_joe_> addshore: I'm going to run that then :) [17:52:40] 06Operations, 10Traffic, 06WMF-Legal, 10domains: Register nlwikipedia.org to prevent squatting - https://phabricator.wikimedia.org/T128968#2219098 (10Dzahn) Yep, looks like it's already gone. , registered to a Mr. Gerbert in London. 
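The job-queue report quoted a few lines up ("enqueue: 1141607 queued; 2250154 claimed (304734 active, 1945420 abandoned); 0 delayed [enwiki]") can be parsed mechanically when watching queue growth. The format below is inferred from that single sample, not from any documented showJobs.php output:

```python
import re

# Shape of the status line quoted in-channel (inferred from one sample).
QUEUE_RE = re.compile(
    r"(?P<queued>\d+) queued; (?P<claimed>\d+) claimed "
    r"\((?P<active>\d+) active, (?P<abandoned>\d+) abandoned\); "
    r"(?P<delayed>\d+) delayed \[(?P<wiki>\w+)\]"
)

def parse_queue_status(line: str) -> dict:
    """Turn one queue-status line into integer counters plus the wiki name."""
    m = QUEUE_RE.search(line)
    if m is None:
        raise ValueError("unrecognized queue status line: %r" % line)
    stats = {k: int(v) for k, v in m.groupdict().items() if k != "wiki"}
    stats["wiki"] = m.group("wiki")
    return stats

stats = parse_queue_status(
    "enqueue: 1141607 queued; 2250154 claimed "
    "(304734 active, 1945420 abandoned); 0 delayed [enwiki]"
)
print(stats["wiki"], stats["queued"])
```

Sampling this line periodically and comparing `queued` over time is one way to confirm the "queue size is increasing" observation numerically.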
[17:52:51] _joe_: awesome, and I'll be here throughout :) [17:52:51] (03CR) 10Yuvipanda: [C: 031] kubernetes to v1.2.2wmf1 [puppet] - 10https://gerrit.wikimedia.org/r/284244 (owner: 10Rush) [17:53:17] RECOVERY - MySQL Slave Running on db1029 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error [17:53:36] ^seems to work, and recovering from lag quickly [17:53:42] cool [17:53:44] 06Operations, 10Phabricator, 10Traffic, 10hardware-requests: We need a backup phabricator front-end node - https://phabricator.wikimedia.org/T131775#2219099 (10mmodell) One interesting thing that phabricator seems to be implementing it's own load balancing for repositories. I think the idea is that any fr... [17:56:56] (03CR) 10Rush: [C: 032 V: 032] kubernetes to v1.2.2wmf1 [puppet] - 10https://gerrit.wikimedia.org/r/284244 (owner: 10Rush) [17:57:43] 06Operations, 06WMF-Legal, 07Privacy: Consider moving policy.wikimedia.org away from WordPress.com - https://phabricator.wikimedia.org/T132104#2219108 (10Krinkle) p:05Triage>03Normal [17:57:58] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:58:34] 06Operations, 10Phabricator, 10Traffic, 10hardware-requests: We need a backup phabricator front-end node - https://phabricator.wikimedia.org/T131775#2219113 (10Dzahn) @Robh @Papaul do we have a server in codfw that matches "4-8 core CPU, 16-32G ram and 500g of non-mirrored storage." and could be used for t... [17:58:48] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:59:04] (03PS2) 10Volans: MariaDB: Fix pt-heartbeat for x1 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/284243 (https://phabricator.wikimedia.org/T124699) [18:00:04] 06Operations, 10Phabricator, 10Traffic, 10hardware-requests: We need a backup phabricator front-end node - https://phabricator.wikimedia.org/T131775#2219126 (10mmodell) [18:00:08] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [18:00:23] 06Operations, 10Wikimedia-General-or-Unknown, 05codfw-rollout: Wrong sidebar cached? on fr.wikisource - https://phabricator.wikimedia.org/T133069#2219132 (10bd808) Purging https://wikimediafoundation.org/w/index.php?title=Questions_for_Wikimedia%3F&redirect=no fixed that one instance. This is easily explaine... [18:00:26] <_joe_> !log running rebuildEntityPerPage.php on wikidata, T133048 [18:00:27] T133048: BadMethodCallException when viewing Items created around the time of the eqiad -> codfw switch - https://phabricator.wikimedia.org/T133048 [18:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:00:33] (03CR) 10Volans: [C: 032] "changes looks good https://puppet-compiler.wmflabs.org/2503/" [puppet] - 10https://gerrit.wikimedia.org/r/284243 (https://phabricator.wikimedia.org/T124699) (owner: 10Volans) [18:01:08] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [18:02:47] aqs is still suffering for the Cassandra timeouts, new hardware will arrive soon [18:05:28] RECOVERY - MariaDB Slave Lag: x1 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 0.07 seconds [18:05:29] (03PS1) 10Ottomata: Add analytics1003 in netboot.cfg and site.pp [puppet] - 10https://gerrit.wikimedia.org/r/284249 (https://phabricator.wikimedia.org/T130840) [18:05:45] RECOVERY - MariaDB Slave Lag: x1 on db2008 is OK: OK slave_sql_lag Replication lag: 0.08 seconds [18:05:47] i cannot login into horizon.wikimedia.org - is that part of the switchover? 
[18:06:13] 06Operations, 10Wikidata, 05codfw-rollout: BadMethodCallException when viewing Items created around the time of the eqiad -> codfw switch - https://phabricator.wikimedia.org/T133048#2219157 (10Addshore) 05Open>03Resolved a:03Addshore Looks resolved to me! [18:06:40] yurik: no it is not [18:06:56] RECOVERY - MariaDB Slave Lag: x1 on db2009 is OK: OK slave_sql_lag Replication lag: 0.15 seconds [18:07:08] RECOVERY - MariaDB Slave Lag: x1 on db1031 is OK: OK slave_sql_lag Replication lag: 0.24 seconds [18:07:17] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:07:34] ottomata: "Throughput of EventLogging NavigationTiming events" ? [18:07:41] Krenair: that wikidata one is ticket off the list :) [18:07:44] looking [18:07:48] urandom: SSTables alert? [18:07:49] akosiaris, i logged in about an hour ago, now i tired it again and it fails [18:07:57] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:08:20] btw, paravoid looking, the proper folks to ping about that are performance folks [18:08:24] (03PS1) 10Yuvipanda: tools: Add a null check for manifest version checking [puppet] - 10https://gerrit.wikimedia.org/r/284250 [18:08:30] that comes from the statsv instance running on hafnium [18:08:52] ottomata: Is there an example in prod that uses get_simple_consumer in the way you described yesterday? I know it's a trivial parameter, but I can't statsv easily so would be nice to have a reference to something. 
and not be the only one using it that way [18:08:58] (03PS2) 10Dzahn: kraz.codfw.wmnet -> kraz.wm.org, needs public IP [puppet] - 10https://gerrit.wikimedia.org/r/284115 (https://phabricator.wikimedia.org/T123729) [18:09:17] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [18:09:29] ACKNOWLEDGEMENT - MySQL Replication Heartbeat on db1029 is CRITICAL: NRPE: Unable to read output Volans T133057 broken check [18:09:35] 06Operations, 10Phabricator, 10Traffic, 10hardware-requests: We need a backup phabricator front-end node - https://phabricator.wikimedia.org/T131775#2219177 (10RobH) a:03mark No need to ping Papaul, he doesn't have any involvement in #hardware-requests. (Its primarily myself, and if I am out sick, then... [18:09:37] hm, not that I know of Krinkle, the only other prod use of pykafka I know of is the eventlogging handler. it uses the pykafka balancedConsumer though [18:09:39] not the simple consumer [18:09:40] but [18:09:42] (03PS2) 10Yuvipanda: tools: Add a null check for manifest version checking [puppet] - 10https://gerrit.wikimedia.org/r/284250 [18:09:43] the configs should be the same [18:09:44] (03PS3) 10Dzahn: kraz.codfw.wmnet -> kraz.wm.org, needs public IP [puppet] - 10https://gerrit.wikimedia.org/r/284115 (https://phabricator.wikimedia.org/T123729) [18:09:46] mostly [18:09:49] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Add a null check for manifest version checking [puppet] - 10https://gerrit.wikimedia.org/r/284250 (owner: 10Yuvipanda) [18:09:58] addshore: https://phabricator.wikimedia.org/T133000 [18:09:58] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy [18:10:02] 06Operations, 10Phabricator, 10Traffic, 10hardware-requests: codfw: (1) phabricator host (backup node) - https://phabricator.wikimedia.org/T131775#2219181 (10RobH) [18:10:07] but, the example uses the eventlogging URI handler scheme for config and passes them via kwargs [18:10:11] addshore: I filed that yesterday before the 
switchover started [18:10:15] (03PS4) 10Dzahn: kraz.codfw.wmnet -> kraz.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/284116 (https://phabricator.wikimedia.org/T123729) [18:10:15] so i can't really point you to a code example directly [18:10:33] Krinkle: ooh, at a first glance that looks unrelated [18:10:41] Krinkle: i strangely see this in service statsv status output [18:10:42] UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 40: invalid start byte [18:10:47] addshore: Well, it makes RC ununsable, so it should be fixed too :) [18:10:54] para void pinged about the icinga alert showing unknown, that's why i'm looking [18:10:57] ahh okay! [18:11:03] it's a fatal [18:11:19] got to dig into a possible solution for another wikidata thing first as fallout from the switch [18:11:24] 06Operations, 10Phabricator, 10Traffic, 10hardware-requests: codfw: (1) phabricator host (backup node) - https://phabricator.wikimedia.org/T131775#2219189 (10Papaul) @Dzahn WMF5849 rbf2001 A5 Dell PowerEdge R420 Intel® Xeon® Processor E5-2440 3.00 6 cores Yes 32 GB RAM (2) 500GB SATA WMF3641 B5 Dell Powe... [18:11:57] Do we have a timestamp at which point jobs will not have being able to be queued and a timestamp that the queues started working again? [18:14:36] (03PS5) 10Dzahn: kraz.codfw.wmnet -> kraz.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/284116 (https://phabricator.wikimedia.org/T123729) [18:14:40] (03CR) 10Yuvipanda: [C: 04-1] "I'm pretty sure we shouldn't be using puppet's auth.conf - that'll diverge us from prod's puppetmaster a lot more than necessary on import" [puppet] - 10https://gerrit.wikimedia.org/r/284103 (owner: 10Andrew Bogott) [18:16:06] Krinkle: are you plannign on just adding auto_offset_reset=-1, or are you going to tell it to commit offsets too? 
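The `auto_offset_reset=-1` change being discussed corresponds, in pykafka terms, to `OffsetType.LATEST`: a restarted consumer skips to the newest message rather than replaying a backlog. A dependency-free sketch of the kwargs such a `get_simple_consumer()` call might take; the group name and the use of `reset_offset_on_start` are assumptions, not statsv's actual configuration:

```python
# pykafka's OffsetType.LATEST is the -1 discussed above; hard-coded here so
# the sketch has no pykafka dependency.
OFFSET_LATEST = -1

def consumer_kwargs(group: bytes) -> dict:
    """kwargs one might pass to topic.get_simple_consumer() so a restarted
    consumer resumes at the newest offset instead of replaying old data."""
    return {
        "consumer_group": group,            # assumed name, not statsv's real one
        "auto_offset_reset": OFFSET_LATEST,
        # With no committed offsets, start from auto_offset_reset on each run.
        "reset_offset_on_start": True,
    }

kwargs = consumer_kwargs(b"statsv")
```

With pykafka this would be roughly `client.topics[b'statsv'].get_simple_consumer(**kwargs)`; whether the consumer should also commit offsets is exactly the open question in the exchange above.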
[18:16:18] probably auto_offset_reset=-1 [18:16:23] i think that makes sense for statsv [18:16:25] cool [18:16:36] I'm pretty sure if you just add that to the get_simple_consumer call [18:16:44] it will just work ™ [18:17:11] (03CR) 10Andrew Bogott: "Auth.conf is nonetheless present on the puppetmaster. With it or without it, puppet rejects queries to that url unless it is explicitly p" [puppet] - 10https://gerrit.wikimedia.org/r/284103 (owner: 10Andrew Bogott) [18:17:33] (03CR) 10Dzahn: [C: 032] kraz.codfw.wmnet -> kraz.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/284116 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn) [18:18:28] heh, Krinkle not sure what is wrong with statsv on hafnium right now [18:18:28] (03CR) 10Yuvipanda: "The current auth.conf is a strange beast - if you look at auth.conf.orig, that was what was there before, and probably was far more permis" [puppet] - 10https://gerrit.wikimedia.org/r/284103 (owner: 10Andrew Bogott) [18:18:34] i would restart it, but... :p [18:18:42] I'm not aware of there being an issue [18:18:51] I noticed you got pinged earlier by someone with the name of that task [18:18:55] what is this about? [18:18:58] I couldn't find it in backscroll [18:18:59] *scrolls up* there is nothing in the SAL about when jobchron came back :/ [18:19:05] (03PS4) 10Dzahn: kraz.codfw.wmnet -> kraz.wm.org, needs public IP [puppet] - 10https://gerrit.wikimedia.org/r/284115 (https://phabricator.wikimedia.org/T123729) [18:19:24] Krinkle: about https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=graphite1001&service=Throughput+of+EventLogging+NavigationTiming+events [18:19:31] (03CR) 10Dzahn: [C: 032] kraz.codfw.wmnet -> kraz.wm.org, needs public IP [puppet] - 10https://gerrit.wikimedia.org/r/284115 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn) [18:19:36]   UNKNOWN   (for 0d 20h 19m 19s) [18:19:37] ottomata: Yes, please don't restart. I'd rather have it shutdown indefinitely in that case. 
All my graphs are beyond useless because of the previous incident. [18:19:44] Gonna have to figure a way to wipe that in Graphite [18:20:05] Krinkle: hm, eyah. if you submit to graphite directly you can set timestamp [18:20:18] maybe you can do that and use timestamp of event in statsv? meh, maybe you don't have that [18:20:27] statsv has no concept of timestamps [18:20:28] Krinkle: para void just pinged me about that icinga alert [18:20:37] it buffers all incoming packets and flushes an aggregate once per minute to graphite [18:20:37] right ja, its supposed to be statsd :9 [18:20:38] :p [18:20:48] ottomata: The old one? [18:20:53] happening now [18:20:55] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=graphite1001&service=Throughput+of+EventLogging+NavigationTiming+events [18:20:56] link? [18:21:05] ^^ [18:21:07] and on hafnium: [18:21:19] sudo service statsv status [18:21:19] ... [18:21:24] UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 40: invalid start byte [18:21:27] * Krinkle doesn't have sudo there [18:21:30] https://www.wikidata.org/wiki/Wikidata_talk:Main_Page#Sidebar It would appear the link to the mainpage on wikidata now links to the wrong place, could this really be codfw fallout... [18:21:43] wha?! [18:21:44] heh [18:22:05] dunno what data its loading [18:22:07] but its stuck :/ [18:22:17] data = json.loads(raw_data) is throwing a decode exception [18:22:44] ottomata: Yeah, it's quite possible I guess. When I was tailing statsv from kafka the other day I also got a half packet of incomplete json [18:22:45] Krinkle: should we just work now to make statsv use auto_offset_reset=-1, deploy that change, and then restart? [18:22:59] It's been known to happen a few times. I saw that Ori's been patching various of our consumers to catch parse errors [18:23:01] but it's annoying. [18:23:19] ottomata: I guess, yeah, that sounds good. But I don't see an issue yet though. 
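The statsv traceback above (`UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0` out of `json.loads(raw_data)`) is the classic one-bad-message-kills-the-consumer failure. A sketch of the catch-and-skip pattern Ori is described as applying to other consumers; the loop shape is illustrative, not statsv's actual code:

```python
import json
import logging

logger = logging.getLogger("statsv")

def decode_messages(raw_messages):
    """Yield parsed payloads, skipping undecodable or half-written messages
    instead of letting one bad record kill the whole consumer."""
    for raw in raw_messages:
        try:
            yield json.loads(raw.decode("utf-8"))
        except (UnicodeDecodeError, ValueError) as exc:
            # e.g. "'utf8' codec can't decode byte 0xa0 in position 40"
            logger.warning("skipping bad message %r: %s", raw[:40], exc)

good = list(decode_messages([b'{"ok": 1}', b"\xa0 not utf-8", b'{"truncated": ']))
print(good)  # only the first, well-formed message survives
```

The trade-off is silent data loss for the skipped records, which is why logging each skip (and counting them) matters.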
[18:23:26] Data seems to be coming in from where I'm looking [18:23:39] oh [18:23:39] hm [18:23:41] ja? [18:23:48] ok maybe its not dead then [18:23:50] both eventlogging and statsv [18:23:55] ottomata: remember navtiming is not statsv [18:24:07] eventlogging ->navtiming [18:24:11] HMMMM [18:24:13] statsv is well, statsv [18:24:22] not related to navtiming or eventlogging [18:24:25] yeah ok looking at alert, yeah, its based on kafka topic not statsd stuff [18:24:26] sorry [18:24:33] * Krinkle shatters reality [18:24:44] HAHAH [18:24:45] DUH [18:24:46] whoops [18:24:53] getting my wires crossed here [18:25:17] hm, ok, the data for navtiming is fine in el [18:25:17] https://grafana.wikimedia.org/dashboard/db/performance-metrics?from=now-1h (eventlogging navtiming -> statsd -> graphite) [18:25:18] RECOVERY - Restbase root url on restbase1014 is OK: HTTP OK: HTTP/1.1 200 - 15253 bytes in 0.043 second response time [18:25:20] and kafka [18:25:28] https://grafana.wikimedia.org/dashboard/db/media?from=now-1h (statsv -> statsd) [18:25:29] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [18:25:31] both seem fine [18:25:43] ok looks like alert is faulty then, will look into it, sorry for dragging you into that [18:25:56] no worries [18:26:14] btw, scap sudo, I can't even ssh into hafnium apparently. [18:26:16] scrap* [18:26:34] not that I need to [18:27:02] ha, i think you should be able to! [18:27:08] ahh Krenair was there a ticket for that sidebar thing? [18:27:13] who's maintaining statsv then :) [18:27:20] addshore: https://phabricator.wikimedia.org/T133069 [18:27:25] !log kraz.codfw, reinstalling as kraz.wikimedia [18:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:28:04] ottomata: performance team yes.
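As ottomata notes earlier, statsd (and hence statsv) offers no timestamp control, while Graphite's plaintext protocol lets the sender set one explicitly. A sketch of that line format; the metric path below is made up, and a real sender would write the line to carbon, conventionally TCP port 2003:

```python
# Graphite plaintext protocol sketch: "<path> <value> <epoch>\n".
# Unlike statsd, the sender picks the timestamp, so points can be submitted
# for the time the event actually happened rather than flush time.
import time

def graphite_line(path, value, timestamp=None):
    """Format one Graphite plaintext-protocol line."""
    if timestamp is None:
        timestamp = int(time.time())
    return '%s %s %d\n' % (path, value, timestamp)
```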
They're both in puppet/files/webperf/ [18:29:08] PROBLEM - Host kraz is DOWN: PING CRITICAL - Packet loss = 100% [18:29:13] 06Operations, 10Wikimedia-General-or-Unknown, 05codfw-rollout: Wrong sidebar cached? on fr.wikisource - https://phabricator.wikimedia.org/T133069#2218817 (10Addshore) Vaguely similar, the link to the main page for wikidata.org now links to the incorrect place. See https://www.wikidata.org/wiki/Wikidata_talk... [18:33:21] AaronSchulz: hi, so if you run the MatmaRex script, you need to do it on wasat.codfw.wmnet, not on terbium. [18:35:03] 06Operations, 10Wikimedia-General-or-Unknown, 05codfw-rollout: Wrong sidebar cached? on fr.wikisource - https://phabricator.wikimedia.org/T133069#2218817 (10Legoktm) I think at some point MessageCache failed and started returning the defaults for everything, which ended up getting cached by other things (sid... [18:36:57] mutante: Hm.. I didn't realise until now that kraz is in codfw [18:38:29] isn't that a good thing? [18:39:56] (03PS1) 10Eevans: raise highest SSTables (again) [puppet] - 10https://gerrit.wikimedia.org/r/284256 [18:40:11] paravoid: ^^ [18:41:03] (03PS2) 10Faidon Liambotis: raise highest SSTables (again) [puppet] - 10https://gerrit.wikimedia.org/r/284256 (owner: 10Eevans) [18:41:23] paravoid: it would be great if there was something in between, nothing, and annoy everyone [18:41:25] (03PS3) 10Faidon Liambotis: Raise highest SSTables thresholds (again) [puppet] - 10https://gerrit.wikimedia.org/r/284256 (owner: 10Eevans) [18:41:40] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Raise highest SSTables thresholds (again) [puppet] - 10https://gerrit.wikimedia.org/r/284256 (owner: 10Eevans) [18:42:38] paravoid: the answer seems to be, "routinely put eyeballs on a bunch of graphs", which is... disappointing [18:45:07] greg-g, is it ok to make labs-only config changes this week?
[18:45:19] and by labs i mean betacluster :) [18:45:48] * YuviPanda pats yurik [18:45:49] (03PS1) 10Ottomata: Fix eventlogging_NavigationTiming_throughput again - need sumSeries() [puppet] - 10https://gerrit.wikimedia.org/r/284257 [18:46:03] yurik: am pretty sure greg-g is on leave now. [18:46:22] YuviPanda, ok, do you know if there are any restrictions on beta cluster changes? [18:46:50] yurik: I do not know, sorry. ask in #wikimedia-releng maybe [18:51:12] (03CR) 10Ottomata: [C: 032] Fix eventlogging_NavigationTiming_throughput again - need sumSeries() [puppet] - 10https://gerrit.wikimedia.org/r/284257 (owner: 10Ottomata) [18:51:23] 06Operations: Something in WMF infrastructure corrupts responses with certain lengths - https://phabricator.wikimedia.org/T132159#2219457 (10Anomie) [18:51:28] (03PS2) 10Ottomata: Add analytics1003 in netboot.cfg and site.pp [puppet] - 10https://gerrit.wikimedia.org/r/284249 (https://phabricator.wikimedia.org/T130840) [18:51:36] (03CR) 10Ottomata: [C: 032 V: 032] Add analytics1003 in netboot.cfg and site.pp [puppet] - 10https://gerrit.wikimedia.org/r/284249 (https://phabricator.wikimedia.org/T130840) (owner: 10Ottomata) [18:51:56] 06Operations, 10Wikimedia-General-or-Unknown, 05codfw-rollout: Wrong sidebar cached? on fr.wikisource - https://phabricator.wikimedia.org/T133069#2218817 (10Joe) So, during the switchover we first wiped the codfw memcached clean, then when moving the traffic over we had a temporary overload of the externalst... 
[18:58:00] (03PS1) 10Dzahn: install: update MAC address of kraz [puppet] - 10https://gerrit.wikimedia.org/r/284259 (https://phabricator.wikimedia.org/T123729) [18:58:59] (03PS1) 10Eevans: Disable RESTBase highest max SSTables per read threshold [puppet] - 10https://gerrit.wikimedia.org/r/284262 [18:59:15] (03CR) 10Dzahn: [C: 032] install: update MAC address of kraz [puppet] - 10https://gerrit.wikimedia.org/r/284259 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn) [19:03:12] 06Operations, 10EventBus: setup/deploy conf200[1-3] - https://phabricator.wikimedia.org/T127344#2219493 (10RobH) [19:03:14] 06Operations, 10Analytics-Cluster, 10EventBus, 06Services: Investigate proper set up for using Kafka MirrorMaker with new main Kafka clusters. - https://phabricator.wikimedia.org/T123954#2219495 (10RobH) [19:03:17] 06Operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#2219491 (10RobH) 05stalled>03Resolved As this task has had systems allocated, and setup is via T131959, resolving this request. [19:03:55] 06Operations, 10hardware-requests: new labstore hardware for eqiad - https://phabricator.wikimedia.org/T126089#2219500 (10RobH) [19:04:59] 06Operations, 06Labs: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2219505 (10RobH) [19:05:02] 06Operations, 10hardware-requests: new labstore hardware for eqiad - https://phabricator.wikimedia.org/T126089#2219503 (10RobH) 05stalled>03Resolved This request has been fulfilled via order on #procurement task T127508. Resolving this #hardware-requests task. 
[19:05:47] 06Operations, 10RESTBase, 10hardware-requests, 13Patch-For-Review: normalize eqiad restbase cluster - replace restbase1001-1006 - https://phabricator.wikimedia.org/T125842#2219514 (10RobH) [19:05:50] 06Operations, 10RESTBase, 10hardware-requests: 3x additional SSD for restbase hp hardware - https://phabricator.wikimedia.org/T126626#2219512 (10RobH) 05Open>03Resolved This was resolved awhile ago, and this task was overlooked (as the sub-tasks had the actual work performed on them.) [19:06:39] !log purging sidebar cache across all wikis (T133069) [19:06:40] T133069: Wrong sidebar cached? on fr.wikisource - https://phabricator.wikimedia.org/T133069 [19:06:40] 07Blocked-on-Operations, 06Operations, 10RESTBase, 10hardware-requests: Expand SSD space in Cassandra cluster - https://phabricator.wikimedia.org/T121575#2219524 (10RobH) [19:06:42] 06Operations, 10RESTBase, 10hardware-requests, 13Patch-For-Review: normalize eqiad restbase cluster - replace restbase1001-1006 - https://phabricator.wikimedia.org/T125842#2219522 (10RobH) 05Open>03Resolved These systems are being replaced via the sub-tasks. Since the hardware request is granted, I'm... [19:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:06:46] 06Operations, 10Traffic, 06WMF-Legal, 10domains: Register nlwikipedia.org to prevent squatting - https://phabricator.wikimedia.org/T128968#2219528 (10jayvdb) I cant see T93523. It is private? Can it be made public, or should we recreate a new task about that problem. [19:07:20] (03PS1) 10Dzahn: add tegmen and einsteinium to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/284264 (https://phabricator.wikimedia.org/T125023) [19:10:53] 06Operations, 10Traffic, 06WMF-Legal, 10domains: Register nlwikipedia.org to prevent squatting - https://phabricator.wikimedia.org/T128968#2219572 (10Dzahn) Yes, it has a custom policy. It has been created by @Glaisher. I modified the custom policies by adding your user manually. 
Try again now? [19:12:11] !log Cleared enwiki 'enqueue' queue (T133089) [19:12:12] T133089: enwiki "enqueue" queue showed corruption - https://phabricator.wikimedia.org/T133089 [19:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:14:40] 06Operations, 10EventBus, 06Services, 10hardware-requests: 4 more Kafka brokers, 2 in eqiad and 2 codfw - https://phabricator.wikimedia.org/T124469#2219593 (10RobH) irc update: In triaging the #hw-requests, I've checked with @ottomata. This needs to have further investigation done, so I'm keeping it assig... [19:25:04] 06Operations, 10Wikimedia-General-or-Unknown, 05codfw-rollout: Wrong sidebar cached? on fr.wikisource - https://phabricator.wikimedia.org/T133069#2219659 (10Legoktm) The script completed in about 4 minutes. Now we need a varnish purge for every page cached after the switchover till my script finished. [19:26:29] 06Operations, 10RESTBase-Cassandra, 06Services: Highest SSTables / read thresholds - https://phabricator.wikimedia.org/T133091#2219667 (10Eevans) [19:27:14] (03PS2) 10Eevans: Disable RESTBase highest max SSTables per read threshold [puppet] - 10https://gerrit.wikimedia.org/r/284262 (https://phabricator.wikimedia.org/T133091) [19:29:58] 06Operations, 10RESTBase-Cassandra, 06Services, 13Patch-For-Review: Highest SSTables / read thresholds - https://phabricator.wikimedia.org/T133091#2219709 (10Eevans) [19:30:29] 06Operations, 13Patch-For-Review, 07developer-notice, 07notice: Migrate argon (irc.wikimedia.org) to Jessie - https://phabricator.wikimedia.org/T123729#2219712 (10Dzahn) reinstalled with public IP as kraz.wikimedia.org, in puppet and up and running. 
[19:35:14] (03CR) 10Andrew Bogott: "auth.conf.orig ends with:" [puppet] - 10https://gerrit.wikimedia.org/r/284103 (owner: 10Andrew Bogott) [19:36:09] (03CR) 10Dzahn: [C: 04-1] add tegmen and einsteinium to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/284264 (https://phabricator.wikimedia.org/T125023) (owner: 10Dzahn) [19:38:26] PROBLEM - puppet last run on kraz is CRITICAL: CRITICAL: Puppet has 3 failures [19:39:38] (03PS1) 10Jcrespo: Revert "Depool one db server from each shard as a backup" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284271 [19:39:46] (03PS2) 10Jcrespo: Revert "Depool one db server from each shard as a backup" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284271 [19:39:50] (03PS1) 10Ottomata: Include analytics_cluster::client and analytics_cluster::database::meta roles on analytics1003 [puppet] - 10https://gerrit.wikimedia.org/r/284272 (https://phabricator.wikimedia.org/T130840) [19:40:59] (03PS1) 10Dzahn: ircserver/irc_echo: use systemd provider if on jessie [puppet] - 10https://gerrit.wikimedia.org/r/284273 (https://phabricator.wikimedia.org/T123729) [19:41:06] (03CR) 10jenkins-bot: [V: 04-1] Include analytics_cluster::client and analytics_cluster::database::meta roles on analytics1003 [puppet] - 10https://gerrit.wikimedia.org/r/284272 (https://phabricator.wikimedia.org/T130840) (owner: 10Ottomata) [19:41:40] (03CR) 10Jcrespo: [C: 032] Revert "Depool one db server from each shard as a backup" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284271 (owner: 10Jcrespo) [19:42:17] (03PS2) 10Ottomata: Include analytics_cluster::client and analytics_cluster::database::meta roles on analytics1003 [puppet] - 10https://gerrit.wikimedia.org/r/284272 (https://phabricator.wikimedia.org/T130840) [19:42:28] (03PS2) 10Dzahn: ircserver/irc_echo: use systemd provider if on jessie [puppet] - 10https://gerrit.wikimedia.org/r/284273 (https://phabricator.wikimedia.org/T123729) [19:42:46] (03CR) 10Dzahn: [C: 032] ircserver/irc_echo: use 
systemd provider if on jessie [puppet] - 10https://gerrit.wikimedia.org/r/284273 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn) [19:43:36] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Revert "Depool one db server from each shard as a backup" (duration: 00m 27s) [19:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:44:15] (03CR) 10Ottomata: [C: 032] Include analytics_cluster::client and analytics_cluster::database::meta roles on analytics1003 [puppet] - 10https://gerrit.wikimedia.org/r/284272 (https://phabricator.wikimedia.org/T130840) (owner: 10Ottomata) [19:44:24] 06Operations: Investigate idle appservers in codfw - https://phabricator.wikimedia.org/T133093#2219746 (10Southparkfan) [19:45:30] (03PS3) 10Dzahn: ircserver/irc_echo: use systemd provider if on jessie [puppet] - 10https://gerrit.wikimedia.org/r/284273 (https://phabricator.wikimedia.org/T123729) [19:45:35] (03PS2) 10Dzahn: add tegmen and einsteinium to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/284264 (https://phabricator.wikimedia.org/T125023) [19:48:59] (03CR) 10Dzahn: "noop on argon - now needs unit files on kraz" [puppet] - 10https://gerrit.wikimedia.org/r/284273 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn) [19:49:19] (03PS3) 10Dzahn: add tegmen and einsteinium to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/284264 (https://phabricator.wikimedia.org/T125023) [19:49:50] (03CR) 10Dzahn: [C: 032] "just start by adding them with base::firewall" [puppet] - 10https://gerrit.wikimedia.org/r/284264 (https://phabricator.wikimedia.org/T125023) (owner: 10Dzahn) [19:51:10] ACKNOWLEDGEMENT - puppet last run on kraz is CRITICAL: CRITICAL: Puppet has 5 failures daniel_zahn T123729 [19:51:28] !log ori@tin Synchronized php-1.27.0-wmf.21/maintenance/rebuildrecentchanges.php: Ie9799f5ea: rebuildrecentchanges: Allow rebuilding specified time range only (duration: 00m 28s) [19:51:32] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:52:55] 06Operations: Investigate idle appservers in codfw - https://phabricator.wikimedia.org/T133093#2219824 (10mark) p:05Triage>03Lowest [19:53:14] !log staggered varnish bans for 'obj.http.server ~ "^mw2.+"' as a workaround for T133069 [19:53:14] T133069: Wrong sidebar cached? on fr.wikisource - https://phabricator.wikimedia.org/T133069 [19:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:55:46] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Puppet has 1 failures [19:57:21] (03PS1) 10Ottomata: analytics1015 -> analytics1003 migration [puppet] - 10https://gerrit.wikimedia.org/r/284276 (https://phabricator.wikimedia.org/T130840) [19:57:44] 06Operations, 07Icinga, 13Patch-For-Review: upgrade neon (icinga) to jessie - https://phabricator.wikimedia.org/T125023#2219847 (10Dzahn) >>! In T125023#2198454, @akosiaris wrote: > @dzahn. We 've already got replacement boxes. > But don't just reuse einsteinium to replace neon, the idea is to manage to kil... [19:57:50] (03CR) 10Ottomata: [C: 04-1] "To be merged during migration" [puppet] - 10https://gerrit.wikimedia.org/r/284276 (https://phabricator.wikimedia.org/T130840) (owner: 10Ottomata) [19:58:05] (03PS1) 10Dzahn: icinga: put role on einsteinium for testing [puppet] - 10https://gerrit.wikimedia.org/r/284277 (https://phabricator.wikimedia.org/T125023) [19:58:43] bblack: question if you guys are not in the midst of switchover crazyness [19:59:20] 06Operations, 10Phabricator, 10Traffic, 10hardware-requests: codfw: (1) phabricator host (backup node) - https://phabricator.wikimedia.org/T131775#2219859 (10RobH) @papaul: Daniel shouldn't have pinged you for this, as I handle the #hardware-requests, you can disregard. Thanks! 
[20:00:57] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 10hardware-requests, and 2 others: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2219864 (10Ottomata) Ok, we are ready to proceed. Plan here: https://etherpad.wikimedia.org/p/analytics-meta 1. stop camus early... [20:01:54] (03PS2) 10Andrew Bogott: Updates to nova policy.json: [puppet] - 10https://gerrit.wikimedia.org/r/284233 (https://phabricator.wikimedia.org/T132187) [20:05:43] nuria_: ? [20:05:51] bblack: yessir, question: [20:06:07] bblack: http headers when read by varnish are case sensitive correct? [20:06:10] (03Abandoned) 10Thcipriani: Bump portals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280456 (https://phabricator.wikimedia.org/T130514) (owner: 10Thcipriani) [20:06:26] bblack: depends... [20:06:40] (03PS2) 10Mattflaschen: Beta Cluster: Use ExternalStore on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282440 (https://phabricator.wikimedia.org/T95871) [20:06:51] bblack: ajam, depends on ..? [20:07:06] well a lot of things [20:07:22] the values are going to be case-sensitive, unless we do a case-insensitive regex match [20:07:24] jynus, I know you're probably really busy with the datacenter switchover, but https://gerrit.wikimedia.org/r/#/c/282440/ could use review whenever you are able to get to it. Let me know if I can clarify anything. [20:07:40] I don't think the keys are sensitive (req.http.host and req.http.Host should have the same meaning) [20:07:51] matt_flaschen, while that is great, worse timing possible [20:07:56] but there are probably other ways you could mean that question too [20:08:09] bblack: right, so this by default [20:08:10] https://github.com/wikimedia/operations-puppet/blob/production/templates/varnish/analytics.inc.vcl.erb#L162 [20:08:27] Yeah, doesn't have to be today, or even this week. Just wanted to reach out now that you're back. 
[20:08:44] nuria_: the line you linked, the case of 'X-Analytics' doesn't matter [20:08:44] bblack: you think is not case sensitive? [20:09:13] !log ran `mwscript rebuildrecentchanges.php --wiki=testwiki --from=20160419144741 --to=20160419151018` [20:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:09:33] nuria_: some days I'm not even sure whether I'm breathing real air, nothing is 100% :) [20:09:52] nuria_: but I'm pretty sure in the header name matches for req.http.FoO, case does not matter [20:10:07] bblack: see above -- I'm banning caches for pages served by mw2xxx [20:10:17] bblack: ok, req and resp, right? [20:10:43] backend codfw is done, there was a small spike on appserver traffic that remained for a while [20:10:54] it's dropped a bit now, still a little more elevated than normal [20:11:05] I just banned it from ulsfo backends slowly [20:11:12] ban? [20:11:14] yeah [20:11:19] 06Operations, 10OTRS: OTRS has been eating 100% of mendelevium's CPU for the last fortnight - https://phabricator.wikimedia.org/T132822#2219937 (10Legoktm) [20:11:28] yeah ok [20:11:41] sorry, I'm still catching up. if what you're doing is working, keep at it :) [20:11:46] obj.http.server ~ "^mw2.+" [20:11:47] bblack: sidebar HTML was bad, had to ban 'obj.http.server ~ '^mw2.*"' [20:12:41] (T133069 is the task) [20:12:41] T133069: Wrong sidebar cached? on fr.wikisource - https://phabricator.wikimedia.org/T133069 [20:13:12] yeah I've seen the task [20:13:33] please don't close it after the purge though, because IMHO the code's behavior is wrong regardless of any fix we do here [20:13:51] <_joe_> bblack: I already said that too :) [20:14:30] nuria_: yes, req and resp are the same. 
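bblack's distinction above can be shown in VCL itself. Illustrative only, not taken from the live analytics VCL: header names are case-insensitive, so `req.http.host` and `req.http.Host` are the same header, while header values match case-sensitively unless the regex opts out with `(?i)`:

```vcl
# Illustrative VCL, not production config.
sub vcl_recv {
    # req.http.x-analytics and req.http.X-Analytics name the same header:
    # header *names* are case-insensitive.
    if (req.http.X-Analytics ~ "(?i)proxy=opera") {
        # Header *values* are case-sensitive; the "(?i)" flag is what
        # lets this also match "Proxy=Opera" or "PROXY=OPERA".
        set req.http.X-Debug-Match = "1";
    }
}
```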
[20:14:58] I wouldn't :) [20:17:11] (03PS1) 10Dzahn: ircserver: add systemd unit file and conditionals [puppet] - 10https://gerrit.wikimedia.org/r/284293 (https://phabricator.wikimedia.org/T123729) [20:20:21] of course https://wikitech.wikimedia.org/wiki/Varnish#How_to_execute_a_ban_across_a_cluster is eqiad-specific heh [20:21:22] heh [20:21:26] I'm not following that exactly anyway [20:21:36] I'm not doing "not codfw", I'm going site per site to be on the safe side [20:21:49] ok [20:21:54] (03PS2) 10Dzahn: ircserver: add systemd unit file and conditionals [puppet] - 10https://gerrit.wikimedia.org/r/284293 (https://phabricator.wikimedia.org/T123729) [20:22:07] well the 3x commands there are still the "right thing", just s/eqiad/codfw/ [20:22:09] it doesn't look like anything we can't handle atm, but it's still a considerable amount of extra traffic [20:22:20] you can break them into DC sub-steps, but not change the ordering [20:22:22] (much) [20:22:33] I know :) [20:23:12] for bans that aren't super-time-critical, spacing out BE from FE can really reduce the impact too [20:23:25] that's what i'm doing [20:23:42] spacing out codfw be from the rest of the bes, and then fes [20:24:13] so that the other sites and frontends can absorb a little this extra load [20:24:28] yeah [20:24:30] this isn't super-time-critical, just annoying I assume [20:24:55] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Application+servers+codfw&m=cpu_report&s=by+name&mc=2&g=network_report [20:25:14] that's not so bad [20:25:15] it's 10-15% [20:25:40] (03PS3) 10Dzahn: ircserver: add systemd unit file and conditionals [puppet] - 10https://gerrit.wikimedia.org/r/284293 (https://phabricator.wikimedia.org/T123729) [20:25:49] (03PS3) 10Volans: MariaDB: complete TLS and master configuration [puppet] - 10https://gerrit.wikimedia.org/r/283771 (https://phabricator.wikimedia.org/T111654) [20:27:19] (03CR) 10Dzahn: "noop on argon http://puppet-compiler.wmflabs.org/2505/" [puppet] - 
10https://gerrit.wikimedia.org/r/284293 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn) [20:27:29] (03CR) 10Dzahn: [C: 032] ircserver: add systemd unit file and conditionals [puppet] - 10https://gerrit.wikimedia.org/r/284293 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn) [20:28:31] (03CR) 10Andrew Bogott: [C: 032] Updates to nova policy.json: [puppet] - 10https://gerrit.wikimedia.org/r/284233 (https://phabricator.wikimedia.org/T132187) (owner: 10Andrew Bogott) [20:29:11] so reliable that somebody merges while you are waiting for jenkins [20:29:18] (03PS4) 10Dzahn: ircserver: add systemd unit file and conditionals [puppet] - 10https://gerrit.wikimedia.org/r/284293 (https://phabricator.wikimedia.org/T123729) [20:29:38] (03CR) 10Dzahn: [V: 032] ircserver: add systemd unit file and conditionals [puppet] - 10https://gerrit.wikimedia.org/r/284293 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn) [20:31:09] traffic is entirely back to regular levels now [20:31:16] bans on ulsfo/eqiad backends didn't even make a dent [20:31:18] (03CR) 10Dzahn: "and next issue is now: Could not find dependency File[/etc/init/ircd.conf] :p" [puppet] - 10https://gerrit.wikimedia.org/r/284293 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn) [20:32:30] bblack: did you have a way to ban specific time ranges? [20:32:35] 06Operations, 10Wikimedia-General-or-Unknown, 05codfw-rollout: Wrong sidebar cached? on fr.wikisource - https://phabricator.wikimedia.org/T133069#2218817 (10BBlack) >>! In T133069#2219464, @Joe wrote: > So, during the switchover we first wiped the codfw memcached clean, then when moving the traffic over we h... [20:32:40] I remember you saying that you had tried something before but wasn't sure if it worked or something [20:32:44] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail [20:33:13] paravoid: yeah, I'm not sure if it works [20:33:28] I think I used obj.http.Date? 
[20:33:39] there's a backend-timing header too, it contains a timestamp from the backend [20:33:54] I mean it should work, but in the scenario I once tried it, I wasn't sure of the result [20:34:14] and yeah there's now that: [20:34:15] < Backend-Timing: D=53212 t=1461087447104093 [20:34:28] where t is epoch time generated on the MW side I believe [20:34:38] yes [20:34:41] cool [20:34:49] so yeah, next time let's try that instead [20:34:58] far less objects to ban :) [20:35:34] (03PS1) 10Dzahn: ircserver: fix dependencies for running on jessie [puppet] - 10https://gerrit.wikimedia.org/r/284343 (https://phabricator.wikimedia.org/T123729) [20:35:42] well, if we're confident about the window during which the ES/mc issue could cause large scale defaulting problems [20:35:51] then yeah we could've been more accurate there [20:35:57] (03CR) 10jenkins-bot: [V: 04-1] ircserver: fix dependencies for running on jessie [puppet] - 10https://gerrit.wikimedia.org/r/284343 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn) [20:35:59] (03PS2) 10Dzahn: ircserver: fix dependencies for running on jessie [puppet] - 10https://gerrit.wikimedia.org/r/284343 (https://phabricator.wikimedia.org/T123729) [20:39:06] 06Operations: Migrate hydrogen/chromium to jessie - https://phabricator.wikimedia.org/T123727#2220109 (10Dzahn) since these are dnsrecursors (i addition to urldownloader), what steps have to be taken before one of them can be taken down for reinstall? any? 
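The Backend-Timing approach floated above, banning by when the backend generated the object rather than by `obj.http.server`, relies on the `t=` field being the epoch time in microseconds from the MediaWiki side. A Python sketch of the window test such a ban would apply; the real thing would be a varnish ban expression, which this does not attempt to construct:

```python
# Sketch of the check a time-range ban keyed on Backend-Timing would perform.
# The header looks like "D=53212 t=1461087447104093"; t is epoch microseconds.

def backend_time(header):
    """Extract the epoch time (seconds, float) from a Backend-Timing header."""
    for field in header.split():
        if field.startswith('t='):
            return int(field[2:]) / 1e6
    return None

def in_window(header, start, end):
    """True if the response was generated inside [start, end), epoch seconds."""
    t = backend_time(header)
    return t is not None and start <= t < end
```

Banning on a time window like this would touch far fewer objects than a blanket `^mw2.+` server match.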
[20:40:10] (03CR) 10Dzahn: [C: 032] "noop on argon http://puppet-compiler.wmflabs.org/2506/" [puppet] - 10https://gerrit.wikimedia.org/r/284343 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn) [20:44:02] !log on all wikis, deleting from recentchanges where rc_timestamp > 20160419144741 and rc_timestamp < 20160419151018 [20:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:44:30] (03CR) 10Dzahn: "service would be running now if it wasn't for the next problem: python-irclib doesnt exist here" [puppet] - 10https://gerrit.wikimedia.org/r/284343 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn) [20:44:38] 06Operations, 10MediaWiki-Cache, 10Wikimedia-General-or-Unknown, 05codfw-rollout: Wrong sidebar cached? on fr.wikisource - https://phabricator.wikimedia.org/T133069#2220141 (10mark) [20:46:33] fucking salt [20:46:36] 06Operations, 10DBA, 13Patch-For-Review: Perform a rolling restart of all MySQL slaves (masters too for those services with low traffic) - https://phabricator.wikimedia.org/T120122#2220150 (10Volans) When rolling restart also check the error log, if too big let's rotate it and compress/delete the old one bas... [20:47:11] wtf [20:47:17] cp1008.wikimedia.org: [20:47:17] pc1006.eqiad.wmnet: Minion did not return. [No response] [20:47:17] wtp2007.codfw.wmnet: Minion did not return. [No response] [20:47:24] 06Operations, 13Patch-For-Review, 07developer-notice, 07notice: Migrate argon (irc.wikimedia.org) to Jessie - https://phabricator.wikimedia.org/T123729#2220156 (10Dzahn) The IRCd service could be starting on jessie now, the unit file is there, the dependencies are adjusted if on jessie, but the next proble... 
[20:47:25] these were not in my set [20:47:46] it first printed my set, then a bunch of "no response" for completely unrelated hosts [20:48:33] Krinkle: fyi, no "python-irclib" on jessie is the next blocker [20:48:44] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:49:10] <_joe_> paravoid: that's the beauty of salt [20:49:28] <_joe_> you did use -v, right? [20:49:32] it's so insanely broken all the time in ways one cannot even imagine [20:50:10] <_joe_> no it's just telling you [20:50:17] <_joe_> that those hosts are not responding [20:50:17] telling me what? [20:50:26] <_joe_> so it can't select their grains :P [20:50:26] but I didn't pick those hosts [20:50:34] well [20:50:38] it was hundreds of hosts [20:50:38] <_joe_> you did salt -G 'something' right? [20:50:40] so how very useful [20:50:50] yeah it's semi-normal, but it doesn't always happen either [20:50:51] -C G@, yes [20:51:00] the only way to avoid it for sure is to use batch-mode [20:51:10] 06Operations, 07developer-notice, 07notice: build python-irclib for jessie - https://phabricator.wikimedia.org/T133101#2220169 (10Dzahn) [20:51:12] <_joe_> and since salt doesn't seem to cache that data...
[20:51:13] usually now when I don't watch it batched, I do "-b 10000" or whatever [20:51:28] <_joe_> bblack: rotfl [20:51:29] 06Operations, 10Ops-Access-Requests: Requesting access to hive for AGomez (WMF) - https://phabricator.wikimedia.org/T133102#2220183 (10atgo) [20:51:44] and set the timeout small, because in batch mode the timeout is for gathering batch nodes, not for the command itself (don't ask me how you set that in batch mode) [20:51:52] 06Operations: build python-irclib for jessie - https://phabricator.wikimedia.org/T133101#2220214 (10Dzahn) [20:52:14] but then also, sometimes when you over-batch, it still batches [20:52:37] I've seen "-t 10 -b 17" on a set of 16 hosts execute it as the first 3 then the next 13 [20:52:53] maybe in the first 10 seconds it only got 3, but it still tries to get the rest while they're executing? no idea [20:53:03] salt CLI behavior = never makes rational sense [20:54:02] <_joe_> bblack: to relieve yourself from that nonsense, try 'puppet help help help' :P [20:54:49] mutante: Hm.. too bad. [20:54:53] that actually keeps giving new better output up through "help help help help help" [20:54:54] * volans deja vu [20:55:17] at which point it gives you ascii art and some philosophy [20:55:18] mutante: btw, I guess we'll need a similar geneti box in both dcs, right? Or are we going to permanently have irc.wm.o running cross-dc?
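Pulling together the salt workarounds described above (batch mode plus a short gather timeout), the invocation looks roughly like this; the grain target and command are placeholders:

```shell
# Sketch only: -b caps concurrency (an oversized batch effectively batches
# everything at once), -t bounds the minion-gathering timeout so unrelated
# "Minion did not return" noise is avoided, -v shows per-minion progress.
salt -v -t 10 -b 10000 -C 'G@cluster:cache_text' cmd.run 'uptime'
```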
[20:55:29] (re: irc on jessie) [20:56:13] (03PS1) 10Ottomata: [WIP] Add new confluent module and puppetization for using confluent Kafka [puppet] - 10https://gerrit.wikimedia.org/r/284349 (https://phabricator.wikimedia.org/T132631) [20:56:56] i dont know that part yet, i just focused on making it possible that it runs on jessie so far [20:57:09] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Add new confluent module and puppetization for using confluent Kafka [puppet] - 10https://gerrit.wikimedia.org/r/284349 (https://phabricator.wikimedia.org/T132631) (owner: 10Ottomata) [20:57:12] hahah [20:57:21] (03PS4) 10Volans: MariaDB: complete TLS and master configuration [puppet] - 10https://gerrit.wikimedia.org/r/283771 (https://phabricator.wikimedia.org/T111654) [20:58:14] but i think so, yea, at one point we want all things in both dcs [21:00:23] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures [21:05:24] (03CR) 10Dzahn: "Is that extension going to be deployed in read/write mode anytime soon? Or is it waiting for this to close the circle?" [puppet] - 10https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) (owner: 10ArielGlenn) [21:06:06] (03PS1) 10Faidon Liambotis: exim4: adjust WIKI_INTERFACE based on $::site [puppet] - 10https://gerrit.wikimedia.org/r/284352 [21:07:13] (03CR) 10Faidon Liambotis: [C: 032] exim4: adjust WIKI_INTERFACE based on $::site [puppet] - 10https://gerrit.wikimedia.org/r/284352 (owner: 10Faidon Liambotis) [21:08:33] (03CR) 10Ottomata: "Nice, looks good, just a couple of nits:" (032 comments) [puppet/cdh] - 10https://gerrit.wikimedia.org/r/284204 (https://phabricator.wikimedia.org/T127990) (owner: 10Elukey) [21:10:30] 06Operations, 10MediaWiki-Cache, 10Wikimedia-General-or-Unknown, 05codfw-rollout: Wrong sidebar cached? on fr.wikisource - https://phabricator.wikimedia.org/T133069#2220356 (10faidon) p:05Triage>03Unbreak! 
Varnish bans for `obj.http.server ~ ^mw2.+` were gradually deployed over the course of the past h... [21:12:48] !log clearing the exim4 retry database on mx2001 [21:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:13:07] (03CR) 10Ottomata: Add the possibility to set an external database for Hue. (032 comments) [puppet/cdh] - 10https://gerrit.wikimedia.org/r/284204 (https://phabricator.wikimedia.org/T127990) (owner: 10Elukey) [21:13:24] 9011 emails queued... [21:13:50] Reedy: ^ [21:13:53] that's your queue [21:13:56] *cue [21:15:54] que? [21:17:25] ori: My cue for what? [21:17:35] over 9,000 :( [21:17:42] haha [21:25:45] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [21:27:44] !log running rebuildrecentchanges.php --from=20160419144741 --to=20160419151018 on all wikis [21:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:33:05] 06Operations, 10Monitoring: Check for an oversized exim4 queue indicating mail delivery failures - https://phabricator.wikimedia.org/T133110#2220489 (10faidon) [21:35:11] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Miscellaneous+codfw&h=mx2001.wikimedia.org&jr=&js=&event=hide&ts=0&v=8358&m=exim+queued+messages&vl=messages [21:35:18] hey, ganglia isn't https [21:35:20] ;) [21:38:23] (03CR) 10Legoktm: "It's going to happen real soon™, I just need to write the apache config for the extension (or convince someone to do it for me), but we're" [puppet] - 10https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) (owner: 10ArielGlenn) [21:41:14] 06Operations, 06Labs, 10Monitoring, 10wikitech.wikimedia.org: Bacula recovery of sql files from silver/wikitech fails - https://phabricator.wikimedia.org/T131195#2220548 (10akosiaris) So restoring on different clients in not really possible without some mambo jumbo first. 
Namely http://www.bacula.org/5.2.x...
[21:41:35] PROBLEM - MariaDB Slave Lag: s1 on db1066 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 337.42 seconds
[21:42:55] PROBLEM - MariaDB Slave Lag: s1 on db1065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 418.19 seconds
[21:48:49] ostriches: are you still working on ytterbium?
[21:48:54] (puppet is still disabled)
[21:49:52] (03PS3) 10Faidon Liambotis: Disable RESTBase highest max SSTables per read threshold [puppet] - 10https://gerrit.wikimedia.org/r/284262 (https://phabricator.wikimedia.org/T133091) (owner: 10Eevans)
[21:50:05] (03CR) 10Faidon Liambotis: [C: 032] Disable RESTBase highest max SSTables per read threshold [puppet] - 10https://gerrit.wikimedia.org/r/284262 (https://phabricator.wikimedia.org/T133091) (owner: 10Eevans)
[21:50:09] (03CR) 10JanZerebecki: "Bug are there still any mails being sent from that domain?" [dns] - 10https://gerrit.wikimedia.org/r/283870 (https://phabricator.wikimedia.org/T131930) (owner: 10Mschon)
[21:51:13] (03CR) 10Faidon Liambotis: [V: 032] Disable RESTBase highest max SSTables per read threshold [puppet] - 10https://gerrit.wikimedia.org/r/284262 (https://phabricator.wikimedia.org/T133091) (owner: 10Eevans)
[21:55:37] paravoid: already noted in the last table here: https://phabricator.wikimedia.org/T132521#2202245
[21:55:42] (03CR) 10JanZerebecki: [C: 031] fix puppet-lint alignment [puppet/kafka] - 10https://gerrit.wikimedia.org/r/283857 (owner: 10Mschon)
[21:56:07] * volans looking at DB lag ^^^
[21:56:23] !bash 9011 emails queued... Reedy: ^ that's your cue My cue for what?
over 9,000 :(
[21:56:23] YuviPanda: Stored quip at https://tools.wmflabs.org/bash/quip/AVQwhjiagCrwkbTdmBIT
[21:56:37] (03CR) 10JanZerebecki: [C: 031] fixed puppet-lint alignment [puppet/kafka] - 10https://gerrit.wikimedia.org/r/283856 (owner: 10Mschon)
[21:56:56] volans: it's the query to insert missing entries into the rc table
[21:58:46] 06Operations, 10Ops-Access-Requests: Requesting shell access to Labs for Yann - https://phabricator.wikimedia.org/T133115#2220671 (10Yann)
[21:59:01] (03CR) 10JanZerebecki: "And if yes, are they only being sent from 208.80.155.197, the toolserver.org address which is listed as its own mx?" [dns] - 10https://gerrit.wikimedia.org/r/283870 (https://phabricator.wikimedia.org/T131930) (owner: 10Mschon)
[21:59:42] 06Operations, 10Ops-Access-Requests: Requesting shell access to Labs for Yann - https://phabricator.wikimedia.org/T133115#2220693 (10Yann) Preferred shell user name: yann/yannf
[22:00:09] 06Operations, 10Ops-Access-Requests: Requesting shell access to Labs for Yann - https://phabricator.wikimedia.org/T133115#2220671 (10Krenair) Labs access requests go on wikitech, not phabricator
[22:00:27] YuviPanda, ^
[22:02:20] 06Operations, 10Ops-Access-Requests: Requesting shell access to Labs for Yann - https://phabricator.wikimedia.org/T133115#2220724 (10yuvipanda) 05Open>03Invalid Hello! You can just create an account on wikitech.wikimedia.org and get access. https://wikitech.wikimedia.org/wiki/Help:Access has more informat...
[22:08:10] ACKNOWLEDGEMENT - MariaDB Slave Lag: s1 on db1065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1836.82 seconds Volans Script to adjust RebuildRecentchanges::rebuildRecentChangesTablePass1 (Ori) due to missing index
[22:08:10] ACKNOWLEDGEMENT - MariaDB Slave Lag: s1 on db1066 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1882.00 seconds Volans Script to adjust RebuildRecentchanges::rebuildRecentChangesTablePass1 (Ori) due to missing index
[22:08:49] thanks
[22:28:51] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2220771 (10ezachte) I looked at the backups at stat1001. I need to tidy things up. Some backups occur too often, and have a lot of garbage in it. Apologies for the overhead this...
[22:39:59] (03PS1) 10Ori.livneh: ganglia-web: disable overlay_events [puppet] - 10https://gerrit.wikimedia.org/r/284369
[22:41:33] (03CR) 10Ori.livneh: [C: 032] ganglia-web: disable overlay_events [puppet] - 10https://gerrit.wikimedia.org/r/284369 (owner: 10Ori.livneh)
[22:42:41] !log killing rc insert query on db1065 and db1066
[22:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:54:52] (03PS1) 10Ladsgroup: ores: Add logformat It adds user agent and removes req and app logging.
[puppet] - 10https://gerrit.wikimedia.org/r/284373 (https://phabricator.wikimedia.org/T113754)
[22:56:51] RECOVERY - MariaDB Slave Lag: s1 on db1066 is OK: OK slave_sql_lag Replication lag: 0.05 seconds
[22:58:46] (03CR) 10Ladsgroup: "I've got this from http://uwsgi.unbit.narkive.com/jEtphIzE/default-log-format-explained" [puppet] - 10https://gerrit.wikimedia.org/r/284373 (https://phabricator.wikimedia.org/T113754) (owner: 10Ladsgroup)
[23:00:08] (03PS2) 10Ladsgroup: ores: Add logformat [puppet] - 10https://gerrit.wikimedia.org/r/284373 (https://phabricator.wikimedia.org/T113754)
[23:03:01] RECOVERY - MariaDB Slave Lag: s1 on db1065 is OK: OK slave_sql_lag Replication lag: 0.44 seconds
[23:12:53] (03CR) 10Volans: "@JCrespo: pupper compiler results here: https://puppet-compiler.wmflabs.org/2507/" [puppet] - 10https://gerrit.wikimedia.org/r/283771 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans)
[23:18:58] 06Operations, 06Discovery, 10hardware-requests: Refresh elastic10{01..16}.eqiad.wmnet servers - https://phabricator.wikimedia.org/T128000#2221191 (10Deskana)
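[Editor's note: the "ores: Add logformat" change discussed above (gerrit 284373) adds the user agent to the service's uwsgi request log. As a rough sketch only (not the actual patch contents), a uwsgi `log-format` line carrying the user agent could look like this, using uwsgi's documented request-logging variables:]

```ini
; hypothetical uwsgi log-format sketch, not the contents of gerrit 284373;
; %(addr), %(ltime), %(method), %(uri), %(proto), %(status), %(size),
; %(referer) and %(uagent) are standard uwsgi request variables
[uwsgi]
log-format = %(addr) [%(ltime)] "%(method) %(uri) %(proto)" %(status) %(size) "%(referer)" "%(uagent)"
```

[The actual format chosen in the patch may differ; per the 22:58 comment, the default format's fields are explained at the linked uwsgi mailing-list thread.]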