[00:13:01] RECOVERY - puppet last run on mw2112 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[00:51:15] RECOVERY - SSH on serpens is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[00:57:04] PROBLEM - SSH on serpens is CRITICAL: Server answer
[01:01:05] RECOVERY - SSH on serpens is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[01:06:45] PROBLEM - SSH on serpens is CRITICAL: Server answer
[01:12:25] RECOVERY - SSH on serpens is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[01:20:04] PROBLEM - SSH on serpens is CRITICAL: Server answer
[01:21:54] RECOVERY - SSH on serpens is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[01:37:15] PROBLEM - SSH on serpens is CRITICAL: Server answer
[01:39:05] RECOVERY - SSH on serpens is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[01:44:45] PROBLEM - SSH on serpens is CRITICAL: Server answer
[02:03:56] RECOVERY - SSH on serpens is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[02:15:25] PROBLEM - SSH on serpens is CRITICAL: Server answer
[02:19:15] RECOVERY - SSH on serpens is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[02:21:15] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.22) (duration: 09m 31s)
[02:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:30:33] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon May 2 02:30:33 UTC 2016 (duration 9m 18s)
[02:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:32:45] PROBLEM - SSH on serpens is CRITICAL: Server answer
[02:57:25] RECOVERY - SSH on serpens is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[03:03:15] PROBLEM - SSH on serpens is CRITICAL: Server answer
[03:05:14] RECOVERY - SSH on serpens is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[03:16:26] PROBLEM - SSH on serpens is CRITICAL: Server answer
[03:24:04] RECOVERY - SSH on serpens is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[03:29:45] PROBLEM - SSH on serpens is CRITICAL: Server answer
[03:39:01] RECOVERY - SSH on serpens is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[03:48:22] PROBLEM - SSH on serpens is CRITICAL: Server answer
[04:01:31] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:09:32] RECOVERY - SSH on serpens is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[04:20:52] PROBLEM - SSH on serpens is CRITICAL: Server answer
[04:27:51] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0]
[04:27:52] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:30:22] RECOVERY - SSH on serpens is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[04:35:42] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[04:36:12] PROBLEM - SSH on serpens is CRITICAL: Server answer
[04:41:53] RECOVERY - SSH on serpens is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[04:47:33] PROBLEM - SSH on serpens is CRITICAL: Server answer
[04:51:11] Operations, ContentTranslation-cxserver, MediaWiki-extensions-ContentTranslation, Services, and 2 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2255696 (KartikMistry)
[04:53:12] RECOVERY - SSH on serpens is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[04:58:53] PROBLEM - SSH on serpens is CRITICAL: Server answer
[05:48:57] PROBLEM - puppet last run on mw2105 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:05:45] (PS1) Yuvipanda: labs: Enable dumps NFS for the math project [puppet] - https://gerrit.wikimedia.org/r/286389 (https://phabricator.wikimedia.org/T134026)
[06:15:47] RECOVERY - puppet last run on mw2105 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:18:16] (Draft1) Physikerwelt: Enable MathML rendering by default [mediawiki-config] - https://gerrit.wikimedia.org/r/286180 (https://phabricator.wikimedia.org/T131177)
[06:18:26] (PS2) Physikerwelt: Enable MathML rendering by default [mediawiki-config] - https://gerrit.wikimedia.org/r/286180 (https://phabricator.wikimedia.org/T131177)
[06:31:06] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:17] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:16] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:17] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:42:06] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[06:42:11] (PS2) Muehlenhoff: Remove access credentials for Moiz [puppet] - https://gerrit.wikimedia.org/r/285341
[06:42:26] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[06:46:33] <_joe_> !log rebooting serpens from ganeti, unreachable
[06:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:49:37] RECOVERY - salt-minion processes on serpens is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:49:37] RECOVERY - Check size of conntrack table on serpens is OK: OK: nf_conntrack is 0 % full
[06:49:46] RECOVERY - DPKG on serpens is OK: All packages OK
[06:49:47] RECOVERY - configured eth on serpens is OK: OK - interfaces up
[06:49:47] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:49:48] RECOVERY - Disk space on serpens is OK: DISK OK
[06:50:07] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:50:16] RECOVERY - Labs LDAP on serpens is OK: LDAP OK - 0.124 seconds response time
[06:50:16] RECOVERY - SSH on serpens is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[06:50:26] RECOVERY - dhclient process on serpens is OK: PROCS OK: 0 processes with command name dhclient
[06:50:36] RECOVERY - RAID on serpens is OK: OK: no RAID installed
[06:51:16] RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:52:29] (CR) Muehlenhoff: [C: 2 V: 2] Remove access credentials for Moiz [puppet] - https://gerrit.wikimedia.org/r/285341 (owner: Muehlenhoff)
[06:57:26] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:58:16] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:27] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:27] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:01:45] (CR) Giuseppe Lavagetto: "I would probably do it in a later commit tbh" [puppet] - https://gerrit.wikimedia.org/r/285368 (https://phabricator.wikimedia.org/T126310) (owner: Giuseppe Lavagetto)
[07:01:52] (PS3) Giuseppe Lavagetto: mediawiki::web: drop HHVM define, explicitly block php [puppet] - https://gerrit.wikimedia.org/r/285368 (https://phabricator.wikimedia.org/T126310)
[07:08:09] RECOVERY - NTP on serpens is OK: NTP OK: Offset -0.0005373954773 secs
[07:10:16] !log installing poppler security updates
[07:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:11:39] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[07:23:29] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: puppet fail
[07:34:54] Operations, DBA, Patch-For-Review: Implement mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135#2255880 (jcrespo)
[07:44:44] !log rebooting hasseleh/hassium for kernel upgrade to 4.4
[07:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:50:28] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:05:44] (PS1) Muehlenhoff: Update to 3.19.8-ckt20 [debs/linux] - https://gerrit.wikimedia.org/r/286393
[08:11:33] (CR) Muehlenhoff: [C: 2 V: 2] Update to 3.19.8-ckt20 [debs/linux] - https://gerrit.wikimedia.org/r/286393 (owner: Muehlenhoff)
[08:14:57] !log Restarted stuck Jenkins (due to IRC plugin)
[08:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:22:53] looking at Icinga, I see an alert on lvs1009 about "Could not depool server elastic1031.eqiad.wmnet because of too many down".
[08:23:19] As far as I can see (confctl) all elastic servers seem to be pooled...
[08:24:07] <_joe_> gehel: lvs1009 has some connectivity issues
[08:24:43] Ok, so sorry for the noise...
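The "Could not depool server ... because of too many down" alert above comes from the load balancer's depool-threshold safeguard: it refuses to remove failed backends when doing so would leave too little capacity pooled. A minimal sketch of that logic, with hypothetical names (this is a generic illustration, not PyBal's actual code or API):

```python
# Hypothetical depool-threshold check: refuse to depool failed servers
# if that would drop the pooled fraction below `threshold` of the total.

def can_depool(servers, down, threshold=0.5):
    """Return True if all down servers can be depooled while keeping at
    least `threshold` of all servers pooled."""
    total = len(servers)
    remaining = total - len([s for s in servers if s in down])
    return remaining >= threshold * total

servers = ["elastic1030", "elastic1031", "elastic1032", "elastic1033"]
# One failure out of four: 75% stays pooled, depooling is allowed.
assert can_depool(servers, {"elastic1031"})
# Three failures: only 25% would stay pooled -> "too many down", refuse.
assert not can_depool(servers, {"elastic1031", "elastic1032", "elastic1033"})
```

This also explains why the alert fires when the LVS host itself has connectivity issues: every backend looks down at once, so the threshold trips and nothing gets depooled.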
[08:27:32] (PS1) KartikMistry: cxserver: scap3 migration [puppet] - https://gerrit.wikimedia.org/r/286395
[08:28:59] (CR) jenkins-bot: [V: -1] cxserver: scap3 migration [puppet] - https://gerrit.wikimedia.org/r/286395 (owner: KartikMistry)
[08:38:50] (PS2) Yuvipanda: labs: Enable dumps NFS for the math project [puppet] - https://gerrit.wikimedia.org/r/286389 (https://phabricator.wikimedia.org/T134026)
[08:38:56] (PS2) KartikMistry: cxserver: scap3 migration [puppet] - https://gerrit.wikimedia.org/r/286395 (https://phabricator.wikimedia.org/T120104)
[08:38:58] (CR) Yuvipanda: [C: 2 V: 2] labs: Enable dumps NFS for the math project [puppet] - https://gerrit.wikimedia.org/r/286389 (https://phabricator.wikimedia.org/T134026) (owner: Yuvipanda)
[08:50:57] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[08:51:26] (PS1) Gehel: Make es-tool more robust when checking for cluster health [puppet] - https://gerrit.wikimedia.org/r/286397
[08:52:45] (CR) Mobrovac: [C: -1] "One comment in-lined. Also, this cannot be merged before the matching cxserver deploy repo patch is in place." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/286395 (https://phabricator.wikimedia.org/T120104) (owner: KartikMistry)
[08:55:34] (CR) KartikMistry: cxserver: scap3 migration (1 comment) [puppet] - https://gerrit.wikimedia.org/r/286395 (https://phabricator.wikimedia.org/T120104) (owner: KartikMistry)
[08:56:02] (PS3) KartikMistry: cxserver: scap3 migration [puppet] - https://gerrit.wikimedia.org/r/286395 (https://phabricator.wikimedia.org/T120104)
[09:00:35] (CR) DCausse: [C: 1] Make es-tool more robust when checking for cluster health [puppet] - https://gerrit.wikimedia.org/r/286397 (owner: Gehel)
[09:06:21] morebots: https://gerrit.wikimedia.org/r/#/c/286400/ for cxserver/deploy
[09:06:22] I am a logbot running on tools-exec-1207.
[09:06:22] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[09:06:22] To log a message, type !log .
[09:06:30] damn.
[09:06:43] mobrovac: https://gerrit.wikimedia.org/r/#/c/286400/ for cxserver/deploy scap3.
[09:06:57] kk
[09:07:24] heh kart_, people often make that mistake of mentioning morebots instead of me
[09:07:31] :)
[09:07:35] perhaps i should change my nick
[09:07:41] We need to change the name of the bot
[09:07:45] :)
[09:12:56] mobrovac: should I fix documentation based on your comment?
[09:13:04] https://wikitech.wikimedia.org/wiki/Services/Scap_Migration
[09:13:09] kart_: i already did :)
[09:13:16] oh. good. Thanks!
[09:13:18] c/p fail
[09:22:30] (CR) Hashar: [C: 1] deployment-prep shinken: Remove old HHVM queue size check [puppet] - https://gerrit.wikimedia.org/r/286023 (owner: Alex Monk)
[09:24:32] jynus: good morning :-} I have filed " GET_LOCK('CategoryMembershipUpdates: " merely because I have no idea what it means
[09:24:51] there was a bunch of them in fatalmonitor. Feel free to close the task
[09:26:35] I think it is a legitimate error
[09:27:14] CategoryMembershipChangeJob is referenced as a potential cause of memleak in hhvm and or maybe it needs to be batched somehow
[09:28:01] there are numerous jobs that do not take into account huge numbers of updates
[09:28:24] and break it down appropriately
[09:29:06] see https://phabricator.wikimedia.org/T134136
[09:35:45] (PS2) Gehel: Make es-tool more robust when checking for cluster health [puppet] - https://gerrit.wikimedia.org/r/286397
[09:37:25] (CR) Gehel: [C: 2] Make es-tool more robust when checking for cluster health [puppet] - https://gerrit.wikimedia.org/r/286397 (owner: Gehel)
[09:52:12] PROBLEM - RAID on ms-be1002 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline)
[09:52:31] Operations, DBA, Patch-For-Review: implement performance_schema for mysql monitoring - https://phabricator.wikimedia.org/T99485#2256193 (jcrespo)
[09:52:33] Operations, DBA, Traffic, WMF-Legal, and 2 others: dbtree loads third party resources (from jquery.com and google.com) - https://phabricator.wikimedia.org/T96499#2256194 (jcrespo)
[09:52:35] Operations, DBA: Decide storage backend for performance schema monitoring stats - https://phabricator.wikimedia.org/T119619#2256191 (jcrespo) Open>stalled a: jcrespo>None
[09:52:57] Operations, DBA, Patch-For-Review: Better mysql monitoring for number of connections and processlist strange patterns - https://phabricator.wikimedia.org/T112473#2256197 (jcrespo) a: jcrespo>None
[09:53:12] PROBLEM - Disk space on ms-be1002 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdf1 is not accessible: Input/output error
[09:58:28] Operations, Patch-For-Review, Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2256207 (jcrespo)
[09:58:30] Operations, DBA, Patch-For-Review: upgrade dbproxy1001/1002 to jessie - https://phabricator.wikimedia.org/T125027#2256204 (jcrespo) Open>stalled a: jcrespo>None
[09:58:44] !log uploaded openldap 2.4.41+wmf1 for jessie-wikimedia to carbon (T130593)
[09:58:45] T130593: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593
[09:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:58:52] PROBLEM - puppet last run on ms-be1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:01:12] RECOVERY - Disk space on ms-be1002 is OK: DISK OK
[10:02:55] (PS2) Yuvipanda: deployment-prep shinken: Remove old HHVM queue size check [puppet] - https://gerrit.wikimedia.org/r/286023 (owner: Alex Monk)
[10:03:02] (CR) Yuvipanda: [C: 2 V: 2] deployment-prep shinken: Remove old HHVM queue size check [puppet] - https://gerrit.wikimedia.org/r/286023 (owner: Alex Monk)
[10:08:02] (PS4) Giuseppe Lavagetto: mediawiki::web: drop HHVM define, explicitly block php [puppet] - https://gerrit.wikimedia.org/r/285368 (https://phabricator.wikimedia.org/T126310)
[10:11:08] (CR) Giuseppe Lavagetto: [C: 2] mediawiki::web: drop HHVM define, explicitly block php [puppet] - https://gerrit.wikimedia.org/r/285368 (https://phabricator.wikimedia.org/T126310) (owner: Giuseppe Lavagetto)
[10:14:15] (PS1) Gehel: Remove multicast from Elasticsearch [puppet] - https://gerrit.wikimedia.org/r/286410 (https://phabricator.wikimedia.org/T110236)
[10:15:40] (PS2) Gehel: Remove multicast from Elasticsearch [puppet] - https://gerrit.wikimedia.org/r/286410 (https://phabricator.wikimedia.org/T110236)
[10:18:54] (PS1) Hoo man: Don't publish Wikidata dumps if a shared failed [puppet] - https://gerrit.wikimedia.org/r/286411
[10:22:03] PROBLEM - puppet last run on db2029 is CRITICAL: CRITICAL: puppet fail
[10:22:58] (CR) Muehlenhoff: Enable two-factor authentication in sshd (1 comment) [puppet] - https://gerrit.wikimedia.org/r/282160 (owner: Muehlenhoff)
[10:26:02] RECOVERY - puppet last run on db2029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:38:00] (CR) Giuseppe Lavagetto: mediawiki::web: drop apache 2.2 support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/281419 (https://phabricator.wikimedia.org/T126310) (owner: Giuseppe Lavagetto)
[10:40:38] (PS2) Giuseppe Lavagetto: mediawiki::web: drop apache 2.2 support [puppet] - https://gerrit.wikimedia.org/r/281419 (https://phabricator.wikimedia.org/T126310)
[10:42:01] !log rolling restart of hhvm in codfw for pcre security update
[10:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:52:54] mobrovac: Good catch on deployment-cxserver03!
[11:06:18] !log rolling restart of hhvm in eqiad for pcre security update
[11:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:12:46] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: puppet fail
[11:21:44] !log deployed the last version of Event Logging from tin. Service also restarted.
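The batching concern raised in the 09:27–09:28 exchange above, that jobs like CategoryMembershipChangeJob issue a huge number of updates without breaking them down, is the classic fix-by-chunking problem. A generic sketch in Python with a hypothetical helper (MediaWiki's actual JobQueue works differently):

```python
# Generic batching sketch: instead of one giant write covering every
# affected row, split the work into fixed-size chunks so locks are held
# briefly and memory stays bounded. `in_batches` is a hypothetical helper.

def in_batches(items, batch_size=100):
    """Yield successive lists of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

updates = list(range(1050))
batches = list(in_batches(updates, 100))
assert len(batches) == 11        # 10 full batches + 1 remainder
assert len(batches[-1]) == 50
```

Each chunk can then be committed (and replication lag checked) before the next one starts, rather than holding a single long transaction or lock such as the GET_LOCK seen in fatalmonitor.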
[11:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:29:35] (CR) Giuseppe Lavagetto: [C: 2] mediawiki::web: drop apache 2.2 support [puppet] - https://gerrit.wikimedia.org/r/281419 (https://phabricator.wikimedia.org/T126310) (owner: Giuseppe Lavagetto)
[11:39:22] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:44:19] (PS1) Muehlenhoff: Move jobrunner ferm service into the roles [puppet] - https://gerrit.wikimedia.org/r/286415
[11:47:21] Operations, MediaWiki-General-or-Unknown, HHVM: Refresh the appservers puppet code/configs - https://phabricator.wikimedia.org/T131748#2256345 (Joe)
[11:51:00] (Abandoned) Giuseppe Lavagetto: mediawiki::web: drop the HHVM define and mod_php [puppet] - https://gerrit.wikimedia.org/r/281418 (https://phabricator.wikimedia.org/T126310) (owner: Giuseppe Lavagetto)
[12:06:47] (PS2) Elukey: Return a custom HTTP 503 response for all the stat1001 websites due to maintenance. [puppet] - https://gerrit.wikimedia.org/r/285976 (https://phabricator.wikimedia.org/T76348)
[12:08:43] (CR) Elukey: [C: 2] "Puppet compiler looks good:" [puppet] - https://gerrit.wikimedia.org/r/285976 (https://phabricator.wikimedia.org/T76348) (owner: Elukey)
[12:08:49] a user is reporting problems with https://commons.wikimedia.org/w/index.php?title=File:UBBasel_Map_1569_Kartenslg_AA_3-5.tif&page=2
[12:08:59] over in wikimedia-tech
[12:10:52] <_joe_> mobrovac: uhm
[12:11:47] <_joe_> something is wrong with that image or others have the same issue
[12:12:35] asked the person
[12:12:42] !log Merged Varnish cache::misc change to force HTTP 503 for datasets.wikimedia.org, stats.wikimedia.org, metrics.wikimedia.org as prep-step for OS reimage.
[12:12:56] <_joe_> I can confirm other images render perfectly
[12:13:09] mmmm I added a space
[12:13:28] !log deployed Varnish cache::misc change to force HTTP 503 for datasets.wikimedia.org, stats.wikimedia.org, metrics.wikimedia.org as prep-step for OS reimage.
[12:13:51] mmmm
[12:14:26] <_joe_> mobrovac: it seems like a specific problem to me
[12:15:14] !log deployed Varnish change to force HTTP 503 for datasets.wikimedia.org, stats.wikimedia.org, metrics.wikimedia.org as prep-step for OS reimage.
[12:15:31] _joe_: yeah, file size: 1.56 GB
[12:15:33] wth?
[12:15:34] definitely not some weird chars, same message logged in analytics
[12:16:12] _joe_: that link now works for me, fwiw
[12:16:59] wow
[12:17:07] 1.56 GB tif file ?
[12:17:18] seems so
[12:17:38] i can guarantee that the extension was not written with that usecase in mind :)
[12:18:10] really? :D
[12:18:35] <_joe_> rotfl
[12:18:44] isn't there an upload limit or sth?
[12:18:46] ah, actually, there are probably not 3 pages in that tiff
[12:19:00] i mean 1.5GB, how do you upload that?
[12:19:26] the upload bots can manage that.
[12:20:04] <_joe_> interestingly, I did an apache change that should've been a noop but I just had a slight doubt about imagescalers
[12:20:12] <_joe_> and ofc an issue came up :P
[12:22:08] (PS1) Mark Bergsma: Fix Facilities power aggregates [puppet] - https://gerrit.wikimedia.org/r/286416
[12:22:10] (PS1) Mark Bergsma: Add codfw PDUs and power aggregates [puppet] - https://gerrit.wikimedia.org/r/286417
[12:22:34] hmm, there should be 3 pages.
[12:23:05] (PS2) Mark Bergsma: Add codfw PDUs and power aggregates [puppet] - https://gerrit.wikimedia.org/r/286417
[12:23:07] (PS2) Mark Bergsma: Fix Facilities power aggregates [puppet] - https://gerrit.wikimedia.org/r/286416
[12:24:14] mobrovac: file bug reports, wait for someone to be willing to work on PagedTiffHandler :)
[12:24:52] thedj: i just pointed out here somebody's got a problem in #tech :P
[12:25:30] hmm morebots hasn't appeared to respond in the channel when it's done the last couple of logs (but they are going into the log)
[12:25:54] (PS3) Mark Bergsma: Add codfw PDUs and power aggregates [puppet] - https://gerrit.wikimedia.org/r/286417
[12:25:56] (PS3) Mark Bergsma: Fix Facilities power aggregates [puppet] - https://gerrit.wikimedia.org/r/286416
[12:28:01] (CR) Mark Bergsma: [C: 2] Fix Facilities power aggregates [puppet] - https://gerrit.wikimedia.org/r/286416 (owner: Mark Bergsma)
[12:35:28] Operations, Ops-Access-Requests, Continuous-Integration-Infrastructure, Release-Engineering-Team: add nodepool restart to contint-admins - https://phabricator.wikimedia.org/T133990#2251528 (MoritzMuehlenhoff) That access request contains multiple points, please specify what permission change you...
[12:37:11] Operations, Ops-Access-Requests, Continuous-Integration-Infrastructure, Release-Engineering-Team: Allow RelEng nova log access - https://phabricator.wikimedia.org/T133992#2251644 (MoritzMuehlenhoff) Can you specify what log files/directories exactly you need access to?
[12:38:14] Operations, ops-eqiad, DC-Ops: I/O issues for /dev/sdd on analytics1047.eqiad.wmnet - https://phabricator.wikimedia.org/T134056#2256416 (MoritzMuehlenhoff) a: Cmjohnson
[12:42:16] (PS4) Mark Bergsma: Add codfw PDUs and power aggregates [puppet] - https://gerrit.wikimedia.org/r/286417
[12:44:08] (CR) Mark Bergsma: [C: 2] Add codfw PDUs and power aggregates [puppet] - https://gerrit.wikimedia.org/r/286417 (owner: Mark Bergsma)
[12:49:42] (PS1) Elukey: Revert "Return a custom HTTP 503 response for all the stat1001 websites due to maintenance." [puppet] - https://gerrit.wikimedia.org/r/286418
[12:53:09] (PS1) BBlack: cache_misc: fix maintenance flag support [puppet] - https://gerrit.wikimedia.org/r/286419
[12:55:55] (PS1) Mark Bergsma: Add missing tokensets for codfw PDUs [puppet] - https://gerrit.wikimedia.org/r/286420
[12:57:22] (CR) Mark Bergsma: [C: 2] Add missing tokensets for codfw PDUs [puppet] - https://gerrit.wikimedia.org/r/286420 (owner: Mark Bergsma)
[12:59:24] Operations, Ops-Access-Requests, Continuous-Integration-Infrastructure, Release-Engineering-Team: add nodepool restart to contint-admins - https://phabricator.wikimedia.org/T133990#2256451 (JanZerebecki) Does "service nodepool restart" work and does it need to do additional work over the equivale...
[13:04:20] Operations, Wikimedia-Stream, Patch-For-Review: Ferm rules for rcstream - https://phabricator.wikimedia.org/T104981#2256491 (MoritzMuehlenhoff) Open>Resolved These are enabled for some time now, closing the bug.
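The !log entries earlier (12:12–12:15) and the "cache_misc: fix maintenance flag support" patch above force HTTP 503 responses from Varnish for sites whose backend host is being reimaged. A minimal VCL 4.0 sketch of that general technique; the hostnames come from the log, but this is only an illustration, and the real cache_misc VCL in operations/puppet is structured differently:

```vcl
vcl 4.0;

# Placeholder backend; Varnish requires at least one backend to load VCL.
backend default {
    .host = "127.0.0.1";
    .port = "80";
}

sub vcl_recv {
    # Maintenance flag: answer requests for hosts under reimage with a
    # synthetic 503 instead of contacting the (down) backend.
    if (req.http.Host == "datasets.wikimedia.org" ||
        req.http.Host == "stats.wikimedia.org" ||
        req.http.Host == "metrics.wikimedia.org") {
        return (synth(503, "Service Unavailable"));
    }
}

sub vcl_synth {
    if (resp.status == 503) {
        set resp.http.Retry-After = "3600";
        synthetic("Down for maintenance, back shortly.");
        return (deliver);
    }
}
```

The point of routing through `vcl_synth` is that the 503 is generated inside the cache layer itself, which is why the flag had to be fixed in VCL rather than on the origin host.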
[13:05:46] (CR) Elukey: [C: 1] "Puppet compiler shows the correct change in VCL:" [puppet] - https://gerrit.wikimedia.org/r/286419 (owner: BBlack)
[13:06:26] (PS2) Elukey: cache_misc: fix maintenance flag support [puppet] - https://gerrit.wikimedia.org/r/286419 (owner: BBlack)
[13:07:34] Operations, Analytics-Kanban, Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2256496 (elukey) Websites still working since we had a problem with the VCL change, the following patch should fix the issue: https://gerrit.wikimedia.org/r/#/c/286419/
[13:08:36] (CR) BBlack: [C: 2] cache_misc: fix maintenance flag support [puppet] - https://gerrit.wikimedia.org/r/286419 (owner: BBlack)
[13:13:34] !log stopping db1040 mysql for backup before cloning
[13:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:14:43] RECOVERY - MySQL Replication Heartbeat on db1040 is OK: OK replication delay 0 seconds
[13:16:59] ^lol, 0 seconds with it down
[13:17:10] a problem with 0 lag
[13:26:25] (PS5) Gehel: Don't create new log files for cirrus-suggest with logrotate [puppet] - https://gerrit.wikimedia.org/r/268215 (owner: EBernhardson)
[13:27:26] (CR) Muehlenhoff: "I've now written up the bigger implementation plan wrt the authentication servers at https://wikitech.wikimedia.org/wiki/Yubikey-2FA" [puppet] - https://gerrit.wikimedia.org/r/282160 (owner: Muehlenhoff)
[13:29:06] (CR) Gehel: [C: 2] Don't create new log files for cirrus-suggest with logrotate [puppet] - https://gerrit.wikimedia.org/r/268215 (owner: EBernhardson)
[13:33:46] Operations, Discovery, Maps: Switch Maps to production status - https://phabricator.wikimedia.org/T133744#2256600 (Gehel)
[13:34:08] (PS1) BBlack: cache_maps: varnish4-only [puppet] - https://gerrit.wikimedia.org/r/286427
[13:36:38] (PS2) JanZerebecki: Don't publish Wikidata dumps if a shared failed [puppet] - https://gerrit.wikimedia.org/r/286411 (https://phabricator.wikimedia.org/T133924) (owner: Hoo man)
[13:36:55] (PS2) BBlack: cache_maps: varnish4-only [puppet] - https://gerrit.wikimedia.org/r/286427
[13:43:36] (PS1) Hashar: admin: 'service nodepool restart' for contintadmins [puppet] - https://gerrit.wikimedia.org/r/286428 (https://phabricator.wikimedia.org/T133990)
[13:47:54] Operations, Ops-Access-Requests, Continuous-Integration-Infrastructure, Release-Engineering-Team, Patch-For-Review: add nodepool restart to contint-admins - https://phabricator.wikimedia.org/T133990#2256625 (hashar) https://gerrit.wikimedia.org/r/286428 adds sudo rule for `service nodepool r...
[13:55:24] Operations, DBA, Patch-For-Review: upgrade dbproxy1001/1002 to jessie - https://phabricator.wikimedia.org/T125027#2256640 (Dzahn) @jcrespo sorry, this is like the other upgrade ticket. it should have been just "shutdown or upgrade precise" not necessarily "jessie". Should i rename it?
[14:02:50] Operations, ops-eqiad, Analytics-Kanban: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2256684 (Ottomata) Hm, we were planning on running 2 cassandra instances per node for a total of 6 instances. Just stating the obvious here for my own benefit: - If we go with RAID 0...
[14:08:06] (CR) MarkTraceur: [C: 1] "LGTM, we'll get this deployed in a couple of hours." (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/285700 (https://phabricator.wikimedia.org/T132930) (owner: Bartosz Dziewoński)
[14:10:21] (CR) Aklapper: "@Faidon: Is this still wanted? If so, should this get merged? Asking as this has been rotting here for more than a year without a review.." [dns] - https://gerrit.wikimedia.org/r/143762 (owner: Faidon Liambotis)
[14:13:13] Request from 90.180.83.194 via cp3031 cp3031, Varnish XID 1753943379
[14:13:16] Error: 503, Service Unavailable at Mon, 02 May 2016 14:12:45 GMT
[14:13:34] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100%
[14:13:35] PROBLEM - Host cp3037 is DOWN: PING CRITICAL - Packet loss = 100%
[14:13:35] PROBLEM - Host cp3009 is DOWN: PING CRITICAL - Packet loss = 100%
[14:13:35] PROBLEM - Host cp3007 is DOWN: PING CRITICAL - Packet loss = 100%
[14:13:52] Error
[14:13:52] Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon.
[14:13:55] Please try again in a few minutes.
[14:13:55] PROBLEM - Host cp3049 is DOWN: PING CRITICAL - Packet loss = 100%
[14:14:04] PROBLEM - Host cp3039 is DOWN: PING CRITICAL - Packet loss = 100%
[14:14:04] PROBLEM - Host cp3038 is DOWN: PING CRITICAL - Packet loss = 100%
[14:14:20] esams down?
[14:14:20] PROBLEM - Host upload-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[14:14:21] PROBLEM - Host cp3036 is DOWN: PING CRITICAL - Packet loss = 100%
[14:14:21] PROBLEM - Host cp3005 is DOWN: PING CRITICAL - Packet loss = 100%
[14:14:21] PROBLEM - Host cp3040 is DOWN: PING CRITICAL - Packet loss = 100%
[14:14:21] PROBLEM - Host cp3030 is DOWN: PING CRITICAL - Packet loss = 100%
[14:14:53] <_joe_> looks like it
[14:14:54] PROBLEM - Host cp3010 is DOWN: PING CRITICAL - Packet loss = 100%
[14:14:54] PROBLEM - Host cp3042 is DOWN: PING CRITICAL - Packet loss = 100%
[14:14:54] PROBLEM - Host cp3031 is DOWN: PING CRITICAL - Packet loss = 100%
[14:14:54] PROBLEM - Host cp3034 is DOWN: PING CRITICAL - Packet loss = 100%
[14:15:04] PROBLEM - Host cp3043 is DOWN: PING CRITICAL - Packet loss = 100%
[14:15:04] PROBLEM - Host cp3032 is DOWN: PING CRITICAL - Packet loss = 100%
[14:15:13] PROBLEM - Host ms-fe3001 is DOWN: PING CRITICAL - Packet loss = 100%
[14:15:14] PROBLEM - Host cp3045 is DOWN: PING CRITICAL - Packet loss = 100%
[14:15:23] "Request from 80.176.129.180 via cp3043 cp3043, Varnish XID 1784571515
[14:15:24] PROBLEM - Host cp3033 is DOWN: PING CRITICAL - Packet loss = 100%
[14:15:24] PROBLEM - Host cp3047 is DOWN: PING CRITICAL - Packet loss = 100%
[14:15:24] PROBLEM - Host cp3048 is DOWN: PING CRITICAL - Packet loss = 100%
[14:15:24] PROBLEM - Host cp3004 is DOWN: PING CRITICAL - Packet loss = 100%
[14:15:24] PROBLEM - Host cp3035 is DOWN: PING CRITICAL - Packet loss = 100%
[14:15:25] Error: 503, Service Unavailable at Mon, 02 May 2016 14:14:49 GMT"
[14:15:29] (CR) Aude: Enable Visual Editor on all namespaces of plwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/286287 (https://phabricator.wikimedia.org/T133980) (owner: Urbanecm)
[14:15:33] PROBLEM - Host ms-be3001 is DOWN: PING CRITICAL - Packet loss = 100%
[14:15:33] PROBLEM - Host cr2-esams IPv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ffff::3
[14:15:33] PROBLEM - Host ns2-v6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::e
[14:15:34] PROBLEM - Host cp3041 is DOWN: PING CRITICAL - Packet loss = 100%
[14:15:34] PROBLEM - Host cp3044 is DOWN: PING CRITICAL - Packet loss = 100%
[14:15:34] PROBLEM - Host cp3008 is DOWN: PING CRITICAL - Packet loss = 100%
[14:15:34] PROBLEM - Host cp3006 is DOWN: PING CRITICAL - Packet loss = 100%
[14:15:35] (CR) MarkTraceur: "So, aside from James's suggestion that /testwiki2?/ is a terrible thing, I'm curious if the multiple foreign repository thing has been tes" [mediawiki-config] - https://gerrit.wikimedia.org/r/285708 (https://phabricator.wikimedia.org/T133305) (owner: Bartosz Dziewoński)
[14:15:43] PROBLEM - Host ms-be3003 is DOWN: PING CRITICAL - Packet loss = 100%
[14:15:43] PROBLEM - Host lvs3002 is DOWN: PING CRITICAL - Packet loss = 100%
[14:15:43] PROBLEM - Host nescio is DOWN: PING CRITICAL - Packet loss = 100%
[14:15:43] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100%
[14:15:43] PROBLEM - Host cp3015 is DOWN: PING CRITICAL - Packet loss = 100% [14:15:43] PROBLEM - Host cr1-esams is DOWN: PING CRITICAL - Packet loss = 100% [14:15:44] PROBLEM - Host cp3046 is DOWN: PING CRITICAL - Packet loss = 100% [14:15:44] Planned mantainance people? [14:15:52] Something just broke:( [14:15:53] PROBLEM - Host eeden is DOWN: PING CRITICAL - Packet loss = 100% [14:15:53] PROBLEM - Host cr2-knams is DOWN: PING CRITICAL - Packet loss = 100% [14:15:53] PROBLEM - Host cr2-esams is DOWN: PING CRITICAL - Packet loss = 100% [14:15:53] PROBLEM - Host csw2-esams.mgmt.esams.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:15:53] PROBLEM - Host cp3012 is DOWN: PING CRITICAL - Packet loss = 100% [14:15:53] PROBLEM - Host lvs3004 is DOWN: PING CRITICAL - Packet loss = 100% [14:15:54] PROBLEM - Host ms-be3004 is DOWN: PING CRITICAL - Packet loss = 100% [14:15:54] PROBLEM - Host maerlant is DOWN: PING CRITICAL - Packet loss = 100% [14:15:55] PROBLEM - Host cp3018 is DOWN: PING CRITICAL - Packet loss = 100% [14:15:55] PROBLEM - Host lvs3003 is DOWN: PING CRITICAL - Packet loss = 100% [14:15:56] PROBLEM - Host bast3001 is DOWN: PING CRITICAL - Packet loss = 100% [14:15:56] PROBLEM - Host ms-be3002 is DOWN: PING CRITICAL - Packet loss = 100% [14:15:57] PROBLEM - Host cp3016 is DOWN: PING CRITICAL - Packet loss = 100% [14:15:57] PROBLEM - Host cp3022 is DOWN: PING CRITICAL - Packet loss = 100% [14:16:11] PROBLEM - Host upload-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::2:b [14:16:18] PROBLEM - Host text-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::1 [14:16:19] PROBLEM - Host cp3017 is DOWN: PING CRITICAL - Packet loss = 100% [14:16:23] (03PS1) 10Giuseppe Lavagetto: Set esams down [dns] - 10https://gerrit.wikimedia.org/r/286435 [14:16:27] <_joe_> godog: ^^ [14:16:28] PROBLEM - Host 2620:0:862:1:91:198:174:122 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:1:91:198:174:122 [14:16:50] 
(03CR) 10Filippo Giunchedi: [C: 031] Set esams down [dns] - 10https://gerrit.wikimedia.org/r/286435 (owner: 10Giuseppe Lavagetto) [14:16:52] LGTM! [14:16:55] PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [14:17:03] PROBLEM - Host misc-web-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [14:17:03] PROBLEM - Host 91.198.174.106 is DOWN: PING CRITICAL - Packet loss = 100% [14:17:04] PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100% [14:17:06] Planned upgrades needing a server down? [14:17:08] (03CR) 10Giuseppe Lavagetto: [C: 032] Set esams down [dns] - 10https://gerrit.wikimedia.org/r/286435 (owner: 10Giuseppe Lavagetto) [14:17:13] <_joe_> ShakespeareFan00: nope [14:17:27] PROBLEM - Host 91.198.174.122 is DOWN: PING CRITICAL - Packet loss = 100% [14:17:27] PROBLEM - Host asw-esams.mgmt.esams.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [14:17:27] PROBLEM - Host mr1-esams is DOWN: PING CRITICAL - Packet loss = 100% [14:17:51] LGTM too, thanks _joe_ :) [14:18:09] What is LGTM? [14:18:12] <_joe_> merged, running authdns update [14:18:16] looks good to me [14:18:18] <_joe_> ShakespeareFan00: https://gerrit.wikimedia.org/r/286435 [14:18:29] <_joe_> I'm taking a datacenter out of the dns map [14:18:32] I'm connecting from Iran, and I have issue too [14:18:33] https://doc.wikimedia.org/mw-tools-scap/ [14:18:38] it says server is down [14:18:41] * apergos peeks in [14:18:46] <_joe_> It's failing on eeden ofc [14:18:50] I'm not sure if it's related or not [14:18:58] Amir1, yes, in reality it will affect Europe, Africa and Asia [14:19:07] jynus: oh, thanks :) [14:19:13] Amir1: Does Iran route its DNS through a proxy? 
[14:19:19] I'll fix eeden [14:19:20] Just Europe is where our datacenter is [14:19:20] nope [14:19:33] :c: is down [14:19:44] PROBLEM - Host 2620:0:862:1:91:198:174:106 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:1:91:198:174:106 [14:20:05] RECOVERY - Host cp3006 is UP: PING WARNING - Packet loss = 80%, RTA = 82.96 ms [14:20:05] RECOVERY - Host cp3008 is UP: PING WARNING - Packet loss = 80%, RTA = 83.71 ms [14:20:05] RECOVERY - Host cp3044 is UP: PING WARNING - Packet loss = 80%, RTA = 83.10 ms [14:20:05] RECOVERY - Host cp3037 is UP: PING WARNING - Packet loss = 44%, RTA = 83.41 ms [14:20:05] RECOVERY - Host cp3003 is UP: PING WARNING - Packet loss = 44%, RTA = 85.56 ms [14:20:05] RECOVERY - Host cp3009 is UP: PING WARNING - Packet loss = 44%, RTA = 83.94 ms [14:20:05] RECOVERY - Host cp3007 is UP: PING WARNING - Packet loss = 44%, RTA = 84.60 ms [14:20:06] RECOVERY - Host cp3046 is UP: PING WARNING - Packet loss = 44%, RTA = 86.14 ms [14:20:14] RECOVERY - Host cp3049 is UP: PING OK - Packet loss = 0%, RTA = 83.31 ms [14:20:14] RECOVERY - Host ms-be3004 is UP: PING OK - Packet loss = 0%, RTA = 83.69 ms [14:20:20] RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 83.09 ms [14:20:21] RECOVERY - Host ms-fe3001 is UP: PING OK - Packet loss = 0%, RTA = 82.97 ms [14:20:21] RECOVERY - Host ms-be3002 is UP: PING OK - Packet loss = 0%, RTA = 83.42 ms [14:20:21] RECOVERY - Host ms-be3003 is UP: PING OK - Packet loss = 0%, RTA = 83.49 ms [14:20:21] RECOVERY - Host lvs3002 is UP: PING OK - Packet loss = 0%, RTA = 84.15 ms [14:20:21] RECOVERY - Host lvs3004 is UP: PING OK - Packet loss = 0%, RTA = 83.47 ms [14:20:28] RECOVERY - Host upload-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 84.10 ms [14:20:28] RECOVERY - Host lvs3003 is UP: PING OK - Packet loss = 0%, RTA = 83.76 ms [14:20:28] RECOVERY - Host cp3047 is UP: PING OK - Packet loss = 0%, RTA = 83.48 ms [14:20:28] RECOVERY - Host cp3045 is UP: PING OK - Packet 
loss = 0%, RTA = 84.73 ms [14:20:28] RECOVERY - Host cp3035 is UP: PING OK - Packet loss = 0%, RTA = 83.75 ms [14:20:28] RECOVERY - Host multatuli is UP: PING OK - Packet loss = 0%, RTA = 84.21 ms [14:20:28] RECOVERY - Host cp3034 is UP: PING OK - Packet loss = 0%, RTA = 83.34 ms [14:20:29] RECOVERY - Host cp3048 is UP: PING OK - Packet loss = 0%, RTA = 83.16 ms [14:20:29] RECOVERY - Host cp3005 is UP: PING OK - Packet loss = 0%, RTA = 83.35 ms [14:20:31] !log stopped gdnsd on eeden [14:20:35] RECOVERY - Host text-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 83.87 ms [14:20:35] RECOVERY - Host ms-fe3002 is UP: PING OK - Packet loss = 0%, RTA = 88.07 ms [14:20:35] RECOVERY - Host cp3031 is UP: PING OK - Packet loss = 0%, RTA = 86.75 ms [14:20:35] RECOVERY - Host 2620:0:862:1:91:198:174:122 is UP: PING OK - Packet loss = 0%, RTA = 83.49 ms [14:20:35] RECOVERY - Host eeden is UP: PING OK - Packet loss = 0%, RTA = 83.40 ms [14:20:35] RECOVERY - Host ms-be3001 is UP: PING OK - Packet loss = 0%, RTA = 83.63 ms [14:20:35] RECOVERY - Host maerlant is UP: PING OK - Packet loss = 0%, RTA = 83.81 ms [14:20:36] RECOVERY - Host cp3042 is UP: PING OK - Packet loss = 0%, RTA = 82.75 ms [14:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:20:43] of course as it recovers, but still :P [14:20:44] SO what happened? 
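Pairing each host's first PROBLEM with its next RECOVERY gives the per-host outage window — roughly 14:15 to 14:21 for most of esams here. A rough sketch under the assumption of the `[HH:MM:SS] PROBLEM/RECOVERY - Host X is ...` line format above (the `outage_windows` helper is hypothetical, not a real tool):

```python
import re
from datetime import datetime, timedelta

LINE = re.compile(r"\[(\d\d:\d\d:\d\d)\] (PROBLEM|RECOVERY) - Host (\S+) is")

def outage_windows(log: str) -> dict:
    """Pair each host's PROBLEM with its next RECOVERY; return downtime per host."""
    down, total = {}, {}
    for ts, kind, host in LINE.findall(log):
        t = datetime.strptime(ts, "%H:%M:%S")
        if kind == "PROBLEM":
            down.setdefault(host, t)  # keep only the first PROBLEM of a flap
        elif host in down:
            total[host] = total.get(host, timedelta()) + (t - down.pop(host))
    return total

# Sample lines taken from this incident:
log = """\
[14:15:24] PROBLEM - Host cp3033 is DOWN: PING CRITICAL - Packet loss = 100%
[14:15:43] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100%
[14:20:47] RECOVERY - Host cp3033 is UP: PING OK - Packet loss = 0%, RTA = 83.34 ms
[14:20:47] RECOVERY - Host lvs3001 is UP: PING OK - Packet loss = 0%, RTA = 85.59 ms
"""
for host, dt in outage_windows(log).items():
    print(host, dt)  # cp3033 0:05:23 / lvs3001 0:05:04
```

This is only an approximation: it ignores midnight wraparound and unpaired events, but it is enough to confirm the roughly five-minute window visible in the log.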
[14:20:47] RECOVERY - Host cp3033 is UP: PING OK - Packet loss = 0%, RTA = 83.34 ms [14:20:47] RECOVERY - Host nescio is UP: PING OK - Packet loss = 0%, RTA = 83.44 ms [14:20:47] RECOVERY - Host cp3032 is UP: PING OK - Packet loss = 0%, RTA = 84.65 ms [14:20:47] RECOVERY - Host lvs3001 is UP: PING OK - Packet loss = 0%, RTA = 85.59 ms [14:20:47] RECOVERY - Host cp3040 is UP: PING OK - Packet loss = 0%, RTA = 82.89 ms [14:20:47] RECOVERY - Host cp3039 is UP: PING OK - Packet loss = 0%, RTA = 83.45 ms [14:20:47] RECOVERY - Host cp3030 is UP: PING OK - Packet loss = 0%, RTA = 84.79 ms [14:20:48] RECOVERY - Host cp3004 is UP: PING OK - Packet loss = 0%, RTA = 83.29 ms [14:20:48] RECOVERY - Host cp3036 is UP: PING OK - Packet loss = 0%, RTA = 83.95 ms [14:20:49] RECOVERY - Host cp3016 is UP: PING OK - Packet loss = 0%, RTA = 83.65 ms [14:20:49] RECOVERY - Host cp3010 is UP: PING OK - Packet loss = 0%, RTA = 83.19 ms [14:20:50] RECOVERY - Host cp3043 is UP: PING OK - Packet loss = 0%, RTA = 84.32 ms [14:21:04] RECOVERY - Host misc-web-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 83.25 ms [14:21:04] RECOVERY - Host cp3018 is UP: PING OK - Packet loss = 0%, RTA = 83.06 ms [14:21:04] RECOVERY - Host cp3013 is UP: PING OK - Packet loss = 0%, RTA = 82.96 ms [14:21:04] RECOVERY - Host cp3014 is UP: PING OK - Packet loss = 0%, RTA = 83.01 ms [14:21:04] RECOVERY - Host cp3012 is UP: PING OK - Packet loss = 0%, RTA = 83.06 ms [14:21:04] RECOVERY - Host cp3020 is UP: PING OK - Packet loss = 0%, RTA = 83.11 ms [14:21:05] RECOVERY - Host cp3019 is UP: PING OK - Packet loss = 0%, RTA = 83.07 ms [14:21:05] RECOVERY - Host cp3041 is UP: PING OK - Packet loss = 0%, RTA = 82.90 ms [14:21:06] RECOVERY - Host cp3021 is UP: PING OK - Packet loss = 0%, RTA = 83.24 ms [14:21:06] RECOVERY - Host cp3015 is UP: PING OK - Packet loss = 0%, RTA = 83.49 ms [14:21:08] RECOVERY - Host cr1-esams is UP: PING OK - Packet loss = 0%, RTA = 84.61 ms [14:21:20] <_joe_> ShakespeareFan00: a 
datacenter (the one that serves europe) was unreachable for a few minutes [14:21:27] RECOVERY - Host bast3001 is UP: PING OK - Packet loss = 0%, RTA = 84.11 ms [14:21:27] RECOVERY - Host asw-esams.mgmt.esams.wmnet is UP: PING OK - Packet loss = 0%, RTA = 83.59 ms [14:21:28] RECOVERY - Host 91.198.174.122 is UP: PING OK - Packet loss = 0%, RTA = 84.29 ms [14:21:28] RECOVERY - Host csw2-esams.mgmt.esams.wmnet is UP: PING OK - Packet loss = 0%, RTA = 84.79 ms [14:21:28] RECOVERY - Host mr1-esams is UP: PING OK - Packet loss = 0%, RTA = 83.57 ms [14:21:37] Any reason why it was unreachable? [14:21:50] <_joe_> ShakespeareFan00: we still don't know [14:21:57] RECOVERY - Host cr2-esams is UP: PING OK - Packet loss = 0%, RTA = 83.44 ms [14:21:58] ShakespeareFan00: things happen :) [14:22:13] <_joe_> it wasn't some planned maintenance AFAICS [14:22:17] RECOVERY - Host cr2-knams is UP: PING OK - Packet loss = 0%, RTA = 85.92 ms [14:22:17] RECOVERY - Host ns2-v6 is UP: PING OK - Packet loss = 0%, RTA = 84.37 ms [14:22:17] RECOVERY - Host cr2-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 84.22 ms [14:22:27] RECOVERY - Host 2620:0:862:1:91:198:174:106 is UP: PING OK - Packet loss = 0%, RTA = 83.15 ms [14:22:33] ShakespeareFan00: There will be an incident report available under https://wikitech.wikimedia.org/wiki/Incident_documentation once investigations have happened. [14:22:48] Patience please. 
[14:22:48] !log starting gdnsd on esams (esams is marked down there) [14:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:23:16] RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 84.10 ms [14:24:02] (03PS1) 10Giuseppe Lavagetto: Revert "Set esams down" [dns] - 10https://gerrit.wikimedia.org/r/286438 [14:24:04] (03CR) 10Urbanecm: "Reply" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286287 (https://phabricator.wikimedia.org/T133980) (owner: 10Urbanecm) [14:24:27] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% [14:25:07] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [14:25:17] PROBLEM - puppet last run on ms-be3001 is CRITICAL: CRITICAL: puppet fail [14:25:18] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: puppet fail [14:25:18] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: puppet fail [14:25:37] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: puppet fail [14:25:38] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: puppet fail [14:25:38] PROBLEM - puppet last run on ms-be3002 is CRITICAL: CRITICAL: puppet fail [14:25:38] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: puppet fail [14:25:47] RECOVERY - Host mw2027 is UP: PING OK - Packet loss = 0%, RTA = 36.74 ms [14:25:48] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [14:25:48] PROBLEM - Apache HTTP on mw2027 is CRITICAL: Connection refused [14:25:57] PROBLEM - puppet last run on ms-be3003 is CRITICAL: CRITICAL: puppet fail [14:26:16] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: puppet fail [14:26:16] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: puppet fail [14:26:18] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: Puppet has 3 failures [14:26:18] PROBLEM - puppet 
last run on cp3030 is CRITICAL: CRITICAL: puppet fail [14:26:18] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: puppet fail [14:26:27] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: Puppet has 3 failures [14:26:27] PROBLEM - puppet last run on cp3018 is CRITICAL: CRITICAL: puppet fail [14:27:18] PROBLEM - puppet last run on multatuli is CRITICAL: CRITICAL: Puppet has 3 failures [14:27:56] RECOVERY - Apache HTTP on mw2027 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.848 second response time [14:28:17] (03CR) 10Aude: Enable Visual Editor on all namespaces of plwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286287 (https://phabricator.wikimedia.org/T133980) (owner: 10Urbanecm) [14:31:27] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:34:49] hey, what's up? [14:35:10] no backlog to read [14:35:18] <_joe_> akosiaris_: esams was briefly unreachable [14:35:20] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:35:23] <_joe_> also, we missed you [14:36:06] 06Operations, 10DBA, 13Patch-For-Review: upgrade dbproxy1001/1002 to jessie - https://phabricator.wikimedia.org/T125027#2256770 (10jcrespo) No, proxies will be jessie. They are not databases, completely different scenario. Still blocked by reviews on https://gerrit.wikimedia.org/r/273958 [14:36:36] do we know why? [14:37:19] <_joe_> akosiaris_: a flapping link it appears [14:38:28] (03PS3) 10Dzahn: adding install params for restbase200[7-9] Bug:T132976 [puppet] - 10https://gerrit.wikimedia.org/r/285994 (https://phabricator.wikimedia.org/T132976) (owner: 10Papaul) [14:38:43] ok, so I assume you have it under control. back to my Easter Monday. 
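The `puppet last run` CRITICALs above arrive in a correlated burst rather than as scattered one-offs, which is the signature of a shared cause (here, the network blip) rather than broken manifests on individual hosts. A hedged sketch of telling the two apart by counting alerts per minute; the regex and the storm threshold are illustrative assumptions, not how the real alerting works:

```python
import re
from collections import Counter

PUPPET_FAIL = re.compile(
    r"\[(\d\d:\d\d):\d\d\] PROBLEM - puppet last run on (\S+) is CRITICAL")

def failures_per_minute(log: str) -> Counter:
    """Count puppet-failure alerts per minute to spot a correlated storm."""
    return Counter(minute for minute, _host in PUPPET_FAIL.findall(log))

# Sample lines taken from the burst above:
log = """\
[14:25:17] PROBLEM - puppet last run on ms-be3001 is CRITICAL: CRITICAL: puppet fail
[14:25:18] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: puppet fail
[14:25:18] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: puppet fail
[14:26:16] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: puppet fail
"""
counts = failures_per_minute(log)
# Arbitrary illustrative threshold: >= 3 alerts in one minute suggests a common cause.
storm_minutes = [m for m, n in counts.items() if n >= 3]
print(counts)
print(storm_minutes)  # ['14:25']
```

A burst that clears on the next puppet run (as the RECOVERY lines later show) supports the shared-cause reading; persistent failures on a single host would point at that host instead.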
[14:38:54] (03CR) 10JanZerebecki: [C: 031] Don't publish Wikidata dumps if a shared failed [puppet] - 10https://gerrit.wikimedia.org/r/286411 (https://phabricator.wikimedia.org/T133924) (owner: 10Hoo man) [14:39:16] (03CR) 10Dzahn: [C: 032] adding install params for restbase200[7-9] Bug:T132976 [puppet] - 10https://gerrit.wikimedia.org/r/285994 (https://phabricator.wikimedia.org/T132976) (owner: 10Papaul) [14:39:26] (03CR) 10Dzahn: [V: 032] adding install params for restbase200[7-9] Bug:T132976 [puppet] - 10https://gerrit.wikimedia.org/r/285994 (https://phabricator.wikimedia.org/T132976) (owner: 10Papaul) [14:42:03] apergos: hey. can you deploy https://gerrit.wikimedia.org/r/#/c/286411/2 so it is done when the cron job runs later, please? [14:42:58] jzerebecki: I'm off these days. do you mind it waiting til wednesday? [14:43:16] I just peeked in due to the pages [14:43:33] apergos: totally. will find someone else. [14:43:38] k [14:45:27] moritzm: ^^? [14:45:50] PROBLEM - puppet last run on mc2002 is CRITICAL: CRITICAL: puppet fail [14:46:11] PROBLEM - puppet last run on mw2164 is CRITICAL: CRITICAL: puppet fail [14:46:11] PROBLEM - puppet last run on logstash1001 is CRITICAL: CRITICAL: puppet fail [14:46:21] PROBLEM - puppet last run on mc1007 is CRITICAL: CRITICAL: puppet fail [14:46:21] PROBLEM - puppet last run on mw1140 is CRITICAL: CRITICAL: puppet fail [14:46:21] PROBLEM - puppet last run on wtp1007 is CRITICAL: CRITICAL: puppet fail [14:46:22] PROBLEM - puppet last run on analytics1033 is CRITICAL: CRITICAL: puppet fail [14:46:30] PROBLEM - puppet last run on db2070 is CRITICAL: CRITICAL: puppet fail [14:46:40] PROBLEM - puppet last run on mw1106 is CRITICAL: CRITICAL: puppet fail [14:46:40] PROBLEM - puppet last run on mw1201 is CRITICAL: CRITICAL: puppet fail [14:46:41] PROBLEM - puppet last run on mw2188 is CRITICAL: CRITICAL: puppet fail [14:46:41] PROBLEM - puppet last run on lvs1007 is CRITICAL: CRITICAL: puppet fail [14:46:41] PROBLEM - puppet last 
run on mw1225 is CRITICAL: CRITICAL: puppet fail [14:46:50] PROBLEM - puppet last run on mc2012 is CRITICAL: CRITICAL: puppet fail [14:47:00] PROBLEM - puppet last run on mw2100 is CRITICAL: CRITICAL: puppet fail [14:47:01] PROBLEM - puppet last run on db1029 is CRITICAL: CRITICAL: puppet fail [14:47:10] PROBLEM - puppet last run on db1064 is CRITICAL: CRITICAL: puppet fail [14:47:10] PROBLEM - puppet last run on db1031 is CRITICAL: CRITICAL: puppet fail [14:47:10] PROBLEM - puppet last run on cp2025 is CRITICAL: CRITICAL: puppet fail [14:47:10] PROBLEM - puppet last run on ganeti2001 is CRITICAL: CRITICAL: puppet fail [14:47:11] PROBLEM - puppet last run on cp1049 is CRITICAL: CRITICAL: puppet fail [14:47:11] PROBLEM - puppet last run on ms-fe2004 is CRITICAL: CRITICAL: puppet fail [14:47:11] PROBLEM - puppet last run on analytics1057 is CRITICAL: CRITICAL: puppet fail [14:47:12] PROBLEM - puppet last run on dbproxy1002 is CRITICAL: CRITICAL: puppet fail [14:47:20] PROBLEM - puppet last run on aqs1003 is CRITICAL: CRITICAL: puppet fail [14:47:20] PROBLEM - puppet last run on mw1007 is CRITICAL: CRITICAL: puppet fail [14:47:21] PROBLEM - puppet last run on mw2031 is CRITICAL: CRITICAL: puppet fail [14:47:21] PROBLEM - puppet last run on elastic1017 is CRITICAL: CRITICAL: puppet fail [14:47:30] PROBLEM - puppet last run on mw2009 is CRITICAL: CRITICAL: puppet fail [14:47:30] PROBLEM - puppet last run on mw2163 is CRITICAL: CRITICAL: puppet fail [14:47:30] PROBLEM - puppet last run on rdb1003 is CRITICAL: CRITICAL: puppet fail [14:47:31] RECOVERY - puppet last run on ms-be3003 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [14:47:31] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: puppet fail [14:47:31] PROBLEM - puppet last run on potassium is CRITICAL: CRITICAL: puppet fail [14:47:32] PROBLEM - puppet last run on mw2109 is CRITICAL: CRITICAL: puppet fail [14:47:32] (03PS1) 10Mark Bergsma: Provide Torrus with the 
(specific) SNMP community used for codfw PDUs [puppet] - 10https://gerrit.wikimedia.org/r/286447 [14:47:32] PROBLEM - puppet last run on elastic2002 is CRITICAL: CRITICAL: puppet fail [14:47:32] PROBLEM - puppet last run on mw2161 is CRITICAL: CRITICAL: puppet fail [14:47:34] <_joe_> uh what's up? [14:47:40] PROBLEM - puppet last run on cp2023 is CRITICAL: CRITICAL: puppet fail [14:47:40] PROBLEM - puppet last run on db2045 is CRITICAL: CRITICAL: puppet fail [14:47:40] PROBLEM - puppet last run on ganeti1004 is CRITICAL: CRITICAL: puppet fail [14:47:40] PROBLEM - puppet last run on db1044 is CRITICAL: CRITICAL: puppet fail [14:47:41] PROBLEM - puppet last run on mw1024 is CRITICAL: CRITICAL: puppet fail [14:47:41] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CRITICAL: puppet fail [14:47:41] PROBLEM - puppet last run on mw1197 is CRITICAL: CRITICAL: puppet fail [14:47:50] PROBLEM - puppet last run on lvs1002 is CRITICAL: CRITICAL: puppet fail [14:47:51] PROBLEM - puppet last run on rdb2006 is CRITICAL: CRITICAL: puppet fail [14:47:51] PROBLEM - puppet last run on mw1142 is CRITICAL: CRITICAL: puppet fail [14:47:51] PROBLEM - puppet last run on mw2015 is CRITICAL: CRITICAL: puppet fail [14:47:51] PROBLEM - puppet last run on mw2064 is CRITICAL: CRITICAL: puppet fail [14:47:52] PROBLEM - puppet last run on restbase1011 is CRITICAL: CRITICAL: puppet fail [14:47:52] PROBLEM - puppet last run on ms-be1006 is CRITICAL: CRITICAL: puppet fail [14:47:54] i probably broke it [14:48:00] PROBLEM - puppet last run on db2059 is CRITICAL: CRITICAL: puppet fail [14:48:01] PROBLEM - puppet last run on mw2105 is CRITICAL: CRITICAL: puppet fail [14:48:01] PROBLEM - puppet last run on wtp2002 is CRITICAL: CRITICAL: puppet fail [14:48:01] PROBLEM - puppet last run on copper is CRITICAL: CRITICAL: puppet fail [14:48:01] PROBLEM - puppet last run on mw2199 is CRITICAL: CRITICAL: puppet fail [14:48:01] PROBLEM - puppet last run on db1022 is CRITICAL: CRITICAL: puppet fail [14:48:01] 
PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CRITICAL: puppet fail [14:48:02] PROBLEM - puppet last run on db2008 is CRITICAL: CRITICAL: puppet fail [14:48:02] PROBLEM - puppet last run on ms-fe1001 is CRITICAL: CRITICAL: puppet fail [14:48:03] PROBLEM - puppet last run on mw1204 is CRITICAL: CRITICAL: puppet fail [14:48:03] PROBLEM - puppet last run on elastic1004 is CRITICAL: CRITICAL: puppet fail [14:48:04] PROBLEM - puppet last run on db2009 is CRITICAL: CRITICAL: puppet fail [14:48:10] PROBLEM - puppet last run on mw1003 is CRITICAL: CRITICAL: puppet fail [14:48:10] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: puppet fail [14:48:10] PROBLEM - puppet last run on elastic2017 is CRITICAL: CRITICAL: puppet fail [14:48:11] RECOVERY - puppet last run on multatuli is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [14:48:11] PROBLEM - puppet last run on ms-be2004 is CRITICAL: CRITICAL: puppet fail [14:48:11] PROBLEM - puppet last run on analytics1043 is CRITICAL: CRITICAL: puppet fail [14:48:11] PROBLEM - puppet last run on auth1001 is CRITICAL: CRITICAL: puppet fail [14:48:12] PROBLEM - puppet last run on mw1254 is CRITICAL: CRITICAL: puppet fail [14:48:20] PROBLEM - puppet last run on ms-be2006 is CRITICAL: CRITICAL: puppet fail [14:48:20] PROBLEM - puppet last run on elastic1008 is CRITICAL: CRITICAL: puppet fail [14:48:20] PROBLEM - puppet last run on helium is CRITICAL: CRITICAL: puppet fail [14:48:20] PROBLEM - puppet last run on mw2083 is CRITICAL: CRITICAL: puppet fail [14:48:21] PROBLEM - puppet last run on mw1150 is CRITICAL: CRITICAL: puppet fail [14:48:21] PROBLEM - puppet last run on wtp2005 is CRITICAL: CRITICAL: puppet fail [14:48:31] PROBLEM - puppet last run on elastic2004 is CRITICAL: CRITICAL: puppet fail [14:48:31] PROBLEM - puppet last run on mw1010 is CRITICAL: CRITICAL: puppet fail [14:48:31] PROBLEM - puppet last run on mw1153 is CRITICAL: CRITICAL: puppet fail [14:48:40] PROBLEM - 
puppet last run on mw1241 is CRITICAL: CRITICAL: puppet fail [14:48:40] PROBLEM - puppet last run on fluorine is CRITICAL: CRITICAL: puppet fail [14:48:41] PROBLEM - puppet last run on mw2080 is CRITICAL: CRITICAL: puppet fail [14:48:41] PROBLEM - puppet last run on mw2075 is CRITICAL: CRITICAL: puppet fail [14:48:41] PROBLEM - puppet last run on mw2087 is CRITICAL: CRITICAL: puppet fail [14:48:41] PROBLEM - puppet last run on mw1091 is CRITICAL: CRITICAL: puppet fail [14:48:41] PROBLEM - puppet last run on planet1001 is CRITICAL: CRITICAL: puppet fail [14:48:41] PROBLEM - puppet last run on db1066 is CRITICAL: CRITICAL: puppet fail [14:48:42] PROBLEM - puppet last run on cp2026 is CRITICAL: CRITICAL: puppet fail [14:48:42] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: puppet fail [14:48:42] PROBLEM - puppet last run on wtp2020 is CRITICAL: CRITICAL: puppet fail [14:48:43] PROBLEM - puppet last run on db1050 is CRITICAL: CRITICAL: puppet fail [14:49:00] PROBLEM - puppet last run on mw2070 is CRITICAL: CRITICAL: puppet fail [14:49:00] PROBLEM - puppet last run on restbase-test2002 is CRITICAL: CRITICAL: puppet fail [14:49:01] PROBLEM - puppet last run on mw1189 is CRITICAL: CRITICAL: puppet fail [14:49:01] PROBLEM - puppet last run on restbase2002 is CRITICAL: CRITICAL: puppet fail [14:49:01] PROBLEM - puppet last run on mw2134 is CRITICAL: CRITICAL: puppet fail [14:49:10] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:49:11] PROBLEM - puppet last run on mw2082 is CRITICAL: CRITICAL: puppet fail [14:49:11] PROBLEM - puppet last run on mw1107 is CRITICAL: CRITICAL: puppet fail [14:49:12] PROBLEM - puppet last run on mw1143 is CRITICAL: CRITICAL: puppet fail [14:49:20] PROBLEM - puppet last run on analytics1035 is CRITICAL: CRITICAL: puppet fail [14:49:21] PROBLEM - puppet last run on lvs1005 is CRITICAL: CRITICAL: puppet fail [14:49:30] PROBLEM - puppet last run on mw2123 is 
CRITICAL: CRITICAL: puppet fail [14:49:30] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: puppet fail [14:49:30] PROBLEM - puppet last run on mc2016 is CRITICAL: CRITICAL: puppet fail [14:49:30] PROBLEM - puppet last run on sinistra is CRITICAL: CRITICAL: puppet fail [14:49:31] PROBLEM - puppet last run on mw1155 is CRITICAL: CRITICAL: puppet fail [14:49:31] PROBLEM - puppet last run on achernar is CRITICAL: CRITICAL: puppet fail [14:49:31] PROBLEM - puppet last run on mw2019 is CRITICAL: CRITICAL: puppet fail [14:49:31] PROBLEM - puppet last run on mw2117 is CRITICAL: CRITICAL: puppet fail [14:49:41] PROBLEM - puppet last run on db2054 is CRITICAL: CRITICAL: puppet fail [14:49:42] PROBLEM - puppet last run on wtp2001 is CRITICAL: CRITICAL: puppet fail [14:49:43] PROBLEM - puppet last run on cp2016 is CRITICAL: CRITICAL: puppet fail [14:49:50] PROBLEM - puppet last run on ganeti1002 is CRITICAL: CRITICAL: puppet fail [14:49:50] PROBLEM - puppet last run on ganeti2004 is CRITICAL: CRITICAL: puppet fail [14:49:51] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: puppet fail [14:49:52] PROBLEM - puppet last run on mw1205 is CRITICAL: CRITICAL: puppet fail [14:50:00] RECOVERY - puppet last run on ms-be3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:50:00] PROBLEM - puppet last run on mw2004 is CRITICAL: CRITICAL: puppet fail [14:50:00] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:50:00] PROBLEM - puppet last run on mw2196 is CRITICAL: CRITICAL: puppet fail [14:50:00] PROBLEM - puppet last run on mw2093 is CRITICAL: CRITICAL: puppet fail [14:50:01] PROBLEM - puppet last run on mw1173 is CRITICAL: CRITICAL: puppet fail [14:50:01] PROBLEM - puppet last run on mw2067 is CRITICAL: CRITICAL: puppet fail [14:50:01] PROBLEM - puppet last run on mw2113 is CRITICAL: CRITICAL: puppet fail [14:50:01] PROBLEM - puppet last run on bast1001 is 
CRITICAL: CRITICAL: puppet fail [14:50:02] PROBLEM - puppet last run on elastic1018 is CRITICAL: CRITICAL: puppet fail [14:50:10] PROBLEM - puppet last run on bast3001 is CRITICAL: CRITICAL: puppet fail [14:50:11] PROBLEM - puppet last run on ms-fe1002 is CRITICAL: CRITICAL: puppet fail [14:50:11] PROBLEM - puppet last run on mw1131 is CRITICAL: CRITICAL: puppet fail [14:50:11] PROBLEM - puppet last run on mw2079 is CRITICAL: CRITICAL: puppet fail [14:50:21] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: puppet fail [14:50:21] PROBLEM - puppet last run on mw2176 is CRITICAL: CRITICAL: puppet fail [14:50:21] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: puppet fail [14:50:22] RECOVERY - puppet last run on ms-be3001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:50:32] PROBLEM - puppet last run on mw2184 is CRITICAL: CRITICAL: puppet fail [14:50:51] PROBLEM - puppet last run on mw2143 is CRITICAL: CRITICAL: puppet fail [14:51:00] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [14:51:01] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:51:32] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [14:51:38] (03PS2) 10Mark Bergsma: Provide Torrus with the (specific) SNMP community used for codfw PDUs [puppet] - 10https://gerrit.wikimedia.org/r/286447 [14:52:51] !log rolling restart of zookeeper to pick up Java update [14:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:54:20] (03CR) 10Mark Bergsma: [C: 032] Provide Torrus with the (specific) SNMP community used for codfw PDUs [puppet] - 10https://gerrit.wikimedia.org/r/286447 (owner: 10Mark Bergsma) [15:01:30] (03CR) 10Luke081515: [C: 031] admin: 'service nodepool restart' for contintadmins [puppet] - 
10https://gerrit.wikimedia.org/r/286428 (https://phabricator.wikimedia.org/T133990) (owner: 10Hashar) [15:02:03] where is jouncebot? [15:02:07] SWAT? [15:02:37] anomie, ostriches, thcipriani, MarkTraceur, MatmaRex_, Urbanecm, mobrovac, aude: swat time [15:02:38] (03CR) 10Dzahn: [C: 031] admin: 'service nodepool restart' for contintadmins [puppet] - 10https://gerrit.wikimedia.org/r/286428 (https://phabricator.wikimedia.org/T133990) (owner: 10Hashar) [15:02:54] yeah [15:02:59] Ready [15:03:01] hehe Krenair, was about to impersonate jouncebot myself :P [15:03:43] (03CR) 10Alex Monk: [C: 032] Set $wgRateLimits['upload'] for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285700 (https://phabricator.wikimedia.org/T132930) (owner: 10Bartosz Dziewoński) [15:05:01] (03PS4) 10Alex Monk: Set $wgRateLimits['upload'] for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285700 (https://phabricator.wikimedia.org/T132930) (owner: 10Bartosz Dziewoński) [15:05:08] (03CR) 10Alex Monk: [C: 032] Set $wgRateLimits['upload'] for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285700 (https://phabricator.wikimedia.org/T132930) (owner: 10Bartosz Dziewoński) [15:05:08] I'm fighting hotel wifi at the moment. I would appreciate it if someone else could SWAT this morning. [15:05:24] thcipriani: seems like Alex is doing that actually? 
;) [15:05:33] certainly going to be doing some of it [15:05:38] (03Merged) 10jenkins-bot: Set $wgRateLimits['upload'] for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285700 (https://phabricator.wikimedia.org/T132930) (owner: 10Bartosz Dziewoński) [15:06:13] (03CR) 10Jforrester: [C: 04-1] Enable user signature in VE in plwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286286 (https://phabricator.wikimedia.org/T133978) (owner: 10Urbanecm) [15:06:52] 06Operations, 10ops-eqiad, 06Analytics-Kanban: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2256909 (10Eevans) >>! In T133785#2256684, @Ottomata wrote: > Hm, we were planning on running 2 cassandra instances per node for a total of 6 instances. > > Just stating the obvious her... [15:07:12] Krenair: thcipriani: poke me as needed. [15:07:16] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/285700/ (duration: 00m 42s) [15:07:20] MatmaRex_, ^ [15:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:07:51] Krenair: thanks [15:09:19] mobrovac, will do yours next [15:09:22] k [15:12:02] there you are jouncebot! 
[15:12:07] I just started it [15:12:16] RECOVERY - puppet last run on logstash1001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [15:12:25] RECOVERY - puppet last run on db1064 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [15:12:26] RECOVERY - puppet last run on cp2025 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [15:12:26] RECOVERY - puppet last run on cp1049 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [15:12:31] thing is I did it wrongly [15:12:36] RECOVERY - puppet last run on mc2002 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [15:12:36] RECOVERY - puppet last run on dbproxy1002 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [15:12:45] RECOVERY - puppet last run on aqs1003 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [15:12:55] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [15:12:56] RECOVERY - puppet last run on elastic2002 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [15:12:56] RECOVERY - puppet last run on mw1197 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [15:13:16] RECOVERY - puppet last run on wtp2002 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [15:13:25] RECOVERY - puppet last run on planet1001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [15:13:26] RECOVERY - puppet last run on db2070 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:13:27] RECOVERY - puppet last run on wtp1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:13:27] RECOVERY - puppet last run on pybal-test2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 
failures [15:13:35] RECOVERY - puppet last run on wtp2020 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [15:13:35] RECOVERY - puppet last run on mc1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:13:36] RECOVERY - puppet last run on rdb1003 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [15:13:36] RECOVERY - puppet last run on mc2012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:13:45] RECOVERY - puppet last run on analytics1043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:13:46] RECOVERY - puppet last run on db2039 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:13:47] RECOVERY - puppet last run on mw2009 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [15:13:48] Krenair: here [15:13:56] RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [15:13:57] RECOVERY - puppet last run on mc2016 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [15:13:57] RECOVERY - puppet last run on mw2164 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [15:13:57] RECOVERY - puppet last run on mw2199 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [15:13:57] RECOVERY - puppet last run on db1044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:13:58] RECOVERY - puppet last run on mw2100 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [15:14:05] RECOVERY - puppet last run on restbase1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:14:06] RECOVERY - puppet last run on db1029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:14:06] RECOVERY - puppet last run on 
restbase-test2002 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [15:14:07] RECOVERY - puppet last run on cp2023 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [15:14:07] RECOVERY - puppet last run on elastic1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:14:16] RECOVERY - puppet last run on elastic2017 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [15:14:17] RECOVERY - puppet last run on mw1254 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [15:14:17] RECOVERY - puppet last run on ms-be2003 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [15:14:17] RECOVERY - puppet last run on elastic1004 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [15:14:17] RECOVERY - puppet last run on ganeti1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:14:25] RECOVERY - puppet last run on db1031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:14:25] RECOVERY - puppet last run on potassium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:14:25] RECOVERY - puppet last run on rdb2006 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [15:14:26] RECOVERY - puppet last run on wtp2005 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [15:14:35] RECOVERY - puppet last run on ganeti2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:14:35] RECOVERY - puppet last run on cp3018 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [15:14:36] RECOVERY - puppet last run on ms-fe2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:14:36] RECOVERY - puppet last run on elastic1018 is OK: OK: Puppet is currently 
enabled, last run 6 seconds ago with 0 failures [15:14:36] RECOVERY - puppet last run on db2009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:14:47] RECOVERY - puppet last run on mw1140 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:14:47] RECOVERY - puppet last run on analytics1057 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:14:48] RECOVERY - puppet last run on mw1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:14:49] RECOVERY - puppet last run on analytics1033 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:14:49] RECOVERY - puppet last run on copper is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [15:14:49] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:14:56] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:14:56] RECOVERY - puppet last run on lvs1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:15:05] RECOVERY - puppet last run on achernar is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [15:15:05] RECOVERY - puppet last run on mw1106 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:15:06] RECOVERY - puppet last run on db1022 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [15:15:06] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [15:15:06] RECOVERY - puppet last run on mw1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:15:06] RECOVERY - puppet last run on fluorine is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [15:15:06] RECOVERY - 
puppet last run on mw1201 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:15:06] RECOVERY - puppet last run on mw1241 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [15:15:07] RECOVERY - puppet last run on ms-be1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:15:07] RECOVERY - puppet last run on mw1024 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:15:16] RECOVERY - puppet last run on mw2064 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [15:15:16] RECOVERY - puppet last run on lvs1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:15:17] RECOVERY - puppet last run on wtp2001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [15:15:17] RECOVERY - puppet last run on mw2075 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [15:15:17] RECOVERY - puppet last run on mw1225 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:15:17] RECOVERY - puppet last run on mw2080 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [15:15:17] RECOVERY - puppet last run on auth1001 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [15:15:17] RECOVERY - puppet last run on db1066 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:15:25] RECOVERY - puppet last run on mw1091 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:15:26] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [15:15:26] RECOVERY - puppet last run on db2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:15:26] RECOVERY - puppet last run on cp2026 is OK: OK: Puppet is currently enabled, last run 55 
seconds ago with 0 failures [15:15:26] RECOVERY - puppet last run on ms-be2006 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:15:27] RECOVERY - puppet last run on db2059 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [15:15:27] RECOVERY - puppet last run on db2045 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [15:15:27] RECOVERY - puppet last run on elastic1008 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [15:15:35] RECOVERY - puppet last run on mw1142 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:15:35] RECOVERY - puppet last run on ms-fe1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:15:36] RECOVERY - puppet last run on mw1150 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [15:15:36] RECOVERY - puppet last run on mw2015 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [15:15:36] RECOVERY - puppet last run on db1050 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [15:15:37] RECOVERY - puppet last run on mw2161 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:15:45] RECOVERY - puppet last run on db1033 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:15:46] RECOVERY - puppet last run on mw2010 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [15:15:46] RECOVERY - puppet last run on bast1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:15:46] RECOVERY - puppet last run on analytics1035 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:15:47] RECOVERY - puppet last run on cp3021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:15:56] RECOVERY - puppet last run on 
mw2031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:15:56] RECOVERY - puppet last run on mw2105 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:16:06] RECOVERY - puppet last run on mw1189 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [15:16:06] RECOVERY - puppet last run on sinistra is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:16:16] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:16:26] RECOVERY - puppet last run on mw1143 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:16:36] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:16:45] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:16:56] RECOVERY - puppet last run on mw2109 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:16:57] RECOVERY - puppet last run on db2054 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [15:17:05] RECOVERY - puppet last run on cp2016 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [15:17:06] RECOVERY - puppet last run on ganeti2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:17:16] RECOVERY - puppet last run on mw2087 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [15:17:17] RECOVERY - puppet last run on mw2188 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [15:17:25] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:17:25] RECOVERY - puppet last run on cp3042 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures 
[15:17:26] RECOVERY - puppet last run on ganeti1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:17:35] RECOVERY - puppet last run on mw1003 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:17:36] RECOVERY - puppet last run on es1014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:17:37] RECOVERY - puppet last run on cp2011 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:17:46] RECOVERY - puppet last run on ms-be1015 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:17:47] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [15:17:55] RECOVERY - puppet last run on mw2004 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [15:17:55] RECOVERY - puppet last run on mw2083 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:17:56] RECOVERY - puppet last run on ms-be2004 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:18:15] RECOVERY - puppet last run on restbase2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:18:16] RECOVERY - puppet last run on mw2019 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [15:18:16] RECOVERY - puppet last run on ms-fe1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:18:16] RECOVERY - puppet last run on mw2117 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:18:26] RECOVERY - puppet last run on mw1107 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:18:26] RECOVERY - puppet last run on mw2082 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:18:36] RECOVERY - puppet last run on mw1173 is OK: OK: Puppet is 
currently enabled, last run 1 minute ago with 0 failures [15:18:37] RECOVERY - puppet last run on mw2093 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [15:18:45] RECOVERY - puppet last run on lvs1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:18:46] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:18:46] RECOVERY - puppet last run on mw2123 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [15:18:55] RECOVERY - puppet last run on mw1155 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:18:55] RECOVERY - puppet last run on mw1204 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:19:05] RECOVERY - puppet last run on elastic2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:19:05] RECOVERY - puppet last run on mw1153 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [15:19:06] RECOVERY - puppet last run on mw1205 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:19:15] RECOVERY - puppet last run on mw2067 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:19:16] RECOVERY - puppet last run on mw2163 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:19:16] RECOVERY - puppet last run on mw2143 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [15:19:47] RECOVERY - puppet last run on mw2114 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:19:47] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:19:55] RECOVERY - puppet last run on mw1131 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:19:56] RECOVERY - puppet 
last run on mw2113 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:20:05] RECOVERY - puppet last run on mw2184 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:20:07] RECOVERY - puppet last run on mw2070 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:20:15] RECOVERY - puppet last run on mw2134 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:20:25] RECOVERY - puppet last run on mw2079 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:20:26] RECOVERY - puppet last run on mw2196 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:20:26] RECOVERY - puppet last run on mw2176 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:20:38] mobrovac, syncing [15:20:43] k [15:21:00] !log krenair@tin Synchronized php-1.27.0-wmf.22/extensions/Math/MathRestbaseInterface.php: https://gerrit.wikimedia.org/r/#/c/286412/ (duration: 00m 26s) [15:21:03] mobrovac, ^ [15:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:21:09] * mobrovac checking [15:21:36] (03PS1) 10Jcrespo: Prepare db1040 for reimaging to jessie [puppet] - 10https://gerrit.wikimedia.org/r/286453 [15:21:36] (03PS3) 10Jforrester: Enable Visual Editor in NS_PROJECT on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286287 (https://phabricator.wikimedia.org/T133980) (owner: 10Urbanecm) [15:21:36] (03CR) 10Jforrester: [C: 031] Enable Visual Editor in NS_PROJECT on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286287 (https://phabricator.wikimedia.org/T133980) (owner: 10Urbanecm) [15:21:48] (03PS2) 10Jcrespo: Prepare db1040 for reimaging to jessie [puppet] - 10https://gerrit.wikimedia.org/r/286453 [15:22:04] 06Operations, 10ops-eqiad, 06Analytics-Kanban: rack/setup/deploy aqs100[456] - 
https://phabricator.wikimedia.org/T133785#2256927 (10Ottomata) > I assume you mean s/partition/array/ above. Indeed danke. > It was my understanding that the hardware order was informed by a prior decision to cluster 3 machines w... [15:22:22] (03PS1) 10Mark Bergsma: Add codfw power to the total aggregates [puppet] - 10https://gerrit.wikimedia.org/r/286455 [15:22:43] kk Krenair, all good [15:22:49] aude, you're next [15:22:49] !log restarting db1040 for reimage [15:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:23:26] ok [15:24:48] (03PS2) 10Mark Bergsma: Add codfw power to the total aggregates [puppet] - 10https://gerrit.wikimedia.org/r/286455 [15:26:29] (03CR) 10Mark Bergsma: [C: 032] Add codfw power to the total aggregates [puppet] - 10https://gerrit.wikimedia.org/r/286455 (owner: 10Mark Bergsma) [15:26:30] jouncebot, next [15:26:30] In 1 hour(s) and 33 minute(s): Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160502T1700) [15:28:26] (03CR) 10BBlack: [C: 032] Revert "Set esams down" [dns] - 10https://gerrit.wikimedia.org/r/286438 (owner: 10Giuseppe Lavagetto) [15:28:46] looks like jenkins is almost done [15:28:59] !log re-pooling esams [15:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:31:00] aude, syncing [15:31:14] ok [15:32:57] !log krenair@tin Synchronized php-1.27.0-wmf.22/extensions/Wikidata: https://gerrit.wikimedia.org/r/#/c/286434/2 (duration: 02m 02s) [15:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:33:05] aude, ^ [15:33:10] checking [15:33:20] looks ok, afaik [15:33:48] we'll get a new rdf dump later today (and tested on friday on terbium) [15:33:55] ok [15:34:05] thanks [15:34:09] Urbanecm [15:34:19] (03CR) 10Giuseppe Lavagetto: [C: 04-1] Move jobrunner ferm service into the roles (032 comments) [puppet] - 
10https://gerrit.wikimedia.org/r/286415 (owner: 10Muehlenhoff) [15:34:22] ready [15:36:42] (03CR) 10Alex Monk: [C: 032] Add interface editor user group on pswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286285 (https://phabricator.wikimedia.org/T133472) (owner: 10Urbanecm) [15:39:20] (03PS2) 10Alex Monk: Add interface editor user group on pswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286285 (https://phabricator.wikimedia.org/T133472) (owner: 10Urbanecm) [15:39:25] (03CR) 10Alex Monk: [C: 032] Add interface editor user group on pswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286285 (https://phabricator.wikimedia.org/T133472) (owner: 10Urbanecm) [15:40:01] (03Merged) 10jenkins-bot: Add interface editor user group on pswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286285 (https://phabricator.wikimedia.org/T133472) (owner: 10Urbanecm) [15:40:30] (03PS3) 10BBlack: cache_maps: varnish4-only [puppet] - 10https://gerrit.wikimedia.org/r/286427 [15:40:55] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/286285/ (duration: 00m 25s) [15:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:41:08] (03CR) 10BBlack: [C: 032 V: 032] cache_maps: varnish4-only [puppet] - 10https://gerrit.wikimedia.org/r/286427 (owner: 10BBlack) [15:41:25] James_F, does https://gerrit.wikimedia.org/r/#/c/286286/ have your support? [15:41:38] (03Abandoned) 10BBlack: Revert "Return a custom HTTP 503 response for all the stat1001 websites due to maintenance." 
[puppet] - 10https://gerrit.wikimedia.org/r/286418 (owner: 10Elukey) [15:41:55] (03PS4) 10Alex Monk: Enable Visual Editor in NS_PROJECT on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286287 (https://phabricator.wikimedia.org/T133980) (owner: 10Urbanecm) [15:42:00] (03CR) 10Alex Monk: [C: 032] Enable Visual Editor in NS_PROJECT on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286287 (https://phabricator.wikimedia.org/T133980) (owner: 10Urbanecm) [15:42:26] (03PS1) 10Elukey: Update httpd access directive to only use Require instead of old Order/Allow/Deny. [puppet] - 10https://gerrit.wikimedia.org/r/286461 (https://phabricator.wikimedia.org/T76348) [15:42:57] (03Merged) 10jenkins-bot: Enable Visual Editor in NS_PROJECT on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286287 (https://phabricator.wikimedia.org/T133980) (owner: 10Urbanecm) [15:44:07] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/286287/ (duration: 00m 25s) [15:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:44:32] (03PS2) 10Ottomata: Update httpd access directive to only use Require instead of old Order/Allow/Deny. [puppet] - 10https://gerrit.wikimedia.org/r/286461 (https://phabricator.wikimedia.org/T76348) (owner: 10Elukey) [15:45:29] (03PS9) 10Alex Monk: Remove 'https -> http' rewrite for IRC notifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/217858 (https://phabricator.wikimedia.org/T122933) (owner: 10Faidon Liambotis) [15:46:12] (03CR) 10Ottomata: [C: 032] Update httpd access directive to only use Require instead of old Order/Allow/Deny. [puppet] - 10https://gerrit.wikimedia.org/r/286461 (https://phabricator.wikimedia.org/T76348) (owner: 10Elukey) [15:47:37] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: puppet fail [15:48:28] (03CR) 10Jforrester: "Eh. OK. 
Please add a comment for each numbered namespace explaining what it is, and change the commit message to reflect that it provides " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286286 (https://phabricator.wikimedia.org/T133978) (owner: 10Urbanecm) [15:48:42] Krinkle, hi... [15:49:10] I think there's a problem with this change: https://gerrit.wikimedia.org/r/#/c/217858/ [15:49:36] I just looked at the feed and it's already sending https urls [15:51:16] (03PS3) 10Elukey: Update stat1001's httpd access directive to only use Require instead of old Order/Allow/Deny. [puppet] - 10https://gerrit.wikimedia.org/r/286461 (https://phabricator.wikimedia.org/T76348) [15:51:27] (03PS2) 10Alex Monk: Add nlwiki to deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283689 (https://phabricator.wikimedia.org/T118005) [15:51:33] (03CR) 10Alex Monk: [C: 032] Add nlwiki to deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283689 (https://phabricator.wikimedia.org/T118005) (owner: 10Alex Monk) [15:51:59] (03PS4) 10Elukey: Update stat1001's httpd access directive to only use Require instead of old Order/Allow/Deny. 
[puppet] - 10https://gerrit.wikimedia.org/r/286461 (https://phabricator.wikimedia.org/T76348) [15:52:04] (03Merged) 10jenkins-bot: Add nlwiki to deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283689 (https://phabricator.wikimedia.org/T118005) (owner: 10Alex Monk) [15:53:02] !log krenair@tin Synchronized dblists/all-labs.dblist: https://gerrit.wikimedia.org/r/#/c/283689/ (duration: 00m 26s) [15:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:53:34] (03PS10) 10Ottomata: Add new confluent module and puppetization for using confluent Kafka [puppet] - 10https://gerrit.wikimedia.org/r/284349 (https://phabricator.wikimedia.org/T132631) [15:53:37] !log krenair@tin Synchronized wikiversions-labs.json: https://gerrit.wikimedia.org/r/#/c/283689/ (duration: 00m 25s) [15:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:54:25] (03PS3) 10Urbanecm: Enable user signature in VE in plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286286 (https://phabricator.wikimedia.org/T133978) [15:55:00] bd808: any chance you could have a look at https://gerrit.wikimedia.org/r/#/c/286410/? [15:55:30] bd808: I'm removing multicast also from logstash, I want to make sure you know about it before deploying [15:56:20] gehel: *nod* It should be fine. We've actually always provided unicast seeds for cluster discovery I think [15:56:52] (03CR) 10Urbanecm: "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286286 (https://phabricator.wikimedia.org/T133978) (owner: 10Urbanecm) [15:56:55] bd808: yep, unicast was already configured, but multicast was not disabled. 
It *should* have zero impact [15:57:08] (03CR) 10BryanDavis: [C: 031] Remove multicast from Elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/286410 (https://phabricator.wikimedia.org/T110236) (owner: 10Gehel) [15:57:46] (03PS5) 10Jforrester: Prompt signing in NS_USER, Portal, Wikiproject on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286286 (https://phabricator.wikimedia.org/T133978) (owner: 10Urbanecm) [15:57:53] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:57:59] (03CR) 10Jforrester: [C: 031] "I'm not sure it's a great idea, but fine." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286286 (https://phabricator.wikimedia.org/T133978) (owner: 10Urbanecm) [15:58:47] Krenair: Could we deploy https://gerrit.wikimedia.org/r/#/c/286286/ too please? Thanks! [15:59:28] (03CR) 10Alex Monk: [C: 032] Prompt signing in NS_USER, Portal, Wikiproject on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286286 (https://phabricator.wikimedia.org/T133978) (owner: 10Urbanecm) [16:00:15] (03Merged) 10jenkins-bot: Prompt signing in NS_USER, Portal, Wikiproject on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286286 (https://phabricator.wikimedia.org/T133978) (owner: 10Urbanecm) [16:01:19] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/286286/ (duration: 00m 26s) [16:01:23] Urbanecm, ^ [16:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:01:27] swat is done [16:01:29] jouncebot, next [16:01:29] In 0 hour(s) and 58 minute(s): Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160502T1700) [16:01:45] Thanks Krenair [16:01:59] Thanks Urbanecm too. :-) [16:02:35] You're welcome [16:02:36] (03CR) 10Alex Monk: [C: 04-1] "It looks like someone else already did this. . ." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/217858 (https://phabricator.wikimedia.org/T122933) (owner: 10Faidon Liambotis) [16:04:33] (03PS1) 10Elukey: Remove NameVirtualHost directive from datasets.wikimedia.org config (2.4 upgrade). [puppet] - 10https://gerrit.wikimedia.org/r/286466 (https://phabricator.wikimedia.org/T76348) [16:06:37] (03CR) 10Elukey: [C: 032] Remove NameVirtualHost directive from datasets.wikimedia.org config (2.4 upgrade). [puppet] - 10https://gerrit.wikimedia.org/r/286466 (https://phabricator.wikimedia.org/T76348) (owner: 10Elukey) [16:16:55] aude, think I've found a bug in wikibase [16:17:05] (03PS11) 10Eevans: [WIP]: Cassandra 2.2.5 config [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) [16:18:49] (03PS2) 10Dzahn: DHCP: changing the install to trusty to test since jessie is not detecting the disks Bug: T132976 [puppet] - 10https://gerrit.wikimedia.org/r/286266 (https://phabricator.wikimedia.org/T132976) (owner: 10Papaul) [16:19:18] (03CR) 10Dzahn: [C: 032] DHCP: changing the install to trusty to test since jessie is not detecting the disks Bug: T132976 [puppet] - 10https://gerrit.wikimedia.org/r/286266 (https://phabricator.wikimedia.org/T132976) (owner: 10Papaul) [16:24:38] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 06Labs, 10hardware-requests: eqiad: (2) Relevance forge servers - https://phabricator.wikimedia.org/T131184#2257101 (10EBernhardson) This project does replace nobelium. Nobelium will be decommissioned and returned to the pool. I think the differenc... [16:31:07] (03PS1) 10Alex Monk: restbase: Add beta nlwiki [puppet] - 10https://gerrit.wikimedia.org/r/286476 (https://phabricator.wikimedia.org/T118005) [16:32:57] Krinkle: due to https://phabricator.wikimedia.org/T133515, there's going to be maybe a month or so during which you won't be able to update nagf yourself, starting in a week or two. 
Hopefully that is ok [16:36:06] 06Operations, 07Blocked-on-RelEng, 05Gitblit-Deprecate, 13Patch-For-Review: Phase out antimony.wikimedia.org (git.wikimedia.org / gitblit) - https://phabricator.wikimedia.org/T123718#2257143 (10demon) [16:36:53] 06Operations, 07Blocked-on-RelEng, 05Gitblit-Deprecate, 13Patch-For-Review: Phase out antimony.wikimedia.org (git.wikimedia.org / gitblit) - https://phabricator.wikimedia.org/T123718#2257149 (10Dzahn) @demon Is furud going to be used or not needed anymore? [16:39:01] Krenair ? [16:39:04] (03PS12) 10Eevans: [WIP]: Cassandra 2.2.5 config [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) [16:39:51] aude, I sent a change to fix it [16:40:53] ok [16:41:10] oh that [16:41:17] sorry... [16:41:24] (03CR) 10DCausse: [C: 031] Remove multicast from Elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/286410 (https://phabricator.wikimedia.org/T110236) (owner: 10Gehel) [16:41:27] 06Operations, 10Analytics, 05Security, 07audits-data-retention: Purge > 90 days stat1002:/a/squid/archive/api - https://phabricator.wikimedia.org/T92338#2257172 (10Nuria) [16:41:35] 06Operations, 10Analytics, 05Security, 07audits-data-retention: Purge > 90 days stat1002:/a/squid/archive/glam_nara - https://phabricator.wikimedia.org/T92340#2257175 (10Nuria) [16:41:40] 06Operations, 10Analytics, 06Zero, 07Mobile, and 2 others: Purge > 90 days stat1002:/a/squid/archive/mobile - https://phabricator.wikimedia.org/T92341#2257177 (10Nuria) [16:41:44] 06Operations, 10Analytics, 06Zero, 05Security, 07audits-data-retention: Purge > 90 days stat1002:/a/squid/archive/sampled - https://phabricator.wikimedia.org/T92342#2257179 (10Nuria) [16:41:47] 06Operations, 10Analytics, 06Zero, 05Security, 07audits-data-retention: Purge > 90 days stat1002:/a/squid/archive/zero - https://phabricator.wikimedia.org/T92343#2257181 (10Nuria) [16:42:11] Krenair: ok for now, but when you have time, would welcome input on 
https://phabricator.wikimedia.org/T90617 [16:47:29] (03PS2) 10Dzahn: admin: 'service nodepool restart' for contintadmins [puppet] - 10https://gerrit.wikimedia.org/r/286428 (https://phabricator.wikimedia.org/T133990) (owner: 10Hashar) [16:47:36] (03CR) 10Dzahn: [C: 032] admin: 'service nodepool restart' for contintadmins [puppet] - 10https://gerrit.wikimedia.org/r/286428 (https://phabricator.wikimedia.org/T133990) (owner: 10Hashar) [16:47:42] thcipriani: ^^^ [16:48:30] hashar: speak of the patch, and it shall appear to you :) [16:48:41] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/286428 (https://phabricator.wikimedia.org/T133990) (owner: 10Hashar) [16:49:07] restart is apparently identical to "stop/start" [16:49:21] but yea [16:50:49] mutante: yeah that is merely for convenience [16:51:01] yep :) [16:51:25] thcipriani: then you can retitle the task ;-} [16:51:29] I am off for dinner [16:52:11] mutante: yeah, the original task was for a broader expansion of permissions for releng on that node. [16:52:22] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: add nodepool restart to contint-admins - https://phabricator.wikimedia.org/T133990#2251528 (10Dzahn) should be resolved after puppet runs (max. 30 min) [16:52:46] thcipriani: we have sudo rule to force run puppet on labnodepool if you want to give its try [16:52:54] thcipriani: makes sense, and i agree it was missing in the first place [16:53:18] thcipriani: should be: sudo /usr/local/sbin/puppet-run , does not yield any output though [16:53:29] but once it has completed sudo -l should show the restart being granted [16:54:11] * thcipriani runs puppet [16:54:35] thcipriani: also: cat /etc/sudoers.d/contint* [16:54:59] mutante: did anyone test that restart for that script actually works? 
[16:55:33] jzerebecki: i think that's happening in a second [16:56:34] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 06Labs, 10hardware-requests: eqiad: (2) Relevance forge servers - https://phabricator.wikimedia.org/T131184#2158516 (10yuvipanda) So my understanding of the commitment from the labs team is: 1. We poke some holes in the labs-instance/labs-support fi... [16:57:06] * jzerebecki shadows thcipriani [16:57:15] :) [16:58:16] sudo -l now shows /usr/sbin/service nodepool restart for my user [17:00:04] gehel: Dear anthropoid, the time has come. Please deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160502T1700). [17:00:12] :) sounds resolved then.. well the access part [17:00:21] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: puppet fail [17:00:25] now it just needs to actaully restart [17:01:21] SMalyshev: are you around? [17:02:58] (03Restored) 10Dzahn: add AAAA record for argon (irc,rc streams) [dns] - 10https://gerrit.wikimedia.org/r/214506 (https://phabricator.wikimedia.org/T105422) (owner: 10Dzahn) [17:03:16] (03PS7) 10Dzahn: add AAAA record for argon (irc) [dns] - 10https://gerrit.wikimedia.org/r/214506 (https://phabricator.wikimedia.org/T105422) [17:06:09] 06Operations, 07Blocked-on-RelEng, 05Gitblit-Deprecate, 13Patch-For-Review: Phase out antimony.wikimedia.org (git.wikimedia.org / gitblit) - https://phabricator.wikimedia.org/T123718#2257261 (10demon) Nah, we won't need it after all. Plan to have patches up for decom by end of the week. [17:07:41] 06Operations, 07Blocked-on-RelEng, 05Gitblit-Deprecate, 13Patch-For-Review: Phase out antimony.wikimedia.org (git.wikimedia.org / gitblit) - https://phabricator.wikimedia.org/T123718#2257262 (10Dzahn) Ok, thanks. I'll help with that. Just wanted to know. Focus on killing antimony first :) [17:08:10] mobrovac: You about? 
We're working on the new restbase codfw systems [17:08:11] and we have an issue with disk detection in the installer [17:08:24] I'd like to take offline one of restbase200[1-6] to compare it [17:08:27] since they should be identical. [17:08:33] uh [17:08:41] if thats possible [17:08:50] urandom: ^ ? [17:09:06] I realize it sucks I'm asking you to further reduce your footprint =[ [17:09:33] robh: no, that should be fine I think [17:09:51] I'm not sure why the older systems installed perfectly when we did them, but these systems with identical hardware aren't. My next step is to comapre all the bios settings, and then the controller settings, and then have papaul open them up and compare cabling [17:09:54] cool [17:10:04] is there a particular one that is best [17:10:05] ? [17:10:18] robh: moritz suggested also to boot from a live CD [17:10:18] mmm, maybe 2004? [17:10:19] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 06Labs, 10hardware-requests: eqiad: (2) Relevance forge servers - https://phabricator.wikimedia.org/T131184#2257267 (10EBernhardson) Sounds accurate to me [17:10:21] rb2004 probably [17:10:25] 06Operations, 10Wikimedia-IRC-RC-Server, 07IPv6: enable IPv6 on irc.wikimedia.org - https://phabricator.wikimedia.org/T105422#2257268 (10Dzahn) @Krinkle should we add the record during migration anyways? [17:10:25] robh: ^ [17:10:36] it really shouldn't matter which, though [17:10:47] mobrovac: awesome, we'll do a clean shutdown of it. it may come back online and have to offline again, is there someplace we can just depool it? [17:11:02] so we don't get load on it to just reboot again [17:11:27] papaul: Yep, that may be our next bet, but I want to compare to a known good host. [17:11:38] robh:ok [17:11:43] step, not bet, you know what i mean =] [17:11:49] robh: codfw is not used for live traffic, so no need [17:12:01] mobrovac: awesome, thank you both [17:12:44] ok, papaul since codfw isnt live, we dont have to rush this. 
lets take down restbase2004 into the bios, and you should do a direct comparison of it against one of the new systems (id advise 2007 or 2008 since they have the same ssds) [17:12:55] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team, 13Patch-For-Review: add nodepool restart to contint-admins - https://phabricator.wikimedia.org/T133990#2257272 (10Dzahn) 05Open>03Resolved a:03Dzahn 10:03 < thcipriani> sudo -l now shows /usr/s... [17:12:59] dont change anything on 2004. i'll shut it down cleanly now [17:13:11] and make either 2007 or 2008 match identically in bios and controller settings [17:13:29] I can do remotely, but its much faster with a crash cart on these sometimes, up to you [17:13:33] 06Operations, 10Ops-Access-Requests, 10Continuous-Integration-Infrastructure, 06Release-Engineering-Team: Allow RelEng nova log access - https://phabricator.wikimedia.org/T133992#2251644 (10Dzahn) this has been approved in ops meeting [17:13:45] i found getting into the hp bios annoying when you miss the 2 second window ;] [17:14:14] Once the bios and controller settings are confirmed identical, and if you didnt have to change anything [17:14:16] (03PS1) 10Elukey: Revert "Return a custom HTTP 503 response for all the stat1001 websites due to maintenance." [puppet] - 10https://gerrit.wikimedia.org/r/286481 [17:14:27] you should pop open the cases and ensure the new systems backplanes are wired the same [17:14:40] we may have just a disconnected or mis-connected cable. [17:14:47] thcipriani: uh you didn't actually run restart? [17:14:49] recall that 2004 works, so no changes to it =] [17:15:19] papaul: restbase2004 will shutdown in 1 minute. [17:15:23] jzerebecki: no I didn't. [17:15:55] (03PS2) 10Elukey: Revert "Return a custom HTTP 503 response for all the stat1001 websites due to maintenance." [puppet] - 10https://gerrit.wikimedia.org/r/286481 [17:18:03] restbase2004 down, scheduled? 
[17:18:17] yes [17:18:26] should read 3 lines before, sorry [17:19:04] (in my defense, "downtime") [17:19:20] PROBLEM - Host restbase2004 is DOWN: PING CRITICAL - Packet loss = 100% [17:20:10] sorry, i should have put in maint [17:20:13] doing so now [17:20:21] but it may flap back up though [17:20:37] not ack, downtime for days if needed [17:20:53] it is free! [17:20:59] pages aren't! [17:21:07] :-D [17:21:08] it paged you? [17:21:13] no [17:22:23] ok, its downtimes (and all services) for the next few hours [17:22:36] yeah, pages to the eu are most certainly not free =P [17:23:49] !log restbase2004 offline for next few hours for comparison work for new systems T132976 [17:23:50] T132976: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976 [17:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:28:21] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:31:02] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/2643/cp1061.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/286481 (owner: 10Elukey) [17:33:02] thcipriani: then I'm doing that now, you have a fire extinguisher ready? yay disintegration testing! [17:35:40] (03PS1) 10Hashar: admin: contint-admins no more need postgres [puppet] - 10https://gerrit.wikimedia.org/r/286484 [17:36:08] what is sinistra? [17:36:20] the host name in icinga [17:36:45] it has RAID fail [17:39:08] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, and 3 others: "Elastica: missing curl_init_pooled method" due to mwscript job running with PHP 5 on terbium - https://phabricator.wikimedia.org/T132751#2257317 (10EBernhardson) [17:39:12] you don't mean sinister? 
[17:39:37] (03PS1) 10EBernhardson: cirrus: Only use curl pools on hhvm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286485 (https://phabricator.wikimedia.org/T132751) [17:39:56] jzerebecki: no, sinistra as in " installing sinistra :new mw log host" i just found in SAL [17:42:11] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 676 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5411761 keys - replication_delay is 676 [17:42:25] jzerebecki: ready as I'll ever be :) [17:42:53] 06Operations: sinistra - RAID failure - https://phabricator.wikimedia.org/T134187#2257327 (10Dzahn) [17:43:17] 06Operations, 10ops-codfw: sinistra - RAID failure - https://phabricator.wikimedia.org/T134187#2257339 (10Dzahn) [17:44:05] ACKNOWLEDGEMENT - RAID on sinistra is CRITICAL: CRITICAL: Active: 6, Working: 6, Failed: 2, Spare: 0 daniel_zahn https://phabricator.wikimedia.org/T134187 [17:44:44] jzerebecki: there is a rake-jessie job running, so that's good :) [17:44:51] thcipriani: thx [17:48:53] (03PS1) 10Dzahn: add sinistra.codfw to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/286487 (https://phabricator.wikimedia.org/T128796) [17:49:06] mutante: my best guess is a future replacement for that host that has actual files of the mediawiki logs as opposed to logstash [17:50:20] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5410898 keys - replication_delay is 0 [17:55:34] robh: compared restbase2004 and restbase2008 all the cables are in place the same way [17:55:51] same configuration [17:56:04] how about bios and controller config in software? [17:56:05] robh: i am going to check the BIOS [17:56:08] cool [17:56:14] i was hoping it was something easy like cables, oh well! [17:57:21] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:57:26] (03CR) 10Dzahn: [C: 032] add sinistra.codfw to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/286487 (https://phabricator.wikimedia.org/T128796) (owner: 10Dzahn) [17:58:11] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:59:44] (03CR) 10Krinkle: "No? Latest master: https://github.com/wikimedia/operations-mediawiki-config/blob/41e8da24330fc2aedd61127861d4493ee32dd631/wmf-config/Commo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/217858 (https://phabricator.wikimedia.org/T122933) (owner: 10Faidon Liambotis) [17:59:49] Krenair: ^ [17:59:51] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:00:21] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:00:56] Krinkle, I saw that but it no longer works... look at the feed [18:03:51] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [18:07:07] Krenair: Hm.. indeed [18:07:22] (03PS10) 10Krinkle: Remove 'https -> http' rewrite for IRC notifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/217858 (https://phabricator.wikimedia.org/T122933) (owner: 10Faidon Liambotis) [18:10:02] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:10:21] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy [18:11:10] Krenair: array(2) { [18:11:10] ["$wgInternalServer"]=> [18:11:11] string(20) "//test.wikipedia.org" [18:11:11] ["$wgServer"]=> [18:11:11] string(20) "//test.wikipedia.org" [18:11:11] } [18:11:20] Over https://test.wikipedia.org/w/krinkle.php (mw1017) [18:11:28] ["HTTP_X_FORWARDED_PROTO"]=> [18:11:28] string(5) "https" [18:11:31] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [18:11:47] Krenair: It assumed it was not protocol relative. 
[18:11:51] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [18:12:01] It predates us going HTTPS everywhere [18:12:10] it was added to avoid opt-in HTTPS requests from producing HTTPS notifications [18:12:17] HTTP was canonical at the time [18:12:33] and probably inside MediaWiki we do (as we should) expandUrl with PROTO_CANONICAL [18:14:46] (03PS11) 10Krinkle: Remove obsolete 'https -> http' rewrite for IRC notifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/217858 (https://phabricator.wikimedia.org/T122933) (owner: 10Faidon Liambotis) [18:15:02] (03CR) 10Krinkle: [C: 031] "Should be safe to merge anytime, no impact." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/217858 (https://phabricator.wikimedia.org/T122933) (owner: 10Faidon Liambotis) [18:17:21] SMalyshev: I'm getting ready to deploy latest WDQS. Just finishing to run test queries on beta [18:17:42] gehel: great [18:20:32] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:22:00] SMalyshev: running the test queries on beta, minus.sparql is taking forever and finally fails. Is that expected? [18:22:22] gehel: depends on the query. Some of them are long [18:22:30] yeah, minus is one of those [18:22:46] Ok, so let's pretend I did not see it... [18:23:27] ottomata: Do you know why wmgUseEventBus is false for wikitech? (I understand for private, login and vote) [18:24:33] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 643 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5413747 keys - replication_delay is 643 [18:24:48] !log deploying latest WDQS version [18:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:27:37] Krinkle: I don't know, Pchelolo might? [18:27:41] or maybe urandom? [18:28:35] Krinkle: I do not know [18:28:52] gwicke: ^ [18:29:05] It should be reachable, right? 
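[editor's note] The protocol-relative `$wgServer` behaviour Krinkle describes above ("expandUrl with PROTO_CANONICAL") can be illustrated with a small stand-in — this is a simplified Python sketch, not MediaWiki's actual implementation:

```python
# Simplified stand-in for MediaWiki-style URL expansion with a canonical
# protocol: a protocol-relative URL such as //test.wikipedia.org gets the
# canonical protocol prefixed; fully-qualified URLs pass through unchanged.
def expand_url(url: str, canonical_proto: str = "https") -> str:
    if url.startswith("//"):
        return f"{canonical_proto}:{url}"
    return url

print(expand_url("//test.wikipedia.org"))        # https://test.wikipedia.org
print(expand_url("//test.wikipedia.org", "http"))  # http://test.wikipedia.org
print(expand_url("http://example.org"))          # http://example.org
```

This matches the history described in the channel: while HTTP was still the canonical protocol, a protocol-relative `$wgServer` expanded to `http://` URLs in the IRC notifications, which is what the now-obsolete rewrite compensated for.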
[18:31:29] SMalyshev, aude: latest WDQS deployed, test queries passing. I'll let you do other checks if needed. [18:31:39] great, thanks! [18:31:48] Krinkle: I'm not aware of any specific reason [18:32:06] (03PS3) 10Dzahn: Don't publish Wikidata dumps if a shared failed [puppet] - 10https://gerrit.wikimedia.org/r/286411 (https://phabricator.wikimedia.org/T133924) (owner: 10Hoo man) [18:32:44] (03PS4) 10Dzahn: Don't publish Wikidata dumps if a shard failed [puppet] - 10https://gerrit.wikimedia.org/r/286411 (https://phabricator.wikimedia.org/T133924) (owner: 10Hoo man) [18:32:44] Krinkle: what does `git blame` have to say about that line? [18:32:56] gwicke: ottomata : https://gerrit.wikimedia.org/r/#/c/266564/1/wmf-config/InitialiseSettings.php [18:32:57] gehel: looks good [18:33:05] (03CR) 10Dzahn: [C: 032] Don't publish Wikidata dumps if a shard failed [puppet] - 10https://gerrit.wikimedia.org/r/286411 (https://phabricator.wikimedia.org/T133924) (owner: 10Hoo man) [18:33:44] (03PS1) 10Rillke: Add UploadsLink to production extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286494 (https://phabricator.wikimedia.org/T130018) [18:33:53] Krinkle: kk, best to ask urandom about it, I think [18:38:26] (03Abandoned) 10Rillke: Add UploadsLink to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281646 (https://phabricator.wikimedia.org/T130018) (owner: 10Rillke) [18:38:58] (03PS3) 10Nuria: Read values inbound in X-Analytics header (pageview and preview) [puppet] - 10https://gerrit.wikimedia.org/r/285051 (https://phabricator.wikimedia.org/T133204) [18:40:18] 06Operations, 10ops-codfw: sinistra - RAID failure - https://phabricator.wikimedia.org/T134187#2257430 (10Dzahn) sinistra is not in service yet, so it's not affecting anything right now, but it's going to be the new MW logging host for codfw, -> T128796 [18:40:25] (03CR) 10Nuria: Read values inbound in X-Analytics header (pageview and preview) (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/285051 (https://phabricator.wikimedia.org/T133204) (owner: 10Nuria) [18:40:40] (03PS4) 10Nuria: Read values inbound in X-Analytics header (pageview and preview) [puppet] - 10https://gerrit.wikimedia.org/r/285051 (https://phabricator.wikimedia.org/T133204) [18:41:06] 06Operations, 10ops-codfw: sinistra - RAID failure - https://phabricator.wikimedia.org/T134187#2257436 (10Dzahn) [18:41:09] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, and 2 others: rack new mw log host - sinistra - https://phabricator.wikimedia.org/T128796#2086341 (10Dzahn) [18:44:43] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5416446 keys - replication_delay is 0 [18:44:45] (03CR) 10Dzahn: [C: 031] "confirmed the other poolcounter already has it. so that seems good.. still.. poolcounter. wah :)" [puppet] - 10https://gerrit.wikimedia.org/r/286148 (owner: 10Muehlenhoff) [18:56:15] (03CR) 10Dzahn: [C: 031] "compiler diff, looks right: http://puppet-compiler.wmflabs.org/2644/ocg1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/286068 (https://phabricator.wikimedia.org/T133864) (owner: 10Cscott) [19:00:10] (03CR) 10Dzahn: "maybe we can put the "$ocg_decommission = true" part into hiera, in ./hieradata/hosts/ocg1003.yaml instead of site.pp ?" [puppet] - 10https://gerrit.wikimedia.org/r/286070 (https://phabricator.wikimedia.org/T84723) (owner: 10Cscott) [19:17:50] !log manually removing 2fa from my own wikitech account, adding it back .. [19:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:24:14] anybody familiar with new_wmf_service.py? 
Getting some strange errors from it: https://phabricator.wikimedia.org/P2990 [19:24:18] (03CR) 10Dzahn: "added to Evening SWAT deploy today 23:00–00:00 UTC" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286258 (https://phabricator.wikimedia.org/T134017) (owner: 10Dzahn) [19:25:39] Krinkle: was there a specific time for irc.wm, or just May 2 [19:27:23] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2257519 (10Dzahn) added to Evening SWAT deploy today 23:00–00:00 UTC https://wikitech.wikimedia.org/wiki/Deployments#Monday.2C.C2.A0May.C2.A002 htt... [19:27:51] !log aaron@tin Synchronized php-1.27.0-wmf.22/includes/filebackend/FileBackendMultiWrite.php: 63b2d7b2eae (duration: 00m 32s) [19:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:31:02] Hi. [19:31:39] ottomata: do you know about new_wmf_service.py? [19:31:39] mutante: for previous wikis, a whole specific window was dedicated for the creation, as there is db work [19:31:55] SMalyshev: no [19:32:02] what is? [19:32:30] mutante: yet, I think the patch is comprehensive, contains all the required config [19:32:40] Dereckson: hey, but are you saying it cant be merged and breaks anything? [19:32:55] if it doesnt, then i'm for just moving forward, as there is always more to do [19:33:04] ottomata: you are on the commit list for it, so I thought maybe you know :) I'll wait for akosiaris then. [19:33:36] mutante: you could ask Krenair if he would be comfortable to handle it during a SWAT as he did ady.wikipedia recently. 
[19:33:54] Dereckson: ok, i won [19:34:03] i won't be mad if it gets rejected for swat :) [19:34:32] (03CR) 10BBlack: [C: 04-1] Read values inbound in X-Analytics header (pageview and preview) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/285051 (https://phabricator.wikimedia.org/T133204) (owner: 10Nuria) [19:34:32] From what I've seen of the procedure, it's not as much the matter of breaking things than the duration. [19:35:58] it doesnt have to happen all at once [19:36:05] perfect is the enemy of good [19:37:22] Dereckson: but back to irc.wm.org for now. are you involved in that at all? [19:39:44] No, I'm not. But if nobody else takes the patch, as it has been planned for some weeks to this SWAT, I'm okay to watch it and test if works fine on irc.wm.org after deploy. [19:41:07] which one? [19:41:23] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 644 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5426260 keys - replication_delay is 644 [19:41:53] mutante: T122933 / 217858 [19:41:53] T122933: Remove the "HTTPS to HTTP" url filter in the IRC feed - https://phabricator.wikimedia.org/T122933 [19:42:37] Dereckson: ah.. hmm . Do you know of a specific time for that or just "May 2nd"? [19:42:55] also, we could add the IPv6 now [19:43:09] It's a date fixed in the Gerrit discussion to be 1st may [19:43:13] reported to the 2 as 1 is a Sunday [19:43:32] it was a matter of "ok we can technically do it now but needs to announce it [19:43:35] " [19:43:52] Dereckson: T123729 [19:43:52] T123729: Migrate argon (irc.wikimedia.org) to Jessie - https://phabricator.wikimedia.org/T123729 [19:44:18] https://phabricator.wikimedia.org/T123729#2216681 [19:44:46] So you wish to postpone the change to HTTPS to be right after the migration or a part of it? 
[19:44:50] and https://phabricator.wikimedia.org/T105422 [19:45:15] no, i'm asking when the migration is planned to happen [19:45:39] yea, all part of it [19:45:47] also the IPv6 part.. now or never.. i guess [19:46:23] Krinkle and Krenair > any thought on this ? ^ [19:46:52] the ircd is runnign and the bot is up, but it doesnt output stuff yet [19:47:13] it probably will once the appservers start sending data [19:49:42] PROBLEM - puppet last run on copper is CRITICAL: CRITICAL: Puppet has 1 failures [19:49:48] (03PS1) 10Dzahn: add IPv6 record for kraz.wm.org [dns] - 10https://gerrit.wikimedia.org/r/286504 (https://phabricator.wikimedia.org/T105422) [19:50:14] (03PS2) 10Dzahn: add IPv6 record for kraz.wm.org [dns] - 10https://gerrit.wikimedia.org/r/286504 (https://phabricator.wikimedia.org/T105422) [19:50:46] (03CR) 10Dzahn: [C: 032] add IPv6 record for kraz.wm.org [dns] - 10https://gerrit.wikimedia.org/r/286504 (https://phabricator.wikimedia.org/T105422) (owner: 10Dzahn) [19:51:03] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5421663 keys - replication_delay is 0 [19:51:05] (03PS1) 10Brion VIBBER: Enable $wgMFStripResponsiveImages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286505 [19:51:38] 06Operations, 10Wikimedia-IRC-RC-Server, 07IPv6, 13Patch-For-Review: enable IPv6 on irc.wikimedia.org - https://phabricator.wikimedia.org/T105422#2257598 (10Dzahn) 05declined>03Open [19:52:17] (03PS1) 10BBlack: multicert + libssl1.0.2 patches for 1.10.0 [software/nginx] (wmf-1.10.0-1) - 10https://gerrit.wikimedia.org/r/286506 (https://phabricator.wikimedia.org/T96848) [19:52:19] (03PS1) 10BBlack: Remove --automatic-dbgsym on dyn mod dh_strip [software/nginx] (wmf-1.10.0-1) - 10https://gerrit.wikimedia.org/r/286507 (https://phabricator.wikimedia.org/T96848) [19:52:21] (03PS1) 10BBlack: nginx (1.10.0-1+wmf1) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.10.0-1) - 
10https://gerrit.wikimedia.org/r/286508 (https://phabricator.wikimedia.org/T96848) [19:54:01] gerrit, you're annoying :P [19:54:18] 06Operations, 10Traffic, 10Wikimedia-IRC-RC-Server, 07HTTPS, and 2 others: Remove the "HTTPS to HTTP" url filter in the IRC feed - https://phabricator.wikimedia.org/T122933#2257604 (10Dzahn) [19:54:20] 06Operations, 10Wikimedia-IRC-RC-Server, 13Patch-For-Review: Migrate argon (irc.wikimedia.org) to Jessie - https://phabricator.wikimedia.org/T123729#2257603 (10Dzahn) [19:54:30] 06Operations, 10Wikimedia-IRC-RC-Server, 07IPv6, 13Patch-For-Review: enable IPv6 on irc.wikimedia.org - https://phabricator.wikimedia.org/T105422#2257608 (10Dzahn) [19:54:32] 06Operations, 10Wikimedia-IRC-RC-Server, 13Patch-For-Review: Migrate argon (irc.wikimedia.org) to Jessie - https://phabricator.wikimedia.org/T123729#1936574 (10Dzahn) [19:54:44] (03PS1) 10Dzahn: switch irc.wm.org from argon to kraz [dns] - 10https://gerrit.wikimedia.org/r/286509 (https://phabricator.wikimedia.org/T123729) [19:54:47] (03Abandoned) 10BBlack: Import nginx.org 1.9.14-1.9.15 diffs [software/nginx] (wmf-1.9.14-1) - 10https://gerrit.wikimedia.org/r/284892 (owner: 10BBlack) [19:54:53] (03Abandoned) 10BBlack: nginx (1.9.15-1+wmf1) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.9.14-1) - 10https://gerrit.wikimedia.org/r/284077 (https://phabricator.wikimedia.org/T96848) (owner: 10BBlack) [19:55:02] (03Abandoned) 10BBlack: Remove --automatic-dbgsym on dyn mod dh_strip [software/nginx] (wmf-1.9.14-1) - 10https://gerrit.wikimedia.org/r/284920 (https://phabricator.wikimedia.org/T96848) (owner: 10BBlack) [19:55:07] (03Abandoned) 10BBlack: multicert + libssl1.0.2 patches for 1.9.15 [software/nginx] (wmf-1.9.14-1) - 10https://gerrit.wikimedia.org/r/284075 (https://phabricator.wikimedia.org/T96848) (owner: 10BBlack) [19:56:38] Dereckson: yea, so getting the wmf-config change today would be good timing, in SWAT would be great [19:57:05] i'm just not sure about ircecho working 
[19:57:12] trying to find out more..hrmm [19:57:26] or if anyone can help, the server is up [19:57:31] (03CR) 10BBlack: [C: 032 V: 032] multicert + libssl1.0.2 patches for 1.10.0 [software/nginx] (wmf-1.10.0-1) - 10https://gerrit.wikimedia.org/r/286506 (https://phabricator.wikimedia.org/T96848) (owner: 10BBlack) [19:57:43] (03CR) 10BBlack: [C: 032 V: 032] Remove --automatic-dbgsym on dyn mod dh_strip [software/nginx] (wmf-1.10.0-1) - 10https://gerrit.wikimedia.org/r/286507 (https://phabricator.wikimedia.org/T96848) (owner: 10BBlack) [19:57:54] (03CR) 10BBlack: [C: 032 V: 032] nginx (1.10.0-1+wmf1) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.10.0-1) - 10https://gerrit.wikimedia.org/r/286508 (https://phabricator.wikimedia.org/T96848) (owner: 10BBlack) [19:59:58] mutante: ok [20:00:04] gwicke cscott arlolra subbu bearND: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160502T2000). Please do the needful. [20:00:48] Dereckson: also it has an "can be merged anytime" comment [20:01:19] (03PS1) 10Rush: labstore1004/1005 host entries for DHCP [puppet] - 10https://gerrit.wikimedia.org/r/286512 (https://phabricator.wikimedia.org/T133397) [20:02:55] Robh: you can put restbase2004 back in service thanks [20:03:09] 06Operations, 10ops-codfw: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2257629 (10Papaul) I compared restbase2004 and restbase2008, all the cables are connected the same way. The Bios setting are also the same. the issue was with the Smart Array P440ar configuration .You nee... 
[20:05:55] !log starting deploy of parsoid version 0a26f3a4 [20:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:07:27] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, and 4 others: "Elastica: missing curl_init_pooled method" due to mwscript job running with PHP 5 on terbium - https://phabricator.wikimedia.org/T132751#2257632 (10EBernhardson) a:03EBernhardson [20:09:23] !log synced code + restarted parsoid on wtp1001 as canary [20:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:11:20] (03CR) 10Rush: [C: 032] labstore1004/1005 host entries for DHCP [puppet] - 10https://gerrit.wikimedia.org/r/286512 (https://phabricator.wikimedia.org/T133397) (owner: 10Rush) [20:11:36] (03PS1) 10Dzahn: Revert "DHCP: changing the install to trusty to test since jessie is not detecting the disks Bug: T132976" [puppet] - 10https://gerrit.wikimedia.org/r/286515 [20:13:58] (03PS3) 10Gehel: Remove multicast from Elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/286410 (https://phabricator.wikimedia.org/T110236) [20:14:05] (03PS2) 10Dzahn: Revert "DHCP: changing the install to trusty to test since jessie is not detecting the disks Bug: T132976" [puppet] - 10https://gerrit.wikimedia.org/r/286515 [20:14:28] (03CR) 10Dzahn: [C: 032] Revert "DHCP: changing the install to trusty to test since jessie is not detecting the disks Bug: T132976" [puppet] - 10https://gerrit.wikimedia.org/r/286515 (owner: 10Dzahn) [20:14:47] (03PS1) 10Rush: labstore1004 dhcp typo [puppet] - 10https://gerrit.wikimedia.org/r/286516 (https://phabricator.wikimedia.org/T133397) [20:14:53] RECOVERY - puppet last run on copper is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [20:15:15] !log finished deploying parsoid version 0a26f3a4 [20:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:16:56] (03PS2) 10Rush: labstore1004 dhcp typo 
[puppet] - 10https://gerrit.wikimedia.org/r/286516 (https://phabricator.wikimedia.org/T133397) [20:17:05] (03CR) 10Rush: [C: 032 V: 032] labstore1004 dhcp typo [puppet] - 10https://gerrit.wikimedia.org/r/286516 (https://phabricator.wikimedia.org/T133397) (owner: 10Rush) [20:17:33] (03PS4) 10Gehel: Remove multicast from Elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/286410 (https://phabricator.wikimedia.org/T110236) [20:20:00] (03CR) 10Gehel: [C: 032] Remove multicast from Elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/286410 (https://phabricator.wikimedia.org/T110236) (owner: 10Gehel) [20:20:36] mutante: what is happening with ircecho ? :( [20:21:10] hashar: it's not broken on the active server.. but we want to see it work on the new server [20:21:14] and so far we dont [20:21:22] even though it should get some data from test2.wp [20:21:40] if we could confirm it works, we could just switch over today [20:21:45] !log starting rolling restart of elasticsearch codfw cluster to disable multicast (T110236) [20:21:45] I would blame firewalls ? [20:21:45] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [20:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:22:09] hashar: that is entirely possible.. ! checking [20:22:17] mutante: are you sure the mw app servers send to the proper IP ? [20:22:37] then hook on the destination IP, disable ferm/iptables (maybe) tcpdump to verify packets come just fine [20:22:47] Dereckson, mutante, what's up? 
[20:22:49] (sorry that is the lame low level idea debugging I have sorry) [20:23:37] !log restarting elasticsearch server elastic2001.codfw.wmnet (T110236) [20:23:38] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [20:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:23:46] hashar: ori switched a canary server over [20:23:47] Krenair: Daniel would like to know how to debug the echo bot when irc server is migrated [20:23:55] a couple days ago [20:24:33] maybe it is no more switched [20:24:33] mutante, did you modify udpmxircecho to print when it sends messages? [20:24:52] i removed the iptables rules to confirm that theory [20:25:04] no, i did not modify udpmxircecho [20:25:10] iirc mediawiki just UDP send to a listening daemon, so you could echo |nc [20:26:42] (03PS1) 10Rillke: Enable UploadsLink at Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286517 (https://phabricator.wikimedia.org/T130018) [20:26:43] papaul: sorry was afk for lunch [20:26:59] did you find any differences between restvase2004 and 2007 (or 8 not sure which you were using) [20:27:19] you can power up restbase2004 normally [20:27:22] Krenair: do you know anything about the time today? [20:27:23] and it should return to normal use [20:27:43] robh: no problem [20:27:49] (03CR) 10Jforrester: [C: 031] Enable UploadsLink at Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286517 (https://phabricator.wikimedia.org/T130018) (owner: 10Rillke) [20:27:56] robh: i already did that [20:28:01] mutante, the time? [20:28:07] papaul: oh, its not responsive to ssh [20:28:14] i'll drill into ilom [20:28:18] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2257672 (10BBlack) I've built new `1.10.0-1+wmf1` packages and uploaded those to carbon and upgraded cp1008. 
These have no true code changes from the last 1.9.15-1+wmf1 test package,... [20:28:34] Krenair: for the migration of irc servetr [20:28:43] i see May 2nd [20:28:51] where do you see that? [20:28:57] robh: working for me [20:29:19] papaul: you can ssh into restvase2004.codfw.wmnet? [20:29:24] sorry, restbase2004.codfw.wmnet [20:29:27] (03CR) 10Rillke: "Scheduled for 2016-05-04" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286517 (https://phabricator.wikimedia.org/T130018) (owner: 10Rillke) [20:29:44] robh: no mgmt [20:29:51] Krenair: https://phabricator.wikimedia.org/T123729#2216681 [20:29:58] ok, i didnt say mgmt didnt work ;] ssh into production doesnt work [20:30:01] even though its posted the os in mgmt [20:30:01] Virtual Serial Port Active: COM2 [20:30:01] Starting virtual serial port. [20:30:01] Press 'ESC (' to return to the CLI Session. [20:30:02] Debian GNU/Linux 8 restbase2004 ttyS1 [20:30:04] restbase2004 login: [20:30:06] right [20:30:10] i never said mgmt wasnt working ;] [20:30:26] so was there any changes on the new systems made to match 2004 or was it already identical? [20:30:26] Krenair: we currently need the "verify that it works" step [20:30:44] mutante, you got it working? [20:30:55] Krenair: no.. lol [20:30:58] Last time we discussed this we verified that it was *not* working [20:31:03] we are really starting to go in circles, hehe [20:31:09] We can no longer stick to that plan because you needed to have done this *before* today [20:31:23] that's why i'm asking when the migration is happening [20:31:35] Not today and that post is no longer useful [20:31:35] i'm not planning to modify the bot , fwiw [20:31:44] Then how do you plan to investigate the problem with it? 
[20:31:56] what do you mean "you needed to have that done before today" [20:31:59] i never set that date [20:32:05] neither did I [20:32:06] lol [20:32:23] i'm planning to investigate the problem by asking here , right now [20:32:52] ok, odd, restbase2004 has serial and os, but no networking [20:32:54] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [20:33:04] I don't have access to either the old or the new server, do you want me to send data to it, or...? [20:33:07] papaul: im rebooting it but i suspect that perhaps a network cable isnt seated? [20:33:15] can you please check? [20:33:25] What can I do to help? [20:33:48] what can I do to help? [20:33:58] robh: already left dc [20:34:14] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5427463 keys - replication_delay is 0 [20:34:21] well, restbase in codfw isnt online so thats ok, but next time we take offline a server [20:34:26] before you leave we need to bring it back [20:34:28] i think i'll create a tcpdump file then and paste it somewhere [20:34:31] so while it posted and you saw it load [20:34:38] next time we should test the ssh to the os [20:34:47] im checking the network switch now to see if it sees 2004 [20:34:55] (i realize i disappeared for lunch in the middle of your work too!) [20:35:06] so its not a big deal, as i said, restbase in codfw isnt online yet [20:35:18] (so you dont need to drive back until tomorrow for normal work =) [20:35:20] will check that once back tomorrow [20:35:39] funny how a date is being set and then it's my problem to fix it.. when nobody ever wants to touch the bot [20:35:46] of course [20:35:55] ge-5/0/13 up down restbase2004 [20:35:59] yep, the cable isnt seated [20:36:08] Nobody else can touch it [20:36:30] papaul: So, back to the original issue. How did troubleshooting go and did you find anything to change in restbase200[789]?
[20:36:50] Krenair: that is incorrect [20:37:08] see ticket [20:37:13] non-ops can do it? [20:37:17] papaul: https://phabricator.wikimedia.org/T132976? [20:37:32] ahh, i see [20:37:47] cool, so you did for all of the new systems and we are good for install? [20:38:03] robh: cassandra hosts that are down for more than the hint window (three hours in our setup) will be inconsistent & will need a manual repair before re-joining the cluster [20:38:18] gwicke: eww, uhh, shit [20:38:21] sorry =[ [20:38:32] I think we are outside that window already [20:38:38] it's not the end of the world, just good to avoid if possible [20:38:39] let alone even if papaul drives back down immediately [20:38:45] duly noted for the future! [20:39:20] we might be able to expand the hint window a bit further in the future [20:39:25] we definitely should [20:39:27] after moving to 2.2 [20:39:49] (03PS5) 10Nuria: Read values inbound in X-Analytics header (pageview and preview) [puppet] - 10https://gerrit.wikimedia.org/r/285051 (https://phabricator.wikimedia.org/T133204) [20:40:12] if we lose e.g. a network row for some reason, it will probably take us over 3 hours to fix it (esp. if it's e.g. on a weekends and we need hands) [20:40:40] that would be 1/4th of the cassandra nodes if we distributed things evenly [20:40:43] 06Operations, 10ops-codfw, 10RESTBase: plug in restbase2004 network cable - https://phabricator.wikimedia.org/T134197#2257691 (10RobH) [20:40:47] so it wouldn't be pretty :) [20:40:55] 3.x has some more improvements that should enable > 24 hour hint windows, once we switch to it [20:41:03] good to know! 
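The "hint window" and "gc_grace" values gwicke cites map onto two Cassandra knobs: the cluster-wide `max_hint_window_in_ms` in cassandra.yaml and the per-table `gc_grace_seconds`. A sketch of what the settings discussed above would look like (values are the ones quoted in this conversation, not a copy of the production config):

```yaml
# cassandra.yaml (illustrative, matching the numbers discussed above)
# A node down longer than this stops accumulating hints on its peers,
# so it needs a manual repair before rejoining -- the "3 hour window".
max_hint_window_in_ms: 10800000    # 3 hours
```

The 2-day gc_grace mentioned afterwards would be the table-level `gc_grace_seconds: 172800`; a repair must complete within it, or deleted items can resurrect.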
[20:41:24] yeah you have to assume for any one of our onsites to get to any single of our datacenters it is a 60 minute transit [20:41:31] that is 1/3rd of the downtime window, heh [20:41:35] in any case, in an emergency we *can* defer the repair, at the cost of some minor inconsistencies [20:42:02] it's nothing as drastic as losing a node [20:42:32] just more expensive to figure out what differs, rather than replaying the recorded changes [20:42:51] also it's unlikely that it will happen but it can happen: we may lose all eqiad-codfw connectivity for > 3 hours [20:43:06] so that would also be half of the nodes I suppose :) [20:43:35] but yeah, low probability and if there is a fix in the horizon then I guess all is good for now [20:44:27] yup [20:46:48] it's still eventually consistent as long as the repair is run before gc_grace (2 days in our case) elapses [20:47:43] and even after that, the only inconsistency that's possible is that deleted items could come back [20:50:10] Krenair: ok, i have something you could help me with, switch test2.wp to send data to the new server [20:50:26] apparently ori's change was just temp [20:51:10] yeah that wasn't going to stick forever, will do it but it has to be based on X-Wikimedia-Debug instead of per-site [20:51:39] opens the wikitech page about X-Debug [20:51:48] (03PS1) 10Rillke: Commons: Restrict changetags userright [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286522 (https://phabricator.wikimedia.org/T134196) [20:51:55] there's chrome and firefox plugins for it [20:52:20] installs bd808's extension [20:53:24] * bd808 backdoors mutante's browser [20:53:35] Oh my [20:54:00] the source is on github. it's pretty boring [20:54:22] * mutante trusted the "has been preliminary reviewed" by Google ::p [20:54:56] i got mw1017, mw1099, 2017 and 2099 to pick from [20:55:01] mw1017 [20:55:04] 'k [20:56:02] (03CR) 10Jforrester: "Isn't this already the case for all WMF wikis?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/286522 (https://phabricator.wikimedia.org/T134196) (owner: 10Rillke) [20:56:09] i edited, i saw in tcpdump [20:56:15] how mw1017 sends data to kraz [20:56:26] but it doesnt talk [20:56:39] that's the same we knew before [20:56:58] so the things that hashar said are already checked [20:57:00] I set it back to double check it's working with the existing server [20:57:04] but I'm not convinced [20:57:09] i just tested it before , it worked [20:57:43] 13:53 <@rc-pmtpa> [[Foo24]] ! https://test2.wikipedia.org/w/index.php?diff=283085&oldid=283084&rcid=451380 * 50.0.125.138 * (-8) [20:57:48] old server ^ [20:58:25] yeah thing is my edit hasn't gone through [20:58:44] it was still sent to new server it looks [20:58:52] and there the bot is a blackhole [20:59:05] ? [20:59:19] 20:56:22.302377 IP mw1017.eqiad.wmnet.48799 > kraz.wikimedia.org.9390: UDP, length 171 [20:59:22] 20:58:03.393310 IP mw1017.eqiad.wmnet.38594 > kraz.wikimedia.org.9390: UDP, length 156 [20:59:26] that second one.. was your attempt [20:59:29] i think [21:00:02] how to send data into it from localhost? 
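The X-Wikimedia-Debug routing discussed above (picking mw1017 out of mw1017/mw1099/mw2017/mw2099) is just an HTTP request header; the browser plugins set it for you, but a script can too. A sketch, assuming a `backend=<host>` value syntax — the header name is from this log, the exact value format the routing layer expects is an assumption here:

```python
import urllib.request

# Pin a request to a specific debug appserver via X-Wikimedia-Debug.
# The value syntax below is assumed, not confirmed by this log.
req = urllib.request.Request(
    "https://test2.wikipedia.org/wiki/Special:BlankPage",
    headers={"X-Wikimedia-Debug": "backend=mw1017.eqiad.wmnet"},
)

# urllib normalizes stored header keys to capitalized form.
value = req.get_header("X-wikimedia-debug")
```

Opening the request would then hit the chosen backend instead of whatever the load balancer picks, which is what made the per-site switch unnecessary.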
[21:00:32] tries [21:00:56] there was one more attempt right now [21:01:03] My edits to enwiki and testwiki go through, but not test2wiki [21:01:12] ooh..hmm [21:01:21] gonna consider that a separate bug and put mw1017 on the new irc server [21:01:24] i kind of like that [21:01:34] at least that means something special about test2 [21:01:38] that is tied to eqiad i guess [21:01:42] (03CR) 10Rillke: "Apparently not: https://commons.wikimedia.org/w/index.php?title=Special%3ALog&type=tag&user=&page=&year=&month=-1&tagfilter=&hide_thanks_l" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286522 (https://phabricator.wikimedia.org/T134196) (owner: 10Rillke) [21:01:50] (03PS6) 10BBlack: Read values inbound in X-Analytics header (pageview and preview) [puppet] - 10https://gerrit.wikimedia.org/r/285051 (https://phabricator.wikimedia.org/T133204) (owner: 10Nuria) [21:01:59] (03CR) 10BBlack: [C: 032 V: 032] Read values inbound in X-Analytics header (pageview and preview) [puppet] - 10https://gerrit.wikimedia.org/r/285051 (https://phabricator.wikimedia.org/T133204) (owner: 10Nuria) [21:03:57] hm... traffic still seems to be going to the old one [21:04:20] what does the format look like.. if i wanted to manually send it via nc? [21:05:37] mutante: channeltext [21:06:04] it's weird, I'm definitely getting served by mw1017 [21:06:18] i tried that almost like that, just with #channel.. hmm [21:06:23] tried again [21:06:45] it auto join the channel [21:06:52] so you can emit to "#mutante" :D [21:06:58] modules/mw_rc_irc/templates/udpmxircecho.py.erb  [21:07:29] ok, i'm trying that but nothing happens [21:07:38] :( [21:09:37] Okay, wtf: [21:09:39] krenair@mw1017:/srv/mediawiki$ mwscript eval.php test2wiki [21:09:39] > echo $wmfAllServices['eqiad']['irc'] [21:09:39] 208.80.153.44 [21:09:39] > echo $wmgRC2UDPAddress; [21:09:39] 208.80.154.160 [21:10:14] the first part is wrong [21:10:32] well.. 
one or the other [21:11:09] just because it's not "eqiad" [21:11:14] either way [21:12:36] right, /tmp/mw-cache-1.27.0-wmf.22/conf-test2wiki exists there [21:13:45] old version is cached there [21:14:33] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 620 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5434136 keys - replication_delay is 620 [21:14:36] i should still be able to make that bot output something [21:14:44] from the server itself [21:16:01] yeah [21:18:18] mutante, I have it working with nc [21:18:23] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5432244 keys - replication_delay is 0 [21:18:35] echo -n "#testing asd" | nc -4u -w1 kraz.wikimedia.org 9390 [21:19:44] oh, it's the IPv4 vs IPv6 then [21:19:53] invalid option -- '4' [21:20:04] i just gave kraz a v6 IP earlier [21:20:18] MediaWiki just has the IPv4 address [21:21:41] !log starting OCG deploy (a little late) [21:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:22:09] i still cant confirm the same thing :p [21:22:14] wtf [21:22:29] you can't get nc working? [21:22:51] no, but i see your output [21:23:31] okay, instead of piping to nc, pipe it to xxd and post the output here [21:25:51] I want to make sure you got an actual tab instead of 8 space :p [21:25:56] spaces [21:26:33] !log updated OCG to version b775e612520f9cd4acaea42226bcf34df07439f7 [21:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:26:47] mutante? [21:27:48] (03CR) 10Jforrester: "How odd. I thought we'd intentionally removed it from regular users…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286522 (https://phabricator.wikimedia.org/T134196) (owner: 10Rillke) [21:28:34] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
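The payload format being hand-tested with `nc` above is `#channel<TAB>text`, which is easy to get subtly wrong in a shell (spaces instead of a tab, a stray newline from `echo`, hence the `xxd` check). A sketch that builds the datagram unambiguously — the tab-separated format is inferred from udpmxircecho's behaviour in this log, the function name is illustrative:

```python
def build_rc_payload(channel, text):
    # udpmxircecho splits on the first TAB: "#channel\ttext".
    # Note: no trailing newline -- a stray LF is exactly the kind of
    # thing that makes a hand-rolled nc test silently fail.
    if not channel.startswith("#"):
        channel = "#" + channel
    return ("%s\t%s" % (channel, text)).encode("utf-8")

payload = build_rc_payload("testing", "asd")
```

Piping this through `xxd` would show the `09` byte (TAB) that the conversation above was checking for.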
[21:28:34] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:28:43] PROBLEM - MariaDB Slave Lag: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:28:43] PROBLEM - MariaDB Slave IO: x1 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:28:44] PROBLEM - MariaDB Slave IO: m2 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:28:44] PROBLEM - MariaDB Slave SQL: s2 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:28:45] PROBLEM - MariaDB Slave Lag: m3 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:28:54] PROBLEM - MariaDB Slave SQL: m2 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:28:54] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:28:54] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:28:55] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:28:55] PROBLEM - MariaDB Slave Lag: m2 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:28:55] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:28:56] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:03] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:03] PROBLEM - MariaDB Slave SQL: s3 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:03] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:03] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:29:03] PROBLEM - MariaDB Slave Lag: s1 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:05] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:14] PROBLEM - MariaDB Slave SQL: m3 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:14] PROBLEM - MariaDB Slave SQL: s4 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:14] PROBLEM - MariaDB Slave IO: s3 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:14] PROBLEM - MariaDB Slave SQL: s5 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:14] PROBLEM - MariaDB Slave IO: s7 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:14] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:14] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:23] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:24] PROBLEM - MariaDB Slave SQL: s6 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:24] PROBLEM - MariaDB Slave IO: s1 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:24] PROBLEM - MariaDB Slave IO: s5 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:24] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:24] PROBLEM - MariaDB Slave Lag: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:24] PROBLEM - MariaDB Slave Lag: s2 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:25] volans_, [21:29:34] PROBLEM - MariaDB Slave IO: m3 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:29:34] PROBLEM - MariaDB Slave SQL: s7 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:43] PROBLEM - MariaDB Slave Lag: s5 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:45] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:54] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:54] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:55] PROBLEM - MariaDB Slave Lag: s3 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:55] PROBLEM - MariaDB Slave Lag: x1 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:29:55] PROBLEM - MariaDB Slave IO: s2 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:30:03] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:30:03] PROBLEM - MariaDB Slave Lag: s6 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:30:03] PROBLEM - MariaDB Slave IO: s4 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:30:04] PROBLEM - MariaDB Slave Lag: s4 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:30:07] um, did someone break icinga? fw issue? [21:30:13] oO [21:30:14] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:30:14] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:30:15] PROBLEM - MariaDB Slave SQL: x1 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:30:15] PROBLEM - MariaDB Slave Lag: s7 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:30:15] PROBLEM - MariaDB Slave IO: s6 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:30:24] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [21:30:25] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [21:30:33] PROBLEM - MariaDB Slave SQL: s1 on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:30:34] RECOVERY - MariaDB Slave Lag: s7 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 86520.23 seconds [21:30:44] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [21:30:44] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [21:30:54] RECOVERY - MariaDB Slave Lag: x1 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 523.94 seconds [21:30:54] RECOVERY - MariaDB Slave Lag: m2 on dbstore1001 is OK: OK slave_sql_lag not a slave [21:30:54] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [21:30:55] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [21:30:55] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [21:30:55] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [21:31:04] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [21:31:14] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [21:31:14] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [21:31:15] RECOVERY - MariaDB Slave Lag: s1 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 86808.94 seconds [21:31:22] Krenair: sorry, back .. 
so 0000000: 2374 6573 7469 6e67 0968 6961 6c65 780a #testing.hialex. [21:31:23] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [21:31:24] RECOVERY - MariaDB Slave Lag: s5 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 86650.74 seconds [21:31:44] RECOVERY - MariaDB Slave Lag: s2 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 84450.34 seconds [21:31:44] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [21:31:44] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [21:31:45] yes, space vs. tabs.. but it doesnt change things [21:31:50] Replication lag: 86650.74 seconds ???? [21:32:27] RECOVERY - MariaDB Slave IO: s3 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes [21:32:27] RECOVERY - MariaDB Slave SQL: m3 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [21:32:27] RECOVERY - MariaDB Slave IO: s7 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes [21:32:27] RECOVERY - MariaDB Slave SQL: s5 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [21:32:27] RECOVERY - MariaDB Slave SQL: s4 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [21:32:27] RECOVERY - MariaDB Slave Lag: s1 on dbstore2001 is OK: OK slave_sql_lag Slave_IO_Running: Yes, Slave_SQL_Running: No, (no error: intentional) [21:32:28] RECOVERY - MariaDB Slave IO: s5 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes [21:32:28] RECOVERY - MariaDB Slave IO: s1 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes [21:32:29] RECOVERY - MariaDB Slave SQL: s6 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [21:32:35] mutante, that should be fine [21:32:38] RECOVERY - MariaDB Slave Lag: s3 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 83693.00 seconds 
[21:32:38] RECOVERY - MariaDB Slave SQL: s7 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [21:32:38] RECOVERY - MariaDB Slave IO: m3 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes [21:32:38] RECOVERY - MariaDB Slave Lag: s2 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 84493.55 seconds [21:32:44] Luke081515: that is a day and 250 seconds [21:32:58] RECOVERY - MariaDB Slave Lag: s6 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 86766.00 seconds [21:33:06] RECOVERY - MariaDB Slave Lag: x1 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 23.87 seconds [21:33:06] Luke081515: some databases are replicated with a 24 hours delay lag [21:33:07] RECOVERY - MariaDB Slave IO: s2 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes [21:33:08] RECOVERY - MariaDB Slave Lag: s7 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 86697.00 seconds [21:33:15] hm, ok [21:33:28] RECOVERY - MariaDB Slave SQL: m2 on dbstore2001 is OK: OK slave_sql_state not a slave [21:33:29] Luke081515: so if eventually we screw up a master , we have 24 hours to get that slave out of sync and recover data from it [21:33:38] mutante, and your nc command has -4 or uses the IPv4 address? [21:33:44] for example one mistakenly removing every title of enwiki :-D [21:33:47] RECOVERY - MariaDB Slave Lag: s3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 86490.02 seconds [21:33:47] RECOVERY - MariaDB Slave SQL: s1 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [21:33:47] RECOVERY - MariaDB Slave IO: m2 on dbstore2001 is OK: OK slave_io_state not a slave [21:33:56] Krenair: -4 is not an option of my nc, but yea, i also tried just the IP [21:34:05] what server are you trying this from? 
[21:34:06] RECOVERY - MariaDB Slave SQL: s2 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [21:34:11] cat /tmp/lala | nc -u -w1 208.80.153.44 9390 [21:34:17] RECOVERY - MariaDB Slave Lag: s4 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 86698.00 seconds [21:34:17] RECOVERY - MariaDB Slave Lag: s5 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 86809.00 seconds [21:34:21] that file has the literal tab [21:34:26] RECOVERY - MariaDB Slave IO: s4 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes [21:34:26] RECOVERY - MariaDB Slave SQL: s3 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [21:34:27] on mw1017? [21:34:32] cat: /tmp/lala: No such file or directory [21:34:33] no, on kraz itself [21:34:37] RECOVERY - MariaDB Slave SQL: x1 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [21:34:37] RECOVERY - MariaDB Slave IO: s6 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes [21:34:42] not even from local works [21:34:54] I don't know if you can do it on localhost [21:34:56] RECOVERY - MariaDB Slave IO: x1 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes [21:35:07] RECOVERY - MariaDB Slave Lag: m2 on dbstore2001 is OK: OK slave_sql_lag not a slave [21:35:22] try from mw1017 [21:35:26] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [21:35:26] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [21:35:26] RECOVERY - MariaDB Slave Lag: m3 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 78252.02 seconds [21:36:48] Krenair: doesnt work :/ [21:39:30] Krenair: but you also did not run this on mw1017 ? 
[21:40:42] i can do this on mw1017 or on mw1099, i always see it sending data over to kraz [21:40:51] but that bot just doesnt want to talk [21:41:01] why does it listen to you :) [21:43:11] maybe we should use a script and just scp it :p [21:43:21] if it still only works for you i give up :) [21:43:24] okay [21:43:27] you know you were using cat? [21:43:35] cat ends the output with LF [21:43:41] krenair@mw1017:/srv/mediawiki$ cat /tmp/lala | tr "\n" " " | nc -u -w1 208.80.153.44 9390 [21:43:42] this worked [21:44:06] it does :) [21:44:07] I think the file just contains that LF actually [21:44:17] but either way it's the LF that seems to break things [21:44:26] oh gee. and i wanted to use a file to _reduce_ the change of a mess up [21:44:31] :) [21:44:34] yes, ack. thanks [21:44:42] ok, now let's see with hostname [21:44:55] should work if you use -4 [21:45:06] although MW just has the IPv4, so... [21:46:00] yea, so IPv6 doesnt work.. because.. looking [21:46:52] it does this: [21:46:53] udpsock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM) [21:47:17] AF_INET is IPv4 [21:48:18] could also be AF_INET6 for IPv6, or AF_UNIX [21:50:07] ok, and the ircd should be just fine like it is [21:50:19] let's go back to [21:50:26] why we dont see output from test2 [21:50:32] while we can now manually make it work [21:52:43] you see the traffic going to kraz? 
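The `AF_INET` line quoted from udpmxircecho above is the whole IPv6 story: one IPv4-only socket, so giving kraz a v6 address changes nothing until the daemon also opens an `AF_INET6` socket. A sketch of the distinction (bound to loopback with port 0 so it runs anywhere; the real daemon listens on 9390):

```python
import socket

# What udpmxircecho does today: a single IPv4-only UDP socket.
v4 = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
v4.bind(("127.0.0.1", 0))

# Hearing IPv6 senders would need a second, AF_INET6 socket.
v6 = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
v6.bind(("::1", 0))

bound = (v4.getsockname()[0], v6.getsockname()[0])
v4.close()
v6.close()
```

This is also why MediaWiki's config carrying only the IPv4 address was harmless: v6 datagrams would have gone nowhere anyway.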
[21:53:15] yes [21:54:09] yes, double confirmed [21:54:15] and yet it only appears on argon's IRC [21:54:23] yes [21:54:40] 21:53:54.530764 IP mw1017.eqiad.wmnet.50189 > kraz.wikimedia.org.9390: UDP, length 162 [21:55:17] no, it appears nowhere right now [21:55:32] I see my test.wikipedia.org edits on argon, not kraz [21:55:40] i edit test2.wp not test.wp [21:55:47] oh, it turned off [21:55:56] the debug plugin I mean [21:56:06] hah,yea, i had the same [21:56:16] but after i turned it back on, i see it trying to send to kraz [21:56:17] It shouldn't matter whether we do this on test2 or test [21:56:18] and silence [21:56:33] is there any traffic between argon and kraz? [21:56:45] kraz -> argon [21:57:55] no, argon sees nothing from kraz [21:58:58] and vice cersa [21:59:00] versa [21:59:24] Krenair: should i just try to create the same VM in eqiad ?:p [21:59:51] but .. public IP / [22:00:40] no [22:00:43] what about mw1017 -> argon? [22:02:43] no, i see nothing [22:05:07] Well my edits on testwiki via mw1017 somehow end up on argon's IRC feed, not kraz's [22:05:18] now mw1017 asked DNS servers for argon's PTR [22:05:33] (03PS1) 10Catrope: Show the cross-wiki notifications beta feature invitation on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286534 [22:06:31] 07Blocked-on-Operations, 06Operations, 06Increasing-content-coverage, 06Research-and-Data-Backlog: Backport python3-sklearn and python3-sklearn-lib from sid - https://phabricator.wikimedia.org/T133362#2257963 (10leila) [22:07:04] (03PS2) 10Catrope: Show the cross-wiki notifications beta feature invitation on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286534 (https://phabricator.wikimedia.org/T117669) [22:08:29] Krenair: trying something , removed the v6 IP [22:10:34] no difference. 
edits either go to old server (when x-debug-header is off) or they go nowhere [22:10:34] oh, I just remembered why testwiki still gets the old IP [22:10:37] !log restbase200[7-9]- signing puppet certs, salt-key, initial run [22:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:12:17] Krenair: btw, could we just tell all appservers to send to both ? [22:12:28] yes [22:12:43] that should not hurt i guess [22:12:47] let's do that [22:12:51] ok [22:13:04] i'll make a patch [22:13:12] cool, thanks ! [22:13:14] brb [22:13:42] I don't know what that character is you use, I just get a block with 007F [22:15:10] where? [22:15:37] cool, thanks ! [22:15:39] but .. public IP / [22:15:45] (03PS1) 10Alex Monk: Also send IRC stream to kraz [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286537 [22:16:16] it was supposed to be this smiley: :/ [22:16:21] and i missed the : [22:16:29] so it's just a slash [22:17:08] (03CR) 10Alex Monk: [C: 032] Also send IRC stream to kraz [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286537 (owner: 10Alex Monk) [22:17:34] (03Merged) 10jenkins-bot: Also send IRC stream to kraz [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286537 (owner: 10Alex Monk) [22:18:41] !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/286537/ (duration: 00m 42s) [22:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:18:50] mutante, so kraz should now get plenty of traffic [22:20:00] Interesting thing is that udpmxircecho creates the channels correctly [22:20:04] but messages to them never show up [22:20:07] Krenair: it does! [22:20:08] confirmed [22:20:14] yes [22:20:42] oh, lol [22:20:46] May 2 22:20:38 kraz udpmxircecho.py[30092]: Carriage returns not allowed in privmsg(text) [22:21:01] is that getting spammed? 
[22:21:04] yea :) [22:21:32] Well, it is a different version of SingleServerIRCBot [22:21:40] i was about to say that same thing, yea [22:21:48] that will be it [22:22:31] | remove_carriage_returns.sh [22:23:34] confirmed, old server accepts messages with linebreaks, new one does not [22:23:47] well.. first of all.. glad we found it :) [22:23:57] do we need an .rstrip("\n") ? [22:24:03] (03PS1) 10Ppchelko: Change-Prop: Enable summary and definition updates. [puppet] - 10https://gerrit.wikimedia.org/r/286539 [22:24:10] ideally just if it's kraz [22:24:10] or \r I guess [22:24:25] then shut down the other one and remove the special case [22:24:56] do we change MW output format or what the bot does with the stuff it gets [22:25:05] I don't think we need to special case it [22:25:15] it should work fine on both servers right? [22:25:20] MW output shouldn't be changed [22:25:36] does old server accept it without linebreaks? [22:26:38] yes [22:27:32] sp = sp.replace('\r', '') [22:27:35] tries [22:28:46] Heads up: codfw elasticsearch cluster is not behaving well. It looks similar to last week's issue. We have a bit more info than last week, but not much yet. User traffic does not seem to be affected at the moment. [22:29:21] 06Operations, 10ArchCom-RfC, 10Architecture, 10Incident-20150423-Commons, and 5 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#2258065 (10Danny_B) [22:30:50] you restarted kraz, mutante? [22:32:07] Krenair: just the IRC bot, not the server [22:32:21] I got this from ircd: * Server Terminating. Received SIGTERM [22:32:39] oh, yes, it was me [22:33:02] killed script by user [22:33:12] now I get * Connection failed. Error: Network is unreachable [22:34:12] Krenair: ircd is running again [22:34:42] is the bot running?
[22:35:10] no [22:36:32] it's a list, cant just .replace [22:37:23] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2258091 (10Papaul) [22:37:23] oh, yeah [22:37:26] don't do it on sp [22:37:38] do it on text, after the lstrip [22:37:50] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2215958 (10Papaul) a:05Papaul>03fgiunchedi Installation complete [22:38:29] 06Operations, 10Monitoring, 07RfC: [RFC] Alert about *when* partitions will run out of space, not a percentage/absolute number - https://phabricator.wikimedia.org/T126158#2258097 (10Danny_B) [22:39:04] Krenair: it still spams Carriage returns not allowed [22:39:44] is it really \r ? [22:41:17] Krenair: it works !:) [22:41:24] replaced \r and \n [22:41:40] but you know what it also spams all day [22:41:44] 'ascii' codec can't decode byte 0xd0 in position 209: ordinal not in range(128) [22:41:45] (03PS1) 10EBernhardson: cirrus: Don't auto-create frozen index [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286542 [22:41:51] probably like that since forever :) [22:42:34] if '\n' in string: [22:42:34] raise InvalidCharacters( [22:42:34] "Carriage returns not allowed in privmsg(text)") [22:42:37] helpful error message. [22:42:59] see all the output ?:) [22:43:06] join #en.wikipedia [22:43:09] yep [22:43:14] :) [22:43:38] that ascii thing might be a blocker though [22:44:04] isnt it the same on old? [22:44:26] no idea! [22:44:36] what does the log on old say? [22:45:08] cant connect to the host [22:45:12] now .. [22:45:43] you can't ssh to argon? [22:46:11] problem on codfw elasticsearch cluster identified (thanks ebernhardson!). Temporary fix coming (https://gerrit.wikimedia.org/r/286541) and emails with more details tomorrow after some sleep. 
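The fix that finally worked above — replacing both `\r` and `\n` on the text (after the lstrip) rather than on the list `sp` — can be sketched like this; the function and variable names are illustrative, not the literal udpmxircecho patch:

```python
def sanitize_privmsg(text):
    # The newer irc library raises InvalidCharacters if CR or LF
    # survive into privmsg(); strip both before sending, as was
    # done above. Inner newlines become spaces so no text is lost.
    return text.replace("\r", "").replace("\n", " ").strip()

clean = sanitize_privmsg("line one\r\nline two\n")
```

Doing it unconditionally (no kraz special case) is safe because the old server accepts linebreak-free messages too, as confirmed above.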
[22:46:17] yea, i can almost not work because everything is so slow [22:46:33] 06Operations, 07RfC, 07discovery-system, 05services-tooling: [RFC] Define the on-disk and live structure of etcd pool data - https://phabricator.wikimedia.org/T100793#2258143 (10Danny_B) [22:47:52] gotta fix this too: Configuration file /etc/systemd/system/ircd.service is marked executable. Please remove executable permission bits. Proceeding anyway. [22:48:24] i'm not sure if there is a log for this [22:49:01] (03CR) 10EBernhardson: [C: 032] cirrus: Don't auto-create frozen index [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286542 (owner: 10EBernhardson) [22:50:05] (03PS2) 10EBernhardson: cirrus: Don't auto-create frozen index [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286542 [22:50:16] mutante, I think this new server isn't handling non-ascii properly [22:50:23] #ru.wikipedia is silent on kraz but not argon [22:50:24] (03CR) 10EBernhardson: [C: 032] cirrus: Don't auto-create frozen index [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286542 (owner: 10EBernhardson) [22:50:51] (03Merged) 10jenkins-bot: cirrus: Don't auto-create frozen index [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286542 (owner: 10EBernhardson) [22:51:08] ok, of course it wouldnt just let us solve it [22:51:23] jouncebot, next [22:51:23] In 0 hour(s) and 8 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160502T2300) [22:52:02] we have a new wiki creation in swat Dereckson? [22:52:19] imo those don't go in swat, but should have their own window? [22:52:21] !log ebernhardson@tin Synchronized wmf-config/CirrusSearch-production.php: Config setting to stop auto-creating frozen index in cirrus (duration: 00m 33s) [22:52:23] yes, you can reject it.. [22:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:52:29] what a great day [22:52:45] does it break anything?
[22:53:16] I'm more worried about it taking forever with 5 different patches and scripts to run [22:53:20] I'll do it at the end [22:53:51] oh and I'm pretty sure addWiki.php is broken [22:53:52] why cant it just be partially done [22:54:14] Krenair: that was the second stuff to discuss with mutante, yes [22:54:31] i've got a patch merging right now that will be pushed to wmf.22 in a minute, some issues on the codfw elasticsearch cluster this fixes. Shouldn't get in the way of swat though because you have to wait on merges too :) [22:54:41] i dont understand why we cant do X just because Y also has to be done later [22:55:01] Krenair: now, if the SWAT windows is virtually empty... [22:55:08] ... perhaps there is time to do it. [22:55:17] it has 4 other entries [22:55:22] it's not empty, though it's not full either [22:55:32] they just got added after mine. [22:55:35] just abandon it [22:55:57] What's the addWiki.php issue by the way? [22:56:21] it calls wikidata's site table population script, which tries to import files from a non-existent directory [22:56:39] !log ebernhardson@tin Synchronized php-1.27.0-wmf.22/extensions/CirrusSearch/: Cirrus: Stop auto-creating frozen index (duration: 00m 31s) [22:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:56:53] I uploaded a fix earlier [22:57:58] last time we had issues with CirrusSearch's script (also called by addWiki) timing out due to particularly high search cluster load [22:58:47] we have 16 new servers racked up to fix that, but not in the cluster yet [22:59:03] (well, not just fixing that. but it should) [22:59:12] sounds like a great time for "some issues on the codfw elasticsearch cluster Shouldn't get in the way of swat" [22:59:27] mutante: it's done now, synced a couple minutes ago [22:59:31] ah.. well and my connection is so slow.. 
i see everything 2 minutes later [22:59:47] Krenair: what's the drawback to merge the config part right now by the way without an actual wiki running? [23:00:04] i cant switch focus anymore.. too many things . i'll just fix the first bug about udpmx [23:00:04] RoanKattouw ostriches Krenair Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160502T2300). Please do the needful. [23:00:04] mutante matt_flaschen RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:06] You want to add to the dblists without actually creating the DB? [23:00:11] jouncebot: remove [23:00:31] Present [23:00:33] Okay, so dblists are the issue. Noted. [23:00:38] i did not add them,. others amended [23:00:39] I'll do it today [23:00:44] i want nothing anymore right now [23:01:02] mutante: yes I've added them to the change, indeed [23:01:12] It's pointless doing the rest of the change without the dblist changes, otherwise you'll have to come back and run another mediawiki-config patch later [23:01:24] Or, it sounds like Krenair is doing SWAT today? [23:01:54] here's the thing.. we dnt merge a small change "because it needs X as well". then we dont merge it because "you want to do X too??": [23:02:14] matt_flaschen first [23:03:37] Hmmmm we already had this discussion in a previous wiki creation last year: I offered a change with graphics only, ie the logo files. Krenair, you wanted them to be included in a more comprehensive, mutante you have supported the idea to divide in smaller changes accomplishing one goal. [23:03:51] more comprehensive change [23:05:27] The two methods each have pros and cons I guess. The pro is an unified review and to avoid to forgot one point in the checklist. The cons is that needs the last moment to be merged in full. 
But then, this drawback is mitigated by the fact we would need every patch to achieve wiki creation. [23:08:38] PROBLEM - Disk space on elastic1029 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80614 MB (15% inode=99%) [23:10:39] !log krenair@tin Synchronized php-1.27.0-wmf.22/extensions/Flow/modules/flow/ui/widgets/editor/mw.flow.ui.EditorSwitcherWidget.js: https://gerrit.wikimedia.org/r/#/c/286527/ (duration: 00m 26s) [23:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:11:08] matt_flaschen, ^ [23:11:49] Dereckson: one of the methods makes people volunteer if they know a part of it, the other makes them think "it's gonna be -1 anyways and have 30 patch sets, let others do it" [23:12:17] fixes the IRCbot issue for now [23:15:08] PROBLEM - check_mysql on fdb2001 is CRITICAL: Slave IO: No Slave SQL: No Seconds Behind Master: (null) [23:16:11] Works, thanks Krenair. [23:16:19] PROBLEM - Disk space on elastic1029 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80097 MB (15% inode=99%) [23:18:04] RoanKattouw's changes next [23:21:49] (03PS1) 10Dzahn: udpmxircecho: remove newlines from RC data [puppet] - 10https://gerrit.wikimedia.org/r/286544 (https://phabricator.wikimedia.org/T123729) [23:22:50] (03PS2) 10Dzahn: udpmxircecho: remove newlines from RC data [puppet] - 10https://gerrit.wikimedia.org/r/286544 (https://phabricator.wikimedia.org/T123729) [23:23:37] (03CR) 10Alex Monk: [C: 031] "Technically this could just be adding .rstrip("\n"), but I think it's safe" [puppet] - 10https://gerrit.wikimedia.org/r/286544 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn) [23:24:43] (03PS3) 10Dzahn: udpmxircecho: remove newlines from RC data [puppet] - 10https://gerrit.wikimedia.org/r/286544 (https://phabricator.wikimedia.org/T123729) [23:25:13] (03CR) 10Dzahn: [C: 032] "works like this on kraz" [puppet] - 10https://gerrit.wikimedia.org/r/286544 (https://phabricator.wikimedia.org/T123729) 
(owner: 10Dzahn) [23:25:41] (03CR) 10Dzahn: [C: 04-2] udpmxircecho: remove newlines from RC data [puppet] - 10https://gerrit.wikimedia.org/r/286544 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn) [23:27:09] robh: can you test restbase2004 again [23:27:18] RECOVERY - Host restbase2004 is UP: PING OK - Packet loss = 0%, RTA = 36.08 ms [23:27:19] before i leave the dc once agan [23:29:15] RoanKattouw, want to test this on mw1017 first? [23:29:31] Krenair: It's behind a feature flag that's only true on testwiki [23:29:44] ah [23:29:56] So it should be safe to push out, then let me test it before you deploy the config patch which flips the feature flag [23:30:42] !log krenair@tin Synchronized php-1.27.0-wmf.22/extensions/Echo: https://gerrit.wikimedia.org/r/#/c/286535/ and https://gerrit.wikimedia.org/r/#/c/286532/ (duration: 00m 30s) [23:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:31:22] RoanKattouw, ^ [23:31:23] [23:31:28] I'm an idiot [23:31:33] yeah it needs scap [23:31:34] I had forgotten there was i18n in there [23:31:42] or was there some shorter way to just do the i18n prt? 
[23:31:43] part* [23:32:14] I think there was at least at one point, not sure if it still exists [23:32:16] nope, docs say scap [23:32:16] * RoanKattouw looks [23:32:21] ah, ok [23:32:32] !log krenair@tin Started scap: for https://gerrit.wikimedia.org/r/#/c/286535/ i18n changes [23:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:32:49] PROBLEM - cassandra-a CQL 10.192.32.137:9042 on restbase2004 is CRITICAL: Connection refused [23:33:27] RECOVERY - puppet last run on restbase2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:34:37] RECOVERY - cassandra-a CQL 10.192.32.137:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on port 9042 [23:37:39] !log restart elastic2007, codfw cluster master, to resolve lingering issues after resolving frozen index race condition [23:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:44:25] Krenair: i fixed the unicode issue it looks :) see the ru.wp [23:44:49] nice. what was the change? [23:45:01] import sys [23:45:01] reload(sys) [23:45:02] sys.setdefaultencoding('utf8') [23:45:27] * Krenair is a little concerned by scap being stuck at: sync-common: 0% (ok: 0; fail: 0; left: 401) [23:45:29] mutante, ew [23:45:52] and this is just .. as it is [23:45:54] Messages limited to 512 bytes including CR/LF [23:46:06] there aren't many long ones [23:46:09] PROBLEM - puppet last run on cp2021 is CRITICAL: CRITICAL: Puppet has 1 failures [23:46:43] mwdeploy 9127 7287 0 Apr30 ? 00:00:03 /usr/bin/ssh -oBatchMode=yes -oSetupTimeout=10 -F/dev/null -oUser=mwdeploy mw1134.eqiad.wmnet sudo -u mwdeploy -n -- /usr/bin/scap-rebuild-cdbs --version 1.27.0-wmf.22 [23:46:44] mwdeploy 15050 12645 0 Apr17 ? 00:00:19 /usr/bin/ssh -oBatchMode=yes -oSetupTimeout=10 -F/dev/null -oUser=mwdeploy mw1117.eqiad.wmnet sudo -u mwdeploy -n -- /usr/bin/scap-rebuild-cdbs --version 1.27.0-wmf.21 [23:46:44] mwdeploy 15786 13774 0 Apr24 ? 
00:00:09 /usr/bin/ssh -oBatchMode=yes -oSetupTimeout=10 -F/dev/null -oUser=mwdeploy mw1142.eqiad.wmnet sudo -u mwdeploy -n -- /usr/bin/scap-rebuild-cdbs --version 1.27.0-wmf.21 [23:47:04] from april? [23:48:30] sync-common is unstuck now [23:50:29] (03PS1) 10Dzahn: udpmxircecho: fix utf-8 encoding issue [puppet] - 10https://gerrit.wikimedia.org/r/286546 (https://phabricator.wikimedia.org/T123729) [23:52:36] papaul: its god now [23:52:37] good [23:52:40] thanks' [23:52:45] sorry was afk for a bit [23:52:51] landlord fixing stuff at my place =] [23:53:14] no problem [23:58:07] RECOVERY - Disk space on elastic1029 is OK: DISK OK [23:58:30] !log krenair@tin Finished scap: for https://gerrit.wikimedia.org/r/#/c/286535/ i18n changes (duration: 25m 57s) [23:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:59:41] RoanKattouw, ^ [23:59:50] OK, will test again