[00:00:01] operations, Wikimedia-Mailing-lists, Patch-For-Review: apply regular lists role on fermium and confirm no issues - https://phabricator.wikimedia.org/T109925#1586098 (Dzahn) now still: Aug 28 23:58:51 fermium apache2[23067]: AH00526: Syntax error on line 26 of /etc/apache2/sites-en...onf: Aug 28 23:58:...
[00:00:07] operations, Wikimedia-Mailing-lists, Patch-For-Review: apply regular lists role on fermium and confirm no issues - https://phabricator.wikimedia.org/T109925#1586099 (Dzahn) Resolved>Open
[00:00:09] operations, Wikimedia-Mailing-lists: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1586100 (Dzahn)
[00:00:36] PROBLEM - HTTPS on fermium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused
[00:03:07] ACKNOWLEDGEMENT - HTTPS on fermium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused daniel_zahn T109925#1586098
[00:04:51] jynus: I wonder why the 24h view at https://tendril.wikimedia.org/host/view/db1072.eqiad.wmnet/3306 is really 12hr? :)
[00:05:00] that keeps bugging me, heh
[00:06:12] gwicke: do you want to play with this over the weekend at all? if so i'll find some time (if i can ever log in) later to set it up
[00:10:09] (PS1) John F. Lewis: mailman: correct apache 2.4 usage of require not [puppet] - https://gerrit.wikimedia.org/r/234702
[00:11:52] operations, Analytics-Backlog, Labs, Wikimedia-Apache-configuration, and 2 others: https://wikitech.wikimedia.org/beacon/statsv 404 Not Found - https://phabricator.wikimedia.org/T104359#1586118 (Krenair)
[00:11:56] PROBLEM - Keyholder SSH agent on tin is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it.
[00:11:57] (PS1) Alex Monk: Fix wikitech beacon 204 [puppet] - https://gerrit.wikimedia.org/r/234703 (https://phabricator.wikimedia.org/T104359)
[00:19:10] operations, Analytics-Cluster, Traffic: Secure inter-datacenter web request log (Kafka) traffic - https://phabricator.wikimedia.org/T92602#1586144 (kevinator)
[00:19:13] operations, Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Build 0.8.2.1 Kafka package and upgrade Kafka brokers - https://phabricator.wikimedia.org/T106581#1586140 (kevinator) Open>Resolved
[00:52:58] operations, MediaWiki-extensions-TimedMediaHandler, Multimedia, HHVM, Patch-For-Review: Convert tmh100[12] to HHVM and trusty - https://phabricator.wikimedia.org/T104747#1586319 (brion) As noted on T110707 this config looks very close to ready -- it's almost-functional on beta cluster! But we're...
[00:58:26] PROBLEM - Disk space on cp3020 is CRITICAL: DISK CRITICAL - free space: / 349 MB (3% inode=89%)
[00:59:30] (CR) Dzahn: [C: 2] "thanks for the fix! -> ""when the Require directive is negated it can only fail or return a neutral result, and therefore may never indepe" [puppet] - https://gerrit.wikimedia.org/r/234702 (owner: John F. Lewis)
[01:02:23] (CR) Dzahn: "Aug 29 01:01:18 fermium apache2[1675]: AH00526: Syntax error on line 27 of /etc/apache2/sites-enabled/50-lists-wikimedia-org.conf:" [puppet] - https://gerrit.wikimedia.org/r/234702 (owner: John F. Lewis)
[01:03:48] (CR) Dzahn: "When multiple Require directives are used in a single configuration section and are not contained in another authorization directive like " [puppet] - https://gerrit.wikimedia.org/r/234702 (owner: John F. Lewis)
[01:07:46] RECOVERY - mailman archives on fermium is OK: HTTP OK: HTTP/1.1 200 OK - 59374 bytes in 0.036 second response time
[01:07:49] RECOVERY - HTTPS on fermium is OK: SSL OK - Certificate lists.wikimedia.org valid until 2016-01-31 06:08:49 +0000 (expires in 155 days)
[01:18:54] (PS1) Dzahn: mailman: fix Apache 2.4 require syntax pt.2 [puppet] - https://gerrit.wikimedia.org/r/234708
[01:21:02] (PS2) Dzahn: mailman: fix Apache 2.4 require syntax pt.2 [puppet] - https://gerrit.wikimedia.org/r/234708
[01:21:15] (PS3) Dzahn: mailman: fix Apache 2.4 require syntax pt.2 [puppet] - https://gerrit.wikimedia.org/r/234708
[01:25:38] (CR) Dzahn: [C: 2] mailman: fix Apache 2.4 require syntax pt.2 [puppet] - https://gerrit.wikimedia.org/r/234708 (owner: Dzahn)
[01:47:27] AaronSchulz: http://muratbuffalo.blogspot.com/2014/07/hybrid-logical-clocks.html
[01:57:07] operations, Phabricator, Database, Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1586449 (mmodell) So I've spent a bit of time digging into this issue. I was thinking that it's kind of odd that iridium (phabricator) is consistently receiving...
[02:05:46] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-tools/snapshot is not accessible: Permission denied
[02:21:07] !log l10nupdate@tin Synchronized php-1.26wmf20/cache/l10n: l10nupdate for 1.26wmf20 (duration: 05m 48s)
[02:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:24:02] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf20) at 2015-08-29 02:24:01+00:00
[02:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:25:37] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 20.69% of data above the critical threshold [100000000.0]
[02:51:46] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[04:00:46] RECOVERY - Last backup of the maps filesystem on labstore1002 is OK: OK - Last run successful
[04:17:55] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Aug 29 04:17:55 UTC 2015 (duration 17m 54s)
[04:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:09:36] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100%
[05:10:15] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms
[05:30:33] operations, Wikimedia-Mailing-lists: Let public archives be indexed and archived - https://phabricator.wikimedia.org/T90407#1586541 (Revi) >>! In T90407#1555920, @Dzahn wrote: > Aren't all the public archives on gmane.org anyways and get indexed there? Gmane is opt-in (you have to request inclusion manu...
[05:35:05] PROBLEM - puppet last run on mw2132 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:52:06] PROBLEM - puppet last run on mw1129 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:57:35] RECOVERY - Disk space on labstore1002 is OK: DISK OK
[06:00:45] RECOVERY - puppet last run on mw2132 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:19:46] RECOVERY - puppet last run on mw1129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:30:55] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: puppet fail
[06:32:16] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:06] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:55:55] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:56:25] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:58:36] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:15:01] operations, Security-Reviews, Surveys: Re-evaluate Limesurvey - https://phabricator.wikimedia.org/T109606#1586585 (Moushira) Thanks @csteipp for your help, and indeed, it was @tstarling who got the old version removed. So apparently we should host it on third party servers, what are the next steps...
[08:18:15] PROBLEM - puppet last run on mw2139 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:33:04] (CR) Hashar: "Or for a BetaclusterPuppetSwat ™® :-}" [puppet] - https://gerrit.wikimedia.org/r/234599 (https://phabricator.wikimedia.org/T110707) (owner: BryanDavis)
[08:43:55] RECOVERY - puppet last run on mw2139 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[08:58:10] operations, ops-eqiad, Database: Disk issue on db1028 - https://phabricator.wikimedia.org/T103230#1586640 (jcrespo) Resolved>Open This issue happened again, again the same offender (lots of log entries at 8:46 today): ``` Code: 0x0000005d Class: 0 Locale: 0x02 Event Description: Patrol Read cor...
[09:00:41] operations, ops-eqiad, Database: Disk issue on db1028 - https://phabricator.wikimedia.org/T103230#1586642 (jcrespo) @cmjohnson This one uses 279.396 GB disks, Can we replace the above disk with one of the spares?
[09:04:11] yeah, it is not getting any better
[09:05:28] !log about to depool db1028 due to disk issue
[09:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:14:51] (Abandoned) Reedy: chapcomwiki -> affcomwiki [puppet] - https://gerrit.wikimedia.org/r/169944 (https://bugzilla.wikimedia.org/39482) (owner: Reedy)
[09:15:21] (Abandoned) Reedy: Rename chapcomwiki to affcomwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/169939 (https://bugzilla.wikimedia.org/39482) (owner: Reedy)
[09:15:41] (Abandoned) Reedy: Rebuild beta apache config ontop of production config [puppet] - https://gerrit.wikimedia.org/r/173492 (owner: Reedy)
[09:16:49] (Abandoned) Reedy: Allow faux-renaming/database remapping [mediawiki-config] - https://gerrit.wikimedia.org/r/134962 (owner: Reedy)
[09:19:02] (PS1) Jcrespo: depool db1028; pool es1012, es1005, es1008, increase es1014 weight [mediawiki-config] - https://gerrit.wikimedia.org/r/234719
[09:20:53] (CR) Jcrespo: [C: 2] depool db1028; pool es1012, es1005, es1008, increase es1014 weight [mediawiki-config] - https://gerrit.wikimedia.org/r/234719 (owner: Jcrespo)
[09:21:25] PROBLEM - puppet last run on install2001 is CRITICAL: CRITICAL: puppet fail
[09:28:12] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1028, return ES servers back from maintenance (duration: 00m 03s)
[09:28:15] operations, Wikimedia-Site-Requests, Patch-For-Review: Rename "chapcomwiki" to "affcomwiki" - https://phabricator.wikimedia.org/T41482#1586660 (Reedy) >>! In T41482#1575311, @gerritbot wrote: > Change 233972 had a related patch set uploaded (by Alex Monk): > Add affcom wiki domain to apache config > >...
[09:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:28:41] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1028, return ES servers back from maintenance (duration: 00m 03s)
[09:30:11] !log SCAP failed, cannot depool db1028
[09:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:30:52] What's the error jynus?
[09:31:02] Permission denied (publickey)
[09:31:05] on sync
[09:31:12] just one server?
[09:31:22] sync-common: 100% (ok: 0; fail: 466; left: 0)
[09:31:25] all
[09:31:57] i am not worried about db1018- mw depooled it automatically
[09:32:06] let's see
[09:33:33] I think the other day there was some changing ongoing about keys on scap
[09:33:44] * Reedy notices another problem to file an issue for
[09:34:39] Did you let that run to completion too? As it shouldn't log that it completed...
[09:34:52] true :-)
[09:35:29] well, it completed, it just happens that all servers failed :-)
[09:36:29] Filed that too
[09:36:33] all proxies failed too
[09:36:35] yeah
[09:36:39] that was the first issue I filed
[09:36:46] it shouldn't continue if all the proxies don't sync
[09:36:55] [10:31:21] Deployment-Systems, Release-Engineering: Don't continue scap if sync to all proxies failed - https://phabricator.wikimedia.org/T110791#1586661 (Reedy) NEW
[09:36:55] [10:33:23] Deployment-Systems, Release-Engineering: scap shouldn't log completion (it should log fail!) - https://phabricator.wikimedia.org/T110793#1586675 (Reedy) NEW
[09:37:35] well, as this is not critical, I will let it stay
[09:37:56] servers have the previous version and that is ok for me for now
[09:37:59] Looks like people were syncing last night
[09:38:56] Guess that should be filed for the moment too as broken
[09:39:09] I will do that
[09:39:31] ok, thanks
[09:39:38] no, thank you!
[09:49:16] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:50:23] operations, ops-eqiad, Database: Disk issue on db1028 - https://phabricator.wikimedia.org/T103230#1586710 (jcrespo)
[09:52:41] BTW, thanks jynus from the past for leaving me notes on icinga of past issues. They are very useful!
[10:26:16] jynus: yw
[12:57:26] PROBLEM - Disk space on cp3020 is CRITICAL: DISK CRITICAL - free space: / 349 MB (3% inode=89%)
[13:07:43] Krenair: Mind looking at https://phabricator.wikimedia.org/T109964 and see if I shouldn't have reopened it?
[13:09:30] operations, Phabricator, Database, Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1586764 (Josve05a)
[13:09:48] operations, Phabricator, Database: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109964#1564710 (Josve05a) sorry, read that comment wrong.
[13:13:42] operations, Phabricator, Database: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109964#1586762 (Josve05a) Also: ``` Can Not Connect to MySQL Unable to connect to MySQL! Attempt to connect to phuser@m3-master.eqia...
[13:49:33] operations, Phabricator, Database, Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1586776 (jcrespo) {F2486374}
[14:02:40] operations, Deployment-Systems, Release-Engineering: SCAP fails with Permission denied (publickey) - https://phabricator.wikimedia.org/T110794#1586779 (Krenair)
[14:03:15] operations, Deployment-Systems, Release-Engineering: SCAP fails with Permission denied (publickey) - https://phabricator.wikimedia.org/T110794#1586782 (Krenair) Keyholder issue? ```krenair@tin:~$ SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@mw2001 Permission denied (publickey).```
[14:04:09] operations, Deployment-Systems, Release-Engineering: SCAP fails with Permission denied (publickey) - https://phabricator.wikimedia.org/T110794#1586784 (Krenair) (Works fine from mira.)
[14:15:29] operations, Deployment-Systems, Release-Engineering: SCAP fails with Permission denied (publickey) - https://phabricator.wikimedia.org/T110794#1586793 (Krenair) Yeah, icinga has been showing this for tin's keyholder service for the past 14 hours: CRITICAL: Keyholder is not armed. Run 'keyholder arm' to...
[14:53:05] (PS1) Alex Monk: Fix /static 404s in beta mobile [puppet] - https://gerrit.wikimedia.org/r/234733 (https://phabricator.wikimedia.org/T105541)
[15:05:57] PROBLEM - puppet last run on mw2121 is CRITICAL: CRITICAL: puppet fail
[15:13:03] Phabricator is throwing errors currently.
[15:13:32] yep
[15:13:43] https://phabricator.wikimedia.org/T109964
[15:15:31] https://phabricator.wikimedia.org/T109279 I guess.
[15:15:36] PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 2 down 1
[15:20:27] operations, Phabricator, Database, Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1586850 (jcrespo) These are the process list on one of those peaks: {F2486928}
[15:26:16] !log killing idle mysql connections from phabricator and setting wait and interactive timeout to 60
[15:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:27:29] it won't solve the underlying issue, it may break things, but at least it will avoid the error
[15:30:55] fixed the proxy too, gone for now
[15:31:46] RECOVERY - haproxy failover on dbproxy1003 is OK: OK check_failover servers up 2 down 0
[15:35:56] RECOVERY - puppet last run on mw2121 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:42:39] (CR) BBlack: [C: 1] Fix /static 404s in beta mobile [puppet] - https://gerrit.wikimedia.org/r/234733 (https://phabricator.wikimedia.org/T105541) (owner: Alex Monk)
[16:30:26] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100%
[16:31:06] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms
[16:33:57] (CR) Alex Monk: [C: -1] "Unanswered questions." [mediawiki-config] - https://gerrit.wikimedia.org/r/224771 (owner: Dereckson)
[16:34:36] (CR) Alex Monk: [C: -1] "Unaddressed comment" [mediawiki-config] - https://gerrit.wikimedia.org/r/221731 (https://phabricator.wikimedia.org/T27397) (owner: TheDJ)
[16:36:18] (CR) Alex Monk: [C: -1] "I don't like the syntax this extension provides in wikitext and would prefer it to not be deployed to wikitech." [mediawiki-config] - https://gerrit.wikimedia.org/r/214893 (https://phabricator.wikimedia.org/T100313) (owner: Ladsgroup)
[16:39:30] (PS2) Alex Monk: Creating closed-labs.dblist and closing es.wikipedia.beta.wmflabs.org [mediawiki-config] - https://gerrit.wikimedia.org/r/234594 (https://phabricator.wikimedia.org/T109157) (owner: MarcoAurelio)
[16:41:07] (CR) Alex Monk: [C: 1] "Will get this done on Monday assuming nobody objects." [mediawiki-config] - https://gerrit.wikimedia.org/r/234040 (https://phabricator.wikimedia.org/T76957) (owner: Deskana)
[16:42:12] (CR) Dzahn: [C: 1] Add link to developer app guidelines from dumps pages footer [puppet] - https://gerrit.wikimedia.org/r/234685 (https://phabricator.wikimedia.org/T110742) (owner: Alex Monk)
[16:45:35] (CR) Alex Monk: [C: 1] "I'll get this done on Monday" [mediawiki-config] - https://gerrit.wikimedia.org/r/234594 (https://phabricator.wikimedia.org/T109157) (owner: MarcoAurelio)
[17:01:07] PROBLEM - puppet last run on mw2166 is CRITICAL: CRITICAL: puppet fail
[17:01:55] operations, Phabricator, Database, Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1586919 (ori) The Phabricator daemon log (viewable by running `/srv/phab/phabricator/bin/phd log` on iridium) is full of errors. Is there any causal relationshi...
[17:10:46] operations, Phabricator, Database, Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1586921 (ori) >>! In T109279#1586850, @jcrespo wrote: > These are the process list on one of those peaks: > {F2486928} > > Most connections are idling- I've se...
[17:30:36] RECOVERY - puppet last run on mw2166 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:31:35] PROBLEM - puppet last run on mw1073 is CRITICAL: CRITICAL: Puppet has 1 failures
[18:38:27] (PS2) Yurik: Added maps-cluster referer rules (e.g. Phab) [puppet] - https://gerrit.wikimedia.org/r/234600
[18:59:05] RECOVERY - puppet last run on mw1073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:48:17] operations, Phabricator: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1587131 (Krenair) I guess another thing you'd want to check is who has staff membership/adminship in labs projects when they leave.
[22:17:04] (PS1) Negative24: hiera: Remove phab-02 data [puppet] - https://gerrit.wikimedia.org/r/234808
[22:40:26] (CR) Alex Monk: [C: -1] "Misses foundation.conf, remnant.conf, wikimania.conf, wikimedia.conf." [puppet] - https://gerrit.wikimedia.org/r/217794 (https://phabricator.wikimedia.org/T94570) (owner: Muehlenhoff)
[22:52:31] ori, around?
[22:53:13] tin:/srv/mediawiki-staging/tests/multiversion/MWMultiversionTest.php is owned by root and not writable by anyone else
[23:09:10] (CR) Alex Monk: "In production on tin I found that there are some files you can't write to without being root. In particular:" [tools/scap] - https://gerrit.wikimedia.org/r/224313 (https://phabricator.wikimedia.org/T104826) (owner: BryanDavis)
[23:12:39] (CR) Alex Monk: [C: -1] scap: Add co-master configuration (1 comment) [puppet] - https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: BryanDavis)