[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170112T0000). Please do the needful.
[00:00:04] MarcoAurelio: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:05] (CR) Reedy: [C: 2] Update Kafka analytics broker list for deployment-prep [mediawiki-config] - https://gerrit.wikimedia.org/r/287741 (owner: Ottomata)
[00:00:29] jouncebot: my patch already deployed tnx
[00:00:40] gj
[00:00:43] jouncebot: go away
[00:00:44] Shall we do the other rename one?
[00:01:17] the trwiki one? I have not asked them yet so I'd wait to avoid any drhamah
[00:02:05] (PS4) Reedy: Update Kafka analytics broker list for deployment-prep [mediawiki-config] - https://gerrit.wikimedia.org/r/287741 (owner: Ottomata)
[00:02:10] (CR) Reedy: [C: 2] Update Kafka analytics broker list for deployment-prep [mediawiki-config] - https://gerrit.wikimedia.org/r/287741 (owner: Ottomata)
[00:02:22] TabbyCat: But wikimedians love teh dramas! :D
[00:02:37] I have enough of it for this month
[00:03:02] I'll ask them tomorrow
[00:03:42] (CR) Aaron Schulz: [C: 2] Add DB "shard" column to logstash log entries for labs [mediawiki-config] - https://gerrit.wikimedia.org/r/330612 (owner: Aaron Schulz)
[00:04:16] (Merged) jenkins-bot: Update Kafka analytics broker list for deployment-prep [mediawiki-config] - https://gerrit.wikimedia.org/r/287741 (owner: Ottomata)
[00:04:40] (CR) jenkins-bot: Update Kafka analytics broker list for deployment-prep [mediawiki-config] - https://gerrit.wikimedia.org/r/287741 (owner: Ottomata)
[00:05:28] (PS3) Reedy: Add transitionary config for EducationProgram [mediawiki-config] - https://gerrit.wikimedia.org/r/303383
[00:05:34] (CR) Reedy: [C: 2] Add transitionary config for EducationProgram [mediawiki-config] - https://gerrit.wikimedia.org/r/303383 (owner: Reedy)
[00:05:47] (Merged) jenkins-bot: Add DB "shard" column to logstash log entries for labs [mediawiki-config] - https://gerrit.wikimedia.org/r/330612 (owner: Aaron Schulz)
[00:05:57] (CR) jenkins-bot: Add DB "shard" column to logstash log entries for labs [mediawiki-config] - https://gerrit.wikimedia.org/r/330612 (owner: Aaron Schulz)
[00:06:22] (CR) Chad: [C: 2] Adding language name configuration for Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/315912 (https://phabricator.wikimedia.org/T113408) (owner: Jon Harald Søby)
[00:07:04] (Merged) jenkins-bot: Add transitionary config for EducationProgram [mediawiki-config] - https://gerrit.wikimedia.org/r/303383 (owner: Reedy)
[00:07:10] (PS4) Reedy: Update gallery image bounding box on svwiki to 150x150 [mediawiki-config] - https://gerrit.wikimedia.org/r/304991 (https://phabricator.wikimedia.org/T113877) (owner: Gilles)
[00:07:15] (CR) Reedy: [C: 2] Update gallery image bounding box on svwiki to 150x150 [mediawiki-config] - https://gerrit.wikimedia.org/r/304991 (https://phabricator.wikimedia.org/T113877) (owner: Gilles)
[00:08:09] (CR) jenkins-bot: Add transitionary config for EducationProgram [mediawiki-config] - https://gerrit.wikimedia.org/r/303383 (owner: Reedy)
[00:09:20] (Merged) jenkins-bot: Adding language name configuration for Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/315912 (https://phabricator.wikimedia.org/T113408) (owner: Jon Harald Søby)
[00:10:15] (CR) jenkins-bot: Adding language name configuration for Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/315912 (https://phabricator.wikimedia.org/T113408) (owner: Jon Harald Søby)
[00:12:09] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: Wikidata lang config (duration: 00m 38s)
[00:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:48] !Log restarted apache2 and mysql on bohrium to see if mysql no connection errors disappear
[00:12:58] !log restarted apache2 and mysql on bohrium to see if mysql no connection errors disappear
[00:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:48] (PS5) Reedy: Update gallery image bounding box on svwiki to 150x150 [mediawiki-config] - https://gerrit.wikimedia.org/r/304991 (https://phabricator.wikimedia.org/T113877) (owner: Gilles)
[00:13:59] (CR) Reedy: Update gallery image bounding box on svwiki to 150x150 [mediawiki-config] - https://gerrit.wikimedia.org/r/304991 (https://phabricator.wikimedia.org/T113877) (owner: Gilles)
[00:14:04] (CR) Reedy: [C: 2] Update gallery image bounding box on svwiki to 150x150 [mediawiki-config] - https://gerrit.wikimedia.org/r/304991 (https://phabricator.wikimedia.org/T113877) (owner: Gilles)
[00:15:20] (Merged) jenkins-bot: Update gallery image bounding box on svwiki to 150x150 [mediawiki-config] - https://gerrit.wikimedia.org/r/304991 (https://phabricator.wikimedia.org/T113877) (owner: Gilles)
[00:15:35] (CR) jenkins-bot: Update gallery image bounding box on svwiki to 150x150 [mediawiki-config] - https://gerrit.wikimedia.org/r/304991 (https://phabricator.wikimedia.org/T113877) (owner: Gilles)
[00:15:50] (CR) Addshore: [C: -1] Enable ElectronPdfService extension on metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/324488 (https://phabricator.wikimedia.org/T150943) (owner: Addshore)
[00:15:53] (CR) Addshore: [C: -1] Enable ElectronPdfService extension on dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/324489 (https://phabricator.wikimedia.org/T150942) (owner: Addshore)
[00:16:16] addshore: I was just about to ask about those 2 :)
[00:16:47] ostriches: reedy came over ;)
[00:17:09] They'll get out eventually over the next month!
[00:17:13] Okie dokie
[00:17:33] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:17:33] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:17:47] !log reedy@tin Synchronized wmf-config: More consistency for various commits (duration: 00m 40s)
[00:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:23] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[00:18:23] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[00:18:26] (CR) Chad: [C: 2] noc: Implement noc.wikimedia.org/db.php?format=json [mediawiki-config] - https://gerrit.wikimedia.org/r/331091 (owner: Krinkle)
[00:20:02] (Merged) jenkins-bot: noc: Implement noc.wikimedia.org/db.php?format=json [mediawiki-config] - https://gerrit.wikimedia.org/r/331091 (owner: Krinkle)
[00:20:12] (CR) jenkins-bot: noc: Implement noc.wikimedia.org/db.php?format=json [mediawiki-config] - https://gerrit.wikimedia.org/r/331091 (owner: Krinkle)
[00:20:52] (PS1) Addshore: Enable ElectronPdfService on testwikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/331807
[00:20:57] Reedy: ostriches ^^ that one would be nice though!
[00:21:02] !log demon@tin Synchronized docroot/noc/db.php: (no message) (duration: 00m 39s)
[00:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:13] (CR) Reedy: [C: 2] Enable ElectronPdfService on testwikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/331807 (owner: Addshore)
[00:21:53] RECOVERY - puppet last run on dbproxy1006 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[00:22:41] (Merged) jenkins-bot: Enable ElectronPdfService on testwikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/331807 (owner: Addshore)
[00:22:56] (CR) jenkins-bot: Enable ElectronPdfService on testwikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/331807 (owner: Addshore)
[00:23:03] kaldari: https://gerrit.wikimedia.org/r/#/c/324672/ can that go out?
[00:23:22] Reedy: Nope
[00:23:23] not yet
[00:23:49] kaldari: Mind dropping a -1 on it? :)
[00:23:54] sure
[00:24:06] (CR) Kaldari: [C: -1] "Not ready yet" [mediawiki-config] - https://gerrit.wikimedia.org/r/324672 (https://phabricator.wikimedia.org/T152076) (owner: Kaldari)
[00:24:10] (PS2) Reedy: Use internal url for Ores, move to ProductionServices.php [mediawiki-config] - https://gerrit.wikimedia.org/r/316317 (owner: Giuseppe Lavagetto)
[00:24:34] thanks!
[00:24:41] (PS2) Chad: wikitech: Add oathauth group with oathauth-api-all right [mediawiki-config] - https://gerrit.wikimedia.org/r/327852 (https://phabricator.wikimedia.org/T153487) (owner: BryanDavis)
[00:25:03] (PS1) Aaron Schulz: Include DB shard in production SPI log entries [mediawiki-config] - https://gerrit.wikimedia.org/r/331808
[00:25:26] kaldari: i will buy you a beer when it is out
[00:26:54] (CR) Chad: [C: 2] wikitech: Add oathauth group with oathauth-api-all right [mediawiki-config] - https://gerrit.wikimedia.org/r/327852 (https://phabricator.wikimedia.org/T153487) (owner: BryanDavis)
[00:27:37] (PS2) Aaron Schulz: Include DB shard in production SPI log entries [mediawiki-config] - https://gerrit.wikimedia.org/r/331808
[00:28:03] (CR) Reedy: Use internal url for Ores, move to ProductionServices.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/316317 (owner: Giuseppe Lavagetto)
[00:28:37] (Merged) jenkins-bot: wikitech: Add oathauth group with oathauth-api-all right [mediawiki-config] - https://gerrit.wikimedia.org/r/327852 (https://phabricator.wikimedia.org/T153487) (owner: BryanDavis)
[00:28:50] (CR) jenkins-bot: wikitech: Add oathauth group with oathauth-api-all right [mediawiki-config] - https://gerrit.wikimedia.org/r/327852 (https://phabricator.wikimedia.org/T153487) (owner: BryanDavis)
[00:29:58] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: oathauth group for wikitech (duration: 00m 38s)
[00:29:59] (PS2) Reedy: Set wgSemiprotectedRestrictionLevels for de.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/282471 (https://phabricator.wikimedia.org/T132249) (owner: Dereckson)
[00:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:19] (PS3) Reedy: Set wgSemiprotectedRestrictionLevels for de.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/282471 (https://phabricator.wikimedia.org/T132249) (owner: Dereckson)
[00:30:36] (CR) Reedy: [C: -1] "Consensus needed?" [mediawiki-config] - https://gerrit.wikimedia.org/r/282471 (https://phabricator.wikimedia.org/T132249) (owner: Dereckson)
[00:31:13] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:57:52] (CR) MZMcBride: "I wasn't speaking hypothetically, of course. You're almost certainly noticing the same behavior that I'm seeing, with bot edits such as th" [mediawiki-config] - https://gerrit.wikimedia.org/r/324215 (https://phabricator.wikimedia.org/T154698) (owner: Anomie)
[01:00:04] Deploy window No Deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170112T0100)
[01:00:13] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[01:02:00] (PS3) Krinkle: Include DB shard in production SPI log entries [mediawiki-config] - https://gerrit.wikimedia.org/r/331808 (owner: Aaron Schulz)
[01:07:24] (PS3) Volans: Initial import with the first version [software/cumin] - https://gerrit.wikimedia.org/r/330425 (https://phabricator.wikimedia.org/T154588)
[01:08:47] (PS1) Dereckson: Use directly wgGalleryOptions without wmg [mediawiki-config] - https://gerrit.wikimedia.org/r/331819
[01:09:36] (CR) Krinkle: [C: -1] "Looking at logstash-beta it seems this field is showing up fine, but it does have a warning next to it about "No cache mapping for this fi" [mediawiki-config] - https://gerrit.wikimedia.org/r/331808 (owner: Aaron Schulz)
[01:10:11] elukey: https://gerrit.wikimedia.org/r/#/c/327686/ :)
[01:14:14] (PS4) Dereckson: Configure he.wiki images size [mediawiki-config] - https://gerrit.wikimedia.org/r/31580 (https://phabricator.wikimedia.org/T43712)
[01:15:17] (CR) Dereckson: "PS4: short array syntax, rebased against gilles change, rebased against wmg/wg cleaning change" [mediawiki-config] - https://gerrit.wikimedia.org/r/31580 (https://phabricator.wikimedia.org/T43712) (owner: Dereckson)
[01:16:36] (CR) Dereckson: "To the deployer: this change touches CS and IS. Need a kludge (copy $wg/$wmg in IS or CS) to deploy it. Order matters." [mediawiki-config] - https://gerrit.wikimedia.org/r/331819 (owner: Dereckson)
[01:16:44] (PS1) Brion VIBBER: Add 'webp' package to ImageMagick role [puppet] - https://gerrit.wikimedia.org/r/331820 (https://phabricator.wikimedia.org/T27397)
[01:22:03] (PS2) Dereckson: Explicit dblist name for compact language links [mediawiki-config] - https://gerrit.wikimedia.org/r/315983
[01:22:15] (CR) Dereckson: "PS2: Rebased" [mediawiki-config] - https://gerrit.wikimedia.org/r/315983 (owner: Dereckson)
[01:23:48] (CR) jerkins-bot: [V: -1] Explicit dblist name for compact language links [mediawiki-config] - https://gerrit.wikimedia.org/r/315983 (owner: Dereckson)
[01:27:46] https://commons.wikimedia.org/w/index.php?title=File:Education_sounds.ogg&action=delete
[01:27:58] “A database query error has occurred. This may indicate a bug in the software.[WHbbdQpAADsAAOUkYz8AAAAU] 2017-01-12 01:27:33: Fatal exception of type "DBQueryError””
[01:28:36] File is a 2GB+ ‘audio file’ with an embedded rar archive, and needs to go away.
[01:29:03] "Lock wait timeout exceeded; try restarting transaction"
[01:29:06] Is the error in question
[01:29:18] Odd.
[01:29:45] Revent: That shouldn't be impacted by filesize, that's just the DB query part of deleting...try again?
[01:29:48] * ostriches shrugs
[01:29:54] ostriches: It eventually was deleted, after a significant time delay.
[01:30:22] Yeah, just a slow query :(
[01:30:26] Glad it's gone now
[01:30:34] logged deletion time was 1:24, it gave me the error, and then went away about a minute ago.
[01:31:29] Reedy: Ugh, the MassMessage fix(es) are spamming a bit.
[01:31:38] really?
[01:31:46] Would've thought they would've timed out about then
[01:31:50] https://phabricator.wikimedia.org/P4739
[01:32:01] Saw on cli when running `sql`
[01:32:20] some cache not updated?
[01:32:36] tests/cirrusTest.php also has the list of dblists, joy!
[01:33:04] ostriches: /srv/mediawiki out of date on tin?
[01:33:17] Maybe? Shouldn't be tho with all the syncs we've done
[01:33:21] !log running scap pull on tin
[01:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:34:04] I'm sure I've seen this before
[01:34:05] (PS3) Dereckson: Explicit dblist name for compact language links [mediawiki-config] - https://gerrit.wikimedia.org/r/315983
[01:34:13] Something weird with it not syncing -staging to not staging
[01:34:38] Reedy: yes I've seen it to once, when I was creating a wiki
[01:34:41] too
[01:34:55] ostriches: scap pull fixed it on tin
[01:34:56] (CR) Dereckson: "PS3: +tests/cirrusTest.php" [mediawiki-config] - https://gerrit.wikimedia.org/r/315983 (owner: Dereckson)
[01:34:59] stupid thing
[01:35:08] (CR) jerkins-bot: [V: -1] Explicit dblist name for compact language links [mediawiki-config] - https://gerrit.wikimedia.org/r/315983 (owner: Dereckson)
[01:35:09] Yep, looks clean now
[01:35:16] Should file a bug about that
[01:35:20] if there's not one already
[01:35:38] I thought staging to no staging was the first thing
[01:36:05] Reedy: https://phabricator.wikimedia.org/T152005
[01:36:45] Oh look
[01:37:40] (PS3) Chad: beta: Set $wgLinterStatsdSampleFactor [mediawiki-config] - https://gerrit.wikimedia.org/r/327438 (owner: Legoktm)
[01:38:09] I could've sworn we fixed that, hmm
[01:38:13] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[01:38:13] Reedy: Rebased ^
[01:38:50] Krenair: NocDblistTest::testNocDblists has caught a change I rebased, works like a charm so
[01:39:09] nice
[01:39:13] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 2798855 keys, up 72 days 17 hours - replication_delay is 0
[01:39:39] (PS4) Dereckson: Explicit dblist name for compact language links [mediawiki-config] - https://gerrit.wikimedia.org/r/315983
[01:40:02] Dereckson, Reedy: Raised priority on T152005
[01:40:03] T152005: /srv/mediawiki on tin not being updated when using scap sync-file - https://phabricator.wikimedia.org/T152005
[01:40:18] * Dereckson nods
[01:40:40] thanks for merging that btw guys
[01:40:56] what about https://gerrit.wikimedia.org/r/#/c/298397/ ? :p
[01:40:57] Merged all the things today!
[01:41:50] (CR) Dereckson: "PS4: +noc" [mediawiki-config] - https://gerrit.wikimedia.org/r/315983 (owner: Dereckson)
[01:45:20] https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/mediawiki-config+-label:Code-Review%253C%253D-1+-label:Verified-1 is actually somehow only one page
[01:45:22] impressive
[01:45:58] and your watching query only has two changes
[01:46:29] Krenair: Yep, that was the goal :)
[01:46:57] Heck, even removing the label query still is one page, no Next link :D
[01:47:01] Reedy: Go us!
[01:47:22] (Abandoned) Dereckson: Throttle user edits to 1000 per minute [mediawiki-config] - https://gerrit.wikimedia.org/r/316980 (https://phabricator.wikimedia.org/T56515) (owner: Dereckson)
[01:48:03] (CR) Alex Monk: [C: -1] "temporary -1 while task is still being dealt with by DBA" [mediawiki-config] - https://gerrit.wikimedia.org/r/314792 (https://phabricator.wikimedia.org/T126832) (owner: Dereckson)
[01:51:20] (Abandoned) Chad: MWMultiversion cleanups [puppet] - https://gerrit.wikimedia.org/r/309366 (owner: Chad)
[01:52:04] https://gerrit.wikimedia.org/r/#/c/309742/ ← so how do we call squid.php? Last time we had ReverseProxy.php and CachingProxy.php as proposals
[01:52:06] Why do we repeat the dblist tag reading code in tests/cirrusTest.php
[01:52:17] Krenair: I'm writing a class to store this list
[01:52:21] ok
[01:52:49] Was asking myself the same question when I updated clldefault → compact-language-links dblist change
[01:53:40] yeah that's where I noticed
[01:55:03] (Abandoned) Chad: Enable Education Program extension at urwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/309062 (https://phabricator.wikimedia.org/T144927) (owner: محمد شعیب)
[01:55:34] (CR) Alex Monk: "Is this ready to go?" [mediawiki-config] - https://gerrit.wikimedia.org/r/301129 (https://phabricator.wikimedia.org/T141349) (owner: Jforrester)
[01:56:22] Dereckson: "cachestuff.php" ;-)
[01:56:54] (CR) Chad: "Talked in person, tldr: no" [mediawiki-config] - https://gerrit.wikimedia.org/r/301129 (https://phabricator.wikimedia.org/T141349) (owner: Jforrester)
[01:56:57] cp.php
[01:57:00] varnish.php
[01:57:04] I don't mind really
[01:57:12] squidvarnishcp.php
[01:57:12] :D
[01:57:20] cache-flavor-of-the-year.php
[01:57:49] (CR) Alex Monk: "Let's leave either a commit message marker or a negative CR to indicate that then?" [mediawiki-config] - https://gerrit.wikimedia.org/r/301129 (https://phabricator.wikimedia.org/T141349) (owner: Jforrester)
[01:58:23] Krenair: Context was something something reading team.
[01:58:26] I dunno, ask James_F
[01:58:41] :-)
[01:59:21] (CR) Chad: [C: -2] "Per talking to Krinkle in person, this probably isn't a great idea" [mediawiki-config] - https://gerrit.wikimedia.org/r/228618 (https://phabricator.wikimedia.org/T90612) (owner: Legoktm)
[01:59:53] Operations, ops-codfw: decommission mw2075-2089 to make room for new systems - https://phabricator.wikimedia.org/T154621#2935519 (Papaul)
[02:00:13] Operations, ops-codfw: decommission mw2075-2089 to make room for new systems - https://phabricator.wikimedia.org/T154621#2918062 (Papaul) a:Papaul>RobH
[02:01:21] James_F: Is https://gerrit.wikimedia.org/r/301129 ready to go?
[02:01:56] What Chad said.
[02:02:28] (CR) Alex Monk: "Is this change done/ready?" [mediawiki-config] - https://gerrit.wikimedia.org/r/329762 (owner: Matěj Suchánek)
[02:02:42] James_F: Something something reading team?
[02:02:50] Or I dunno?
[02:02:54] Yup.
[02:03:07] Okay, this greatly clarifies things.
[02:03:09] It's Reading's responsibility and it's stuck in limbo.
[02:03:56] Adding -ownerin:wmf-deployment also hugely reduces the size of this gerrit query
[02:04:32] Though if I recall correctly, also excludes James_F's changes
[02:04:51] Indeed. :-(
[02:04:56] * James_F sniffs.
[02:05:38] (Abandoned) Chad: Allow 'block' AbuseFilterAction on eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/239455 (https://phabricator.wikimedia.org/T113096) (owner: Platonides)
[02:06:08] James_F clutters the review list with his patches that won't land for a year :p
[02:06:38] True.
[02:06:44] Statements of intent. :-)
[02:07:03] See also half the code in VisualEditor.
[02:08:10] At least the backlog is manageable now :D
[02:09:00] Krenair: seems the test doesn't care about wiktionary, wikiquote, etc.
[02:09:14] this is the cirrusTest thing?
[02:09:18] yes
[02:09:49] still probably best to just run the same code
[02:10:20] I concur and list of lists will be more maintainable.
[02:10:33] Dereckson: What's your thoughts on https://gerrit.wikimedia.org/r/308281 ?
[02:11:25] they have more rights than other interface-editor groups
[02:11:38] limits aren't important, abuse filter is
[02:12:50] I'd create a 'technical administrator' group, and use it for ru., here, and all other "we want some sysop but only for technical stuff" variants
[02:14:05] I agree with MarcoAurelio less groups we have, better will be the l10n
[02:14:23] (more a matter to reuse an already well translated label)
[02:15:57] (CR) Chad: "If we still want this, needs a major rebase against master. The arrays have long since been fixed, for example." [mediawiki-config] - https://gerrit.wikimedia.org/r/271936 (owner: Jforrester)
[02:17:04] (CR) Chad: "Is there a reason we can't enable this in beta?" [mediawiki-config] - https://gerrit.wikimedia.org/r/256967 (owner: Paladox)
[02:17:38] (CR) Chad: [C: 2] beta: Set $wgLinterStatsdSampleFactor [mediawiki-config] - https://gerrit.wikimedia.org/r/327438 (owner: Legoktm)
[02:19:05] (Merged) jenkins-bot: beta: Set $wgLinterStatsdSampleFactor [mediawiki-config] - https://gerrit.wikimedia.org/r/327438 (owner: Legoktm)
[02:19:15] (CR) jenkins-bot: beta: Set $wgLinterStatsdSampleFactor [mediawiki-config] - https://gerrit.wikimedia.org/r/327438 (owner: Legoktm)
[02:21:22] * ostriches throws a rock at l10nupdate
[02:27:33] ostriches, https://gerrit.wikimedia.org/r/#/c/330709/ needs rebasing
[02:27:49] Can't land yet anyway
[02:27:52] Dependency hasn't
[02:28:11] Oh, that's the puppet bit
[02:28:25] Well, rebasing doesn't make a difference, Filippo said he wasn't gonna land +deploy it today
[02:30:03] ok
[02:31:08] would be good if ops could do something similar to today but with the puppet repo
[02:31:08] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.7) (duration: 11m 12s)
[02:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:36:24] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Jan 12 02:36:23 UTC 2017 (duration 5m 15s)
[02:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:36:35] ostriches, hey you know what needs doing
[02:36:39] interwiki.php update
[02:36:40] https://phabricator.wikimedia.org/T154920#2930522
[02:36:52] also https://phabricator.wikimedia.org/T154225
[02:37:15] Oh snap, how do I do that again? Been awhile :p
[02:37:31] !log demon@tin Synchronized wmf-config/InitialiseSettings-labs.php: no-op, completeness (duration: 00m 38s)
[02:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:37:44] (PS1) Dereckson: Consolidate database lists list in one place [mediawiki-config] - https://gerrit.wikimedia.org/r/331825
[02:37:47] gotta run dumpInterwiki.php
[02:38:16] specify output file, download it to your mediawiki-config dir, upload as commit
[02:38:18] then deploy
[02:38:58] (CR) jerkins-bot: [V: -1] Consolidate database lists list in one place [mediawiki-config] - https://gerrit.wikimedia.org/r/331825 (owner: Dereckson)
[02:39:14] Download, then upload?
[02:39:22] * ostriches does it all from tin like a boss
[02:39:46] well
[02:40:08] I guess you can use a temporary HTTPS password to upload from tin
[02:40:45] (PS1) Chad: Updating interwiki map [mediawiki-config] - https://gerrit.wikimedia.org/r/331826
[02:41:02] Krenair: I upload changes from tin all the time :p
[02:41:05] Saves me round-trips
[02:41:11] Plz review ^
[02:44:04] (CR) Chad: [C: 2] Updating interwiki map [mediawiki-config] - https://gerrit.wikimedia.org/r/331826 (owner: Chad)
[02:44:15] ostriches: Thanks for the review
[02:44:18] You're welcome
[02:45:24] (Merged) jenkins-bot: Updating interwiki map [mediawiki-config] - https://gerrit.wikimedia.org/r/331826 (owner: Chad)
[02:45:43] (CR) jenkins-bot: Updating interwiki map [mediawiki-config] - https://gerrit.wikimedia.org/r/331826 (owner: Chad)
[02:45:58] lgtm
[02:46:30] !log demon@tin Synchronized wmf-config/interwiki.php: T154225 (duration: 00m 38s)
[02:46:32] why multiversion/vendor/ is commit in the repo? As composer.lock is committed, it should recreate the same content
[02:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:46:35] T154225: Update interwiki map, following edit - https://phabricator.wikimedia.org/T154225
[02:47:35] Because deployment servers can't/won't/shouldn't be downloading things from packagist :)
[02:47:42] cf: mediawiki/vendor
[02:52:02] we should change the topic back to status up
[02:53:29] And with that, I'm out for the night. Later
[03:07:37] (PS2) Dereckson: Consolidate database lists list in one place [mediawiki-config] - https://gerrit.wikimedia.org/r/331825
[03:08:49] (CR) jerkins-bot: [V: -1] Consolidate database lists list in one place [mediawiki-config] - https://gerrit.wikimedia.org/r/331825 (owner: Dereckson)
[03:23:03] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 693.15 seconds
[03:29:03] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 220.98 seconds
[04:37:43] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:38:33] PROBLEM - restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:39:33] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy
[04:39:33] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[04:45:23] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:45:33] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:45:33] PROBLEM - restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:45:43] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:46:33] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy
[04:46:33] RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy
[04:46:33] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[04:57:43] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:58:23] PROBLEM - Host labstore1004 is DOWN: PING CRITICAL - Packet loss = 100%
[04:58:41] ...
[04:58:43] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:58:43] PROBLEM - Host ms-be1006 is DOWN: PING CRITICAL - Packet loss = 100%
[04:58:43] PROBLEM - Host db1055 is DOWN: PING CRITICAL - Packet loss = 100%
[04:58:43] PROBLEM - Host ms-be1007 is DOWN: PING CRITICAL - Packet loss = 100%
[04:58:43] PROBLEM - Host db1056 is DOWN: PING CRITICAL - Packet loss = 100%
[04:58:43] PROBLEM - Host db1088 is DOWN: PING CRITICAL - Packet loss = 100%
[04:58:44] PROBLEM - Host db1051 is DOWN: PING CRITICAL - Packet loss = 100%
[04:58:44] PROBLEM - Host analytics1029 is DOWN: PING CRITICAL - Packet loss = 100%
[04:58:45] PROBLEM - Host db1060 is DOWN: PING CRITICAL - Packet loss = 100%
[04:58:45] PROBLEM - Host db1054 is DOWN: PING CRITICAL - Packet loss = 100%
[04:58:46] PROBLEM - Host db1057 is DOWN: PING CRITICAL - Packet loss = 100%
[04:58:46] PROBLEM - Host db1059 is DOWN: PING CRITICAL - Packet loss = 100%
[04:58:47] PROBLEM - Host es1015 is DOWN: PING CRITICAL - Packet loss = 100%
[04:58:47] PROBLEM - Host analytics1030 is DOWN: PING CRITICAL - Packet loss = 100%
[04:59:03] PROBLEM - configured eth on lvs1001 is CRITICAL: eth2 reporting no carrier.
[04:59:05] PROBLEM - LVS HTTP IPv4 on prometheus.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.25 and port 80: No route to host
[04:59:13] PROBLEM - configured eth on lvs1002 is CRITICAL: eth2 reporting no carrier.
[04:59:13] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - apaches_80 - Could not depool server mw1186.eqiad.wmnet because of too many down!: api-https_443 - Could not depool server mw1198.eqiad.wmnet because of too many down!: api_80 - Could not depool server mw1205.eqiad.wmnet because of too many down!: appservers-https_443 - Could not depool server mw1179.eqiad.wmnet because of too many down!
[04:59:23] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - apaches_80 - Could not depool server mw1265.eqiad.wmnet because of too many down!: api-https_443 - Could not depool server mw1235.eqiad.wmnet because of too many down!: zotero_1969 - Could not depool server sca1003.eqiad.wmnet because of too many down!: api_80 - Could not depool server mw1282.eqiad.wmnet because of too many down!
[04:59:23] RECOVERY - Host labstore1004 is UP: PING WARNING - Packet loss = 61%, RTA = 0.46 ms
[04:59:23] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds
[04:59:33] RECOVERY - Host db1057 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[04:59:33] RECOVERY - Host db1054 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms
[04:59:37] RECOVERY - Host analytics1029 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[04:59:43] RECOVERY - Host analytics1031 is UP: PING OK - Packet loss = 0%, RTA = 26.94 ms
[04:59:55] RECOVERY - LVS HTTP IPv4 on zotero.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.0 200 OK - 62 bytes in 0.006 second response time
[05:00:03] RECOVERY - configured eth on lvs1001 is OK: OK - interfaces up
[05:00:05] RECOVERY - LVS HTTP IPv4 on prometheus.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 156 bytes in 0.005 second response time
[05:00:05] RECOVERY - configured eth on lvs1003 is OK: OK - interfaces up
[05:00:13] RECOVERY - configured eth on lvs1002 is OK: OK - interfaces up
[05:00:13] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3670 bytes in 10.568 second response time
[05:00:23] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy
[05:00:23] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[05:01:03] PROBLEM - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/etcd/flannel - 341 bytes in 0.003 second response time
[05:01:13] PROBLEM - Verify internal DNS from within Tools on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/labs-dns/private - 341 bytes in 0.004 second response time
[05:01:23] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:01:33] PROBLEM - Redis set/get on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/redis - 341 bytes in 0.003 second response time
[05:01:33] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 341 bytes in 0.002 second response time
[05:01:33] PROBLEM - showmount succeeds on a labs instance on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/showmount - 341 bytes in 0.005 second response time
[05:01:35] PROBLEM - NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 341 bytes in 0.003 second response time
[05:01:37] PROBLEM - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/self - 341 bytes in 0.003 second response time
[05:01:43] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 341 bytes in 0.002 second response time
[05:02:03] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[05:02:03] PROBLEM - puppet last run on ms-be1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:02:13] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[05:02:13] PROBLEM - puppet last run on db1060 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:02:13] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[05:02:23] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[05:02:23] RECOVERY - All Flannel etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 24.844 second response time
[05:02:23] RECOVERY - Verify internal DNS from within Tools on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 15.919 second response time
[05:02:33] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.007 second response time
[05:02:33] RECOVERY - Redis set/get on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.010 second response time
[05:02:33] RECOVERY - showmount succeeds on a labs instance on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.024 second response time
[05:02:35] RECOVERY - NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.032 second response time
[05:02:37] RECOVERY - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.032 second response time
[05:02:43] RECOVERY - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.007 second response time
[05:03:13] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:03:32] irc from my phone! (it seems like this was a monitoring bug with false positive alerts and not a site failure.)
[05:04:06] i'll be on laptop in about 10 minutes (unless another opsen shows up) im not comfortable taking laptop out on ac transit bus ;)
[05:04:16] Haha ><
[05:04:26] Yeah, that wouldn't be good...
[05:04:36] ummm hello
[05:05:08] also seems nothing is down, just a bunch of alerts and clears, so not worth getting off the bus at a random stop that isnt mine to try to troubleshoot.
[05:05:10] looks like a monitoring bug
[05:05:11] yeah
[05:05:30] robh: nope i'm around etc, do your thing :)
[05:05:58] cool, i just setup irc on my phone yesterday to connect to my bouncer, so was excuse to have something to do on the bus ;]
[05:06:13] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[05:06:38] ok, out until im home and things look ok anyhow.
[05:08:13] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:10:33] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures.
Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service] [05:11:03] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [05:12:23] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [05:13:13] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [05:14:46] was that a network issue or something? [05:15:23] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [05:15:50] not sure - bunch of NTP CRITICAL: No response from NTP server reports on icinga [05:17:06] I just got a user complaining about replag [05:17:18] db servers becoming unavailable might cause that? [05:17:26] look like almost all the hosts that went "down" are on C2 on eqiad, so maybe the switch got restarted? [05:17:33] Krenair: let me check [05:18:11] Krenair: no DB has lag right now, when was the complain? [05:18:27] I know [05:18:49] around about 05:01-05:06 ish [05:19:19] I'd check if the hosts were on the same rack, yeah [05:21:10] they are [05:23:20] including the master of s1 :( [05:24:13] yeah that going down would drop enwiki into read-only mode [05:24:55] https://status.wikimedia.org/178333/Wiki-platform-[[w:dsb:Main-Page]]-(s3)---UNCACHED <--- FYI ? 
[05:24:57] from -tech [05:25:28] am guessing that was the result of one of the lvs hosts being there [05:26:35] (03PS1) 10Andrew Bogott: Keystone: Turn down default log levels [puppet] - 10https://gerrit.wikimedia.org/r/331830 [05:27:11] Krenair: no, no lvs is in that rack [05:27:51] (03CR) 10Andrew Bogott: [C: 032] Keystone: Turn down default log levels [puppet] - 10https://gerrit.wikimedia.org/r/331830 (owner: 10Andrew Bogott) [05:28:03] hmm [05:28:10] there were some lvs-related alerts in icinga [05:28:16] none for cp hosts [05:28:42] yes, I saw that too, it should be because the eth2 of lvs1001 is connected there but I need to verifyEinsOlogy9-CosmIvity1+RelatTein5$ [05:28:51] pastefail [05:29:03] RECOVERY - puppet last run on ms-be1005 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [05:29:13] RECOVERY - puppet last run on db1060 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [05:29:23] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [05:30:26] cache_text was affected between 4:56 and 5:01, all good on cache_upload [05:37:33] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [06:34:23] PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [06:37:59] Bsadowski1: You've never taken AC Transit, you REALLY don't want to take a laptop out on it, for a wide variety of reasons.. :) [06:39:00] robh: You were wise to not do so. 
:) [06:40:13] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [06:44:03] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [06:51:43] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:51:45] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:52:33] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [06:52:33] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [06:56:03] PROBLEM - Check HHVM threads for leakage on mw1259 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [07:04:03] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [07:04:03] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [07:22:03] RECOVERY - Check HHVM threads for leakage on mw1259 is OK: OK [07:27:23] RECOVERY - Check HHVM threads for leakage on mw1260 is OK: OK [07:28:03] RECOVERY - Check HHVM threads for leakage on mw1168 is OK: OK [07:32:03] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [07:48:13] RECOVERY - Check HHVM threads for leakage on mw1169 is OK: OK [08:36:42] (03CR) 10Alexandros Kosiaris: [C: 032] osm: Use LABS_NETWORKS in ferm rsync rule [puppet] - 10https://gerrit.wikimedia.org/r/331622 (owner: 10Alexandros Kosiaris) [08:36:48] (03PS2) 10Alexandros Kosiaris: osm: Use LABS_NETWORKS in ferm rsync rule [puppet] - 10https://gerrit.wikimedia.org/r/331622 [08:36:54] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] osm: Use LABS_NETWORKS in ferm rsync rule [puppet] - 10https://gerrit.wikimedia.org/r/331622 (owner: 10Alexandros Kosiaris) [08:37:12] (03PS2) 10Alexandros Kosiaris: puppetdb: Do not set up Ganglia in Labs [puppet] - 10https://gerrit.wikimedia.org/r/329329 (https://phabricator.wikimedia.org/T154104) (owner: 10Tim Landscheidt) [08:37:19] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] puppetdb: Do not set up Ganglia in Labs [puppet] - 10https://gerrit.wikimedia.org/r/329329 (https://phabricator.wikimedia.org/T154104) (owner: 10Tim Landscheidt) [08:38:26] good morning [08:38:41] akosiaris: I went crazy yesterday and fixed a bunch of rspec changes :D [08:38:51] rspec tests [08:40:57] (03CR) 10Thiemo Mättig (WMDE): [C: 031] "I had a look at the PropertySuggester source code. It does *not* check if a property exists, because this would be to expensive. 
As long a" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/329762 (owner: 10Matěj Suchánek) [08:41:53] PROBLEM - Check if rsync server is running on labsdb1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name rsync, regex args /usr/bin/rsync --no-detach --daemon [08:43:24] (03CR) 10Alexandros Kosiaris: "It's a system group that exists by default on all installations, and the idea is to use it as is, so there isn't much we can do about the " [puppet] - 10https://gerrit.wikimedia.org/r/331602 (owner: 10Alexandros Kosiaris) [08:47:09] hashar: yes I noticed. I am looking at https://gerrit.wikimedia.org/r/#/c/331677/1/modules/nrpe/spec/defines/monitor_service_spec.rb,unified right now [08:47:24] trying to remember why I had those "pending" [08:48:37] so.. now rspec despite the "pending" [08:48:46] runs the tests ? [08:49:01] and if it is ok it just errors out, if it is not it honors the pending ? [08:49:02] lol [08:49:52] (03CR) 10Alexandros Kosiaris: [C: 032] nrpe: fix spec [puppet] - 10https://gerrit.wikimedia.org/r/331677 (owner: 10Hashar) [08:49:55] (03PS2) 10Alexandros Kosiaris: nrpe: fix spec [puppet] - 10https://gerrit.wikimedia.org/r/331677 (owner: 10Hashar) [08:49:59] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] nrpe: fix spec [puppet] - 10https://gerrit.wikimedia.org/r/331677 (owner: 10Hashar) [08:50:12] akosiaris: so yeah rspec 3 always runs the test [08:50:26] pending() is merely a way to flag that the test is expected to fail [08:50:44] IF the spec passes and it is flagged pending(), then it is marked as failing [08:50:53] because since it passes, it is no longer pending :} [08:50:56] something fixed it up somehow [08:51:00] yeah makes sense [08:51:23] I think nrpe is one of my first rspec module tests [08:51:29] so in theory before refactoring code, someone can write the expected behaviors to achieve and mark them all as pending [08:51:38] probably not the very best ones [08:51:45] thanks for handling this! 
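The pending() behaviour described in the exchange above can be illustrated with a small plain-Ruby model. This is only a sketch of the reporting rule as stated in the chat (the body always runs; a pending example that fails is reported as pending, one that passes is reported as a failure). The helper name `example_status` is hypothetical and is not part of RSpec, and the model does not use RSpec's internals.

```ruby
# Plain-Ruby model of the RSpec 3 "pending" semantics discussed above.
# `example_status` is a hypothetical helper for illustration only; it is
# not part of RSpec and does not use RSpec's internals.
def example_status(pending: false)
  yield                          # always run the example body; it raises on failure
  pending ? :failed : :passed    # a pending example that passes is reported as a failure
rescue StandardError
  pending ? :pending : :failed   # a pending example that fails is reported as pending
end

# Expected behaviour written down before a refactoring: it still fails, so it stays pending.
not_yet_implemented = example_status(pending: true) { raise 'feature missing' }

# The same example once the feature works: the stale pending flag now fails the run.
now_implemented = example_status(pending: true) { true }
```

Under this rule, marking expected behaviours as pending before a refactoring gives a signal when each one starts working: the run fails until the stale pending flag is removed.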
[08:51:51] and whenever they fail, that means a feature has been implemented properly [08:52:13] omg did I just write "not the very best ones" ? [08:52:13] there are a few patches to run all the specs from the root of the repo. Will attempt to have one final nice patch for review [08:52:21] then probably write some doc [08:52:26] that was a translation straight from greek and it's wrong also [08:52:27] lol [08:52:29] and hopefully get Jenkins to run the specs finally [08:52:31] haha [08:52:46] * akosiaris needs more coffee [08:52:54] none of us are native english speakers anyway :) [08:53:11] hashar: so yeah.. generally I've slightly altered my opinion about rspec tests [08:53:21] so e.g. that test is practically stupid [08:53:33] all it does is reimplement the actual class [08:53:38] which is a really simple class [08:53:42] there isn't much to test there [08:53:56] tests in that very simple case only hinder refactoring [08:54:10] and given we don't enforce them anyway via jenkins [08:54:15] people just utterly ignore them [08:54:29] they don't even remember/know they exist [08:54:46] in other cases, with many codepaths (case statements, if/thens and so on) [08:54:52] tests make way more sense [08:55:09] or puppet parser functions and so on [08:55:34] I am thinking we should go through with a process of cleaning up our tree [08:55:42] kill the useless tests [08:55:45] like this one [08:55:53] and then enable the tests in jenkins [08:55:56] (03CR) 10Hashar: "That was supposed to be covered by a RewriteRule https://gerrit.wikimedia.org/r/#/c/322019/5/modules/contint/templates/apache/doc.wikimedi" [puppet] - 10https://gerrit.wikimedia.org/r/331558 (https://phabricator.wikimedia.org/T150727) (owner: 10Krinkle) [08:57:51] akosiaris: yeah duplicating the implementation in a test is worthless [08:57:55] I agree with that and other ops have the same concern [08:58:12] when I refactored the Zuul class to use hiera, I wrote a set of specs and that greatly helped [08:58:29] 
specially to assert the proper hiera key got loaded and the resulting erb template compiled/expanded properly [08:59:30] !log disabling puppet on contint1001 to live hack apache conf ( T150727 ) [08:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:35] T150727: doc.wikimedia.org displays "403 Forbidden" for coverage sub directories - https://phabricator.wikimedia.org/T150727 [09:02:27] 06Operations, 10MediaWiki-Vagrant: Upgrade Vagrant to 1.9.1 in Wikimedia apt for both Trusty and Jessie - https://phabricator.wikimedia.org/T155112#2935815 (10akosiaris) [09:16:35] !log T155112 upload Vagrant 1.9.1 to apt.wikimedia.org/jessie-wikimedia/thirdparty and apt.wikimedia.org/trusty-wikimedia/thirdparty [09:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:40] T155112: Upgrade Vagrant to 1.9.1 in Wikimedia apt for both Trusty and Jessie - https://phabricator.wikimedia.org/T155112 [09:16:58] 07Puppet, 06Labs, 10MediaWiki-Vagrant, 13Patch-For-Review, 15User-bd808: Make role::labs::mediawiki_vagrant work on Debian Jessie host systems - https://phabricator.wikimedia.org/T154340#2935839 (10akosiaris) [09:17:00] 06Operations, 10MediaWiki-Vagrant: Upgrade Vagrant to 1.9.1 in Wikimedia apt for both Trusty and Jessie - https://phabricator.wikimedia.org/T155112#2935836 (10akosiaris) 05Open>03Resolved a:03akosiaris @bd808, yes in fact I 've done it already. [09:24:57] (03PS1) 10Alexandros Kosiaris: oresrdb.svc.eqiad.wmnet: Point to oresrdb1002 [dns] - 10https://gerrit.wikimedia.org/r/331835 [09:31:23] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:38:32] (03PS1) 10Alexandros Kosiaris: Revert "oresrdb.svc.eqiad.wmnet: Point to oresrdb1002" [dns] - 10https://gerrit.wikimedia.org/r/331837 [09:43:22] 06Operations, 10Citoid, 06Services, 10VisualEditor: NIH db misbehaviour causing problems to Citoid - https://phabricator.wikimedia.org/T133696#2935854 (10Mvolz) a:05Mvolz>03None [09:56:30] (03CR) 10Hashar: [V: 031 C: 031] "I have downloaded the pson catalogs for each of the hosts and manually did a diff. They are all noop :-}" [puppet] - 10https://gerrit.wikimedia.org/r/331457 (owner: 10Hashar) [09:59:23] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [10:00:13] (03PS3) 10Hashar: kafka: fix Unrecognized escape sequence '\.' [puppet] - 10https://gerrit.wikimedia.org/r/331451 [10:01:25] (03CR) 10Alexandros Kosiaris: [C: 032] oresrdb.svc.eqiad.wmnet: Point to oresrdb1002 [dns] - 10https://gerrit.wikimedia.org/r/331835 (owner: 10Alexandros Kosiaris) [10:03:43] (03CR) 10Hashar: "Rebased to run the puppet compiler against the kafka1001-1003 and kafka2001-2003 hosts: https://puppet-compiler.wmflabs.org/5083/ it is no" [puppet] - 10https://gerrit.wikimedia.org/r/331451 (owner: 10Hashar) [10:05:00] (03PS4) 10Hashar: kafka: fix Unrecognized escape sequence '\.' [puppet] - 10https://gerrit.wikimedia.org/r/331451 [10:20:03] PROBLEM - DPKG on oresrdb1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:23:13] RECOVERY - DPKG on oresrdb1001 is OK: All packages OK [10:23:13] PROBLEM - puppet last run on lvs1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:24:19] (03CR) 10Hashar: "$whitelist_tail was missing the underscore. 
kafka2003.codfw.wmnet catalog compilation failed twice, might be unrelated(?)" [puppet] - 10https://gerrit.wikimedia.org/r/331451 (owner: 10Hashar) [10:24:43] (03CR) 10Hashar: [V: 031 C: 031] "Ah Unable to find facts for host kafka2003.codfw.wmnet, skipping :}" [puppet] - 10https://gerrit.wikimedia.org/r/331451 (owner: 10Hashar) [10:36:13] PROBLEM - puppet last run on mc1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:37:48] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "oresrdb.svc.eqiad.wmnet: Point to oresrdb1002" [dns] - 10https://gerrit.wikimedia.org/r/331837 (owner: 10Alexandros Kosiaris) [10:48:06] (03PS1) 10Odder: Add noratelimit user right to translation admins on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331841 (https://phabricator.wikimedia.org/T155162) [10:50:29] (03CR) 10Odder: [C: 04-1] "Let's wait for community consensus or at least an announcement on a local Village Pump before this is deployed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331841 (https://phabricator.wikimedia.org/T155162) (owner: 10Odder) [10:50:48] 07Puppet, 10Continuous-Integration-Config, 13Patch-For-Review: rake-jessie tests check .pp files but are not triggered by .pp file changes - https://phabricator.wikimedia.org/T153013#2935931 (10hashar) 05Open>03Resolved a:03Paladox [10:51:13] RECOVERY - puppet last run on lvs1001 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [10:53:58] (03PS1) 10Alexandros Kosiaris: ores: Add redis database to client_hosts hiera key [puppet] - 10https://gerrit.wikimedia.org/r/331843 [10:55:18] (03CR) 10Alexandros Kosiaris: [C: 032] ores: Add redis database to client_hosts hiera key [puppet] - 10https://gerrit.wikimedia.org/r/331843 (owner: 10Alexandros Kosiaris) [11:05:13] RECOVERY - puppet last run on mc1003 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [11:34:23] PROBLEM - puppet last run on sca1004 
is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service] [11:52:23] PROBLEM - puppet last run on wtp1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:57:44] (03PS1) 10Urbanecm: [cleanup] Remove old throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331844 [12:01:29] (03Abandoned) 10Urbanecm: [cleanup] Remove old throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331844 (owner: 10Urbanecm) [12:03:23] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [12:09:33] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service] [12:20:23] RECOVERY - puppet last run on wtp1021 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [12:37:33] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [12:42:27] (03PS2) 10Hashar: mirrors: fix spec [puppet] - 10https://gerrit.wikimedia.org/r/331639 [12:45:20] (03PS1) 10Urbanecm: Add HD logos for several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331846 (https://phabricator.wikimedia.org/T150618) [13:04:42] (03PS1) 10Hashar: bacula: fix spec [puppet] - 10https://gerrit.wikimedia.org/r/331847 [13:10:18] (03PS1) 10Hashar: backup: fix spec [puppet] - 10https://gerrit.wikimedia.org/r/331848 [13:31:55] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198 [13:35:36] PROBLEM - Redis status tcp_6381 
on rdb2004 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.16.123 on port 6381 [13:36:36] RECOVERY - Redis status tcp_6381 on rdb2004 is OK: OK: REDIS 2.8.17 on 10.192.16.123:6381 has 1 databases (db0) with 7423080 keys, up 73 days 5 hours - replication_delay is 0 [13:36:36] PROBLEM - Redis status tcp_6379 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6379 [13:36:36] PROBLEM - Redis status tcp_6380 on rdb2004 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.16.123 on port 6380 [13:37:06] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:37:06] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:37:26] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [13:37:26] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 37 probes of 400 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [13:37:26] RECOVERY - Redis status tcp_6379 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6379 has 1 databases (db0) with 2808630 keys, up 73 days 5 hours - replication_delay is 0 [13:37:36] RECOVERY - Redis status tcp_6380 on rdb2004 is OK: OK: REDIS 2.8.17 on 10.192.16.123:6380 has 1 databases (db0) with 7510819 keys, up 73 days 5 hours - replication_delay is 0 [13:37:56] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [13:37:56] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [13:38:16] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 55 not-conn: cp2005_v4 [13:38:16] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 55 not-conn: cp2011_v4 [13:38:16] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 53 not-conn: cp2005_v4, cp2011_v4, cp2017_v4 [13:38:26] RECOVERY - 
Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 2807764 keys, up 73 days 5 hours - replication_delay is 0 [13:38:26] PROBLEM - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp2004_v4, cp2016_v4 [13:38:36] PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp2001_v4, cp2004_v4, cp2019_v4, cp2023_v4 [13:39:06] PROBLEM - IPsec on cp1046 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp2003_v4, cp2015_v4 [13:39:06] PROBLEM - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - ok: 43 not-conn: cp2019_v4 [13:39:06] PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 43 not-conn: cp2016_v4 [13:39:06] PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp2009_v4 [13:39:06] PROBLEM - IPsec on cp1059 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp2003_v4, cp2021_v4 [13:39:07] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 55 not-conn: cp2022_v4 [13:39:07] PROBLEM - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp2004_v4, cp2010_v4, cp2013_v4, cp2023_v4 [13:39:08] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 55 not-conn: cp2022_v4 [13:39:08] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp2010_v4, cp2019_v4 [13:39:16] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 55 not-conn: cp2008_v4 [13:39:16] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 55 not-conn: cp2011_v4 [13:39:16] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2008_v4, cp2024_v4 [13:42:26] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 2 probes of 400 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [13:43:16] RECOVERY - IPsec on cp1047 is OK: Strongswan OK - 24 ESP OK [13:46:06] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 44 ESP OK [13:46:47] (03PS12) 
10Hashar: Modification of Rakefile spec entry point [puppet] - 10https://gerrit.wikimedia.org/r/282484 (https://phabricator.wikimedia.org/T78342) (owner: 10Nicko) [13:46:50] (03PS1) 10Hashar: (DO NOT SUBMIT) Octopus merge of spec fixes [puppet] - 10https://gerrit.wikimedia.org/r/331850 [13:48:18] (03PS5) 10Hashar: Use task to run modules spec [puppet] - 10https://gerrit.wikimedia.org/r/307223 [13:48:26] RECOVERY - IPsec on cp1055 is OK: Strongswan OK - 44 ESP OK [13:48:45] (03CR) 10jerkins-bot: [V: 04-1] Modification of Rakefile spec entry point [puppet] - 10https://gerrit.wikimedia.org/r/282484 (https://phabricator.wikimedia.org/T78342) (owner: 10Nicko) [13:50:36] RECOVERY - IPsec on cp1068 is OK: Strongswan OK - 44 ESP OK [13:50:54] (03CR) 10jerkins-bot: [V: 04-1] Use task to run modules spec [puppet] - 10https://gerrit.wikimedia.org/r/307223 (owner: 10Hashar) [13:51:16] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 56 ESP OK [13:51:43] 06Operations, 10Continuous-Integration-Config, 13Patch-For-Review: Create a basic RSpec unit test for operations/puppet - https://phabricator.wikimedia.org/T78342#2936094 (10hashar) [13:52:16] RECOVERY - IPsec on cp1059 is OK: Strongswan OK - 24 ESP OK [13:52:38] 06Operations, 10Continuous-Integration-Config, 13Patch-For-Review: Create a basic RSpec unit test for operations/puppet - https://phabricator.wikimedia.org/T78342#842870 (10hashar) I have been sprinting that a bit this week. Namely fixed a few spec and rebased Nico patch. Results are under the Gerrit topic... [13:53:16] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 56 ESP OK [13:53:36] PROBLEM - puppet last run on ocg1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:57:07] (03PS1) 10Hashar: (DO NOT SUBMIT) test git submodule update [puppet] - 10https://gerrit.wikimedia.org/r/331853 [13:57:16] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 56 ESP OK [13:59:16] RECOVERY - IPsec on cp1054 is OK: Strongswan OK - 44 ESP OK [13:59:24] (03Abandoned) 10Hashar: (DO NOT SUBMIT) test git submodule update [puppet] - 10https://gerrit.wikimedia.org/r/331853 (owner: 10Hashar) [14:00:16] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 56 ESP OK [14:03:01] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/282484 (https://phabricator.wikimedia.org/T78342) (owner: 10Nicko) [14:03:16] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 44 ESP OK [14:03:16] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 56 ESP OK [14:03:16] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 56 ESP OK [14:04:06] RECOVERY - IPsec on cp1046 is OK: Strongswan OK - 24 ESP OK [14:04:06] RECOVERY - IPsec on cp1053 is OK: Strongswan OK - 44 ESP OK [14:05:34] akosiaris: rspec of all modules on CI and passing!! https://integration.wikimedia.org/ci/job/operations-puppet-rake-jessie/2726/console :} [14:06:16] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 56 ESP OK [14:09:16] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 56 ESP OK [14:21:36] RECOVERY - puppet last run on ocg1001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [14:22:03] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/307223 (owner: 10Hashar) [14:55:32] (03PS1) 10Hashar: (WIP) Jenkins integration of rspec [puppet] - 10https://gerrit.wikimedia.org/r/331856 [15:03:16] PROBLEM - puppet last run on analytics1039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:27:18] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, and 3 others: Commons video transcoders have over 6500 tasks in the backlog. 
- https://phabricator.wikimedia.org/T153488#2936350 (10zhuyifei1999) (Un-stalled the requests for server side uploads) [15:28:27] (03PS2) 10Hashar: Jenkins integration of rspec [puppet] - 10https://gerrit.wikimedia.org/r/331856 (https://phabricator.wikimedia.org/T78342) [15:31:36] RECOVERY - puppet last run on analytics1039 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [15:34:27] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/330470 (owner: 10Hashar) [15:35:16] (03CR) 10jerkins-bot: [V: 04-1] build: update rubocop to 0.39 and tweak config [puppet] - 10https://gerrit.wikimedia.org/r/330470 (owner: 10Hashar) [15:39:02] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/330470 (owner: 10Hashar) [15:40:18] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/331856 (https://phabricator.wikimedia.org/T78342) (owner: 10Hashar) [15:45:27] 06Operations, 10Continuous-Integration-Config, 13Patch-For-Review: Create a basic RSpec unit test for operations/puppet - https://phabricator.wikimedia.org/T78342#2936364 (10hashar) [15:45:42] 06Operations, 10Continuous-Integration-Config, 13Patch-For-Review: Create a basic RSpec unit test for operations/puppet - https://phabricator.wikimedia.org/T78342#842870 (10hashar) 05stalled>03Open a:03hashar [15:51:16] (03PS1) 10Paladox: Update mysql-connector-java to 5.1.40 [debs/gerrit] - 10https://gerrit.wikimedia.org/r/331863 [15:54:01] (03CR) 10Alexandros Kosiaris: [C: 032] bacula: fix spec [puppet] - 10https://gerrit.wikimedia.org/r/331847 (owner: 10Hashar) [15:54:23] (03CR) 10Alexandros Kosiaris: [C: 032] backup: fix spec [puppet] - 10https://gerrit.wikimedia.org/r/331848 (owner: 10Hashar) [15:54:30] (03PS2) 10Alexandros Kosiaris: backup: fix spec [puppet] - 10https://gerrit.wikimedia.org/r/331848 (owner: 10Hashar) [15:54:32] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] backup: fix spec [puppet] - 10https://gerrit.wikimedia.org/r/331848 (owner: 
10Hashar) [15:55:08] akosiaris: I am going to write a note to the ops list [15:55:33] but in short the holy grail is having rspec craft JUnit reports for Jenkins to interpret [15:55:35] https://integration.wikimedia.org/ci/job/operations-puppet-rake-jessie/2732/testReport/ \O/ [15:56:03] hashar: \o/ [15:56:22] I doubt our specs are of any help though :( [15:56:39] and there are a few modules that are highly coupled with everything. tilerator is an example [15:56:52] you end up needing base / conftool / etcd etc [15:56:52] :( [15:57:10] maybe in the modules we could use mock modules that are essentially empty [15:57:26] and at the root of puppet.git, have integration tests that play with all modules [15:59:59] so, have you seen the new RFC ? [16:00:08] all that coupling should become more loose [16:00:16] the role/profile/module paradigm [16:00:26] allows coupling such things more loosely [16:00:34] yeah [16:00:50] I thought Joe's change was rather complicated [16:00:57] and merely moving bits around / introducing yet another level of inception [16:01:08] at first yeah, that's what it looks like [16:01:12] but now that I have looked a bit more at our modules coupling it makes total sense [16:01:21] exactly [16:01:22] so each module would have specs that at worst just use stdlib and wmflib [16:01:50] maybe some other very intrusive ones like some defines from apache [16:01:53] or monitoring [16:02:05] but the scope of these is meant to be very very limited [16:02:13] the roles using several modules, the spec helper would point to /modules (eg do not use the fixture system and just use any module) [16:02:25] and at root of the profile module, we would get some end to end integration tests [16:02:52] like profile mediawiki::appserver it { should contain_package('hhvm').with_ensure('3.4.4') } [16:02:55] or something like that [16:03:20] monitoring I don't think it should be done in modules [16:03:26] PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL 
slave_sql_lag Replication lag: 306.13 seconds
[16:03:33] but yeah overall that is exciting
[16:04:36] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[16:06:16] hashar: generally speaking ? yes it shouldn't
[16:06:34] but it may not be easy to find the easiest construct for that
[16:06:36] that is a things to fix for the future generation
[16:06:45] monitoring is a very weird thing
[16:06:58] you want to be very pervasive and present without even caring for it
[16:07:13] which when testing, makes everything go awry
[16:07:46] it will need some better experience with the paradigm and some design
[16:10:09] (CR) Paladox: "I got the jar directly from https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.40/" [debs/gerrit] - https://gerrit.wikimedia.org/r/331863 (owner: Paladox)
[16:13:40] (PS2) Hashar: (DO NOT SUBMIT) Octopus merge of spec fixes [puppet] - https://gerrit.wikimedia.org/r/331850
[16:13:43] (Draft1) Paladox: Gerrit: Remove mysql-connection-java apt package [puppet] - https://gerrit.wikimedia.org/r/331864
[16:13:46] (PS2) Paladox: Gerrit: Remove mysql-connection-java apt package [puppet] - https://gerrit.wikimedia.org/r/331864
[16:14:27] (PS13) Hashar: Modification of Rakefile spec entry point [puppet] - https://gerrit.wikimedia.org/r/282484 (https://phabricator.wikimedia.org/T78342) (owner: Nicko)
[16:14:43] (PS6) Hashar: Use task to run modules spec [puppet] - https://gerrit.wikimedia.org/r/307223
[16:14:51] (PS3) Hashar: Jenkins integration of rspec [puppet] - https://gerrit.wikimedia.org/r/331856 (https://phabricator.wikimedia.org/T78342)
[16:15:11] (CR) Paladox: "@Chad should i remove depends on since we can merge this as it won't remove
mysql-connection-java jar as it would need manual removal." [puppet] - https://gerrit.wikimedia.org/r/331864 (owner: Paladox)
[16:15:22] akosiaris: I will let things settle a bit and write about rspec tomorrow. Thank you for the reviews! ;}
[16:15:36] hashar: thanks as well
[16:15:37] !
[16:15:40] :-)
[16:18:59] eek
[16:19:08] the lvm module relies on stdlib
[16:19:27] and the .fixtures.yml file uses the github repo https://github.com/puppetlabs/puppetlabs-stdlib.git
[16:19:31] not quite what we have ;}
[16:23:04] (CR) Chad: "Is this available via debian unstable or testing perhaps? Using the provided debian package is kind of nice...less stuff to bundle in the " [debs/gerrit] - https://gerrit.wikimedia.org/r/331863 (owner: Paladox)
[16:23:32] (CR) Chad: [C: -1] "See comments I left on the dependency" [puppet] - https://gerrit.wikimedia.org/r/331864 (owner: Paladox)
[16:24:16] (CR) Paladox: "> Is this available via debian unstable or testing perhaps? Using the" [debs/gerrit] - https://gerrit.wikimedia.org/r/331863 (owner: Paladox)
[16:25:26] RECOVERY - MariaDB Slave Lag: s2 on db1047 is OK: OK slave_sql_lag Replication lag: 46.95 seconds
[16:32:36] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[16:34:38] (CR) Chad: "Yeah, I saw the release notes...I'm just curious though in the delta between 5.1.21 and 5.1.40, are there particular bugfixes or features " [debs/gerrit] - https://gerrit.wikimedia.org/r/331863 (owner: Paladox)
[16:35:26] (CR) Paladox: "Well between those releases it fixes something to do if you set utf8mb4 on the server jdbc did not work correctly." [debs/gerrit] - https://gerrit.wikimedia.org/r/331863 (owner: Paladox)
[16:35:36] (CR) Paladox: "also adds something to do with alter table."
[debs/gerrit] - https://gerrit.wikimedia.org/r/331863 (owner: Paladox)
[16:54:43] Operations, ops-codfw: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#2936454 (Papaul)
[16:55:35] Operations, ops-codfw: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#2936470 (Papaul)
[16:57:53] (PS1) Alexandros Kosiaris: osm: Fix osm rsync server check [puppet] - https://gerrit.wikimedia.org/r/331878
[17:04:36] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:09:54] (CR) Alexandros Kosiaris: [C: 2] osm: Fix osm rsync server check [puppet] - https://gerrit.wikimedia.org/r/331878 (owner: Alexandros Kosiaris)
[17:10:02] (PS2) Alexandros Kosiaris: osm: Fix osm rsync server check [puppet] - https://gerrit.wikimedia.org/r/331878
[17:10:45] (CR) Alexandros Kosiaris: [V: 2 C: 2] osm: Fix osm rsync server check [puppet] - https://gerrit.wikimedia.org/r/331878 (owner: Alexandros Kosiaris)
[17:12:46] Operations, ops-codfw: codfw:mw2251-mw2260 switch port configuration - https://phabricator.wikimedia.org/T155181#2936476 (Papaul)
[17:17:25] Operations, ops-codfw: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#2936490 (Papaul) p:Triage>Normal
[17:19:36] PROBLEM - puppet last run on restbase1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:26:28] (CR) Anomie: "> but suddenly they're making edits while logged out with a higher frequency."
[mediawiki-config] - https://gerrit.wikimedia.org/r/324215 (https://phabricator.wikimedia.org/T154698) (owner: Anomie)
[17:32:36] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[17:33:16] RECOVERY - Check if rsync server is running on labsdb1006 is OK: PROCS OK: 1 process with command name rsync, regex args /usr/bin/rsync --no-detach --daemon
[17:43:35] Operations, IDS-extension, Wikimedia-Extension-setup, I18n: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693#2936504 (Arthur2e5) Yes.
[17:47:36] RECOVERY - puppet last run on restbase1014 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[17:52:50] Hi, the Notifications do not work at dewiki right now
[17:53:01] has anybody an idea?
[18:08:36] (CR) Chad: [C: 2] Add HD logos for several projects [mediawiki-config] - https://gerrit.wikimedia.org/r/331846 (https://phabricator.wikimedia.org/T150618) (owner: Urbanecm)
[18:10:19] (Merged) jenkins-bot: Add HD logos for several projects [mediawiki-config] - https://gerrit.wikimedia.org/r/331846 (https://phabricator.wikimedia.org/T150618) (owner: Urbanecm)
[18:10:37] (CR) jenkins-bot: Add HD logos for several projects [mediawiki-config] - https://gerrit.wikimedia.org/r/331846 (https://phabricator.wikimedia.org/T150618) (owner: Urbanecm)
[18:12:03] !log demon@tin Synchronized static/images/project-logos: HD logos for (nap|os|pl|pt)wiki (duration: 00m 39s)
[18:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:13:05] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: Use HD logos for (nap|os|pl|pt)wiki (duration: 00m 41s)
[18:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:36] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[18:57:36] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[19:09:16] PROBLEM - Juniper alarms on mr1-eqiad is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.154.199
[19:10:07] RECOVERY - Juniper alarms on mr1-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms
[19:14:13] (PS1) Papaul: DNS: Add mgmt and prodcution DNS entres for mw2251-mw2260 Fix: Putting server in alphabetical order Bug:T155180 [dns] - https://gerrit.wikimedia.org/r/331903
[19:38:02] (Draft1) Paladox: Test: Do not merge [debs/gerrit] - https://gerrit.wikimedia.org/r/331873
[19:38:06] (Draft2) Paladox: Test: Do not merge [debs/gerrit] - https://gerrit.wikimedia.org/r/331873
[19:38:10] (Draft3) Paladox: Test: Do not merge [debs/gerrit] - https://gerrit.wikimedia.org/r/331873
[19:38:16] (Draft4) Paladox: Test: Do not merge [debs/gerrit] - https://gerrit.wikimedia.org/r/331873
[19:38:20] (Draft5) Paladox: Test: Do not merge [debs/gerrit] - https://gerrit.wikimedia.org/r/331873
[19:38:24] (Draft6) Paladox: Test: Do not merge [debs/gerrit] - https://gerrit.wikimedia.org/r/331873
[19:38:29] (Draft7) Paladox: Test: Do not merge [debs/gerrit] - https://gerrit.wikimedia.org/r/331873
[19:38:34] (Draft8) Paladox: Test: Do not merge [debs/gerrit] - https://gerrit.wikimedia.org/r/331873
[19:38:39] (Draft9) Paladox: Test: Do not merge [debs/gerrit] - https://gerrit.wikimedia.org/r/331873
[19:38:44] (Draft10) Paladox: Test: Do not merge [debs/gerrit] - https://gerrit.wikimedia.org/r/331873
[19:38:52] (Draft11) Paladox: Test: Do not merge [debs/gerrit] - https://gerrit.wikimedia.org/r/331873
[19:38:55] (Draft12) Paladox: Test: Do not merge [debs/gerrit] - https://gerrit.wikimedia.org/r/331873
[19:38:59] (Draft13) Paladox: Test: Do not merge [debs/gerrit] -
https://gerrit.wikimedia.org/r/331873
[19:39:04] (Draft14) Paladox: Test: Do not merge [debs/gerrit] - https://gerrit.wikimedia.org/r/331873
[19:39:08] (Draft15) Paladox: Test: Do not merge [debs/gerrit] - https://gerrit.wikimedia.org/r/331873
[19:39:12] (Draft16) Paladox: Test: Do not merge [debs/gerrit] - https://gerrit.wikimedia.org/r/331873
[19:39:16] (Draft17) Paladox: Test: Do not merge [debs/gerrit] - https://gerrit.wikimedia.org/r/331873
[19:39:19] (Draft18) Paladox: Test: Do not merge [debs/gerrit] - https://gerrit.wikimedia.org/r/331873
[19:39:23] (Draft19) Paladox: Test: Do not merge [debs/gerrit] - https://gerrit.wikimedia.org/r/331873
[19:39:37] (PS20) Paladox: Fix debian's lintian test [debs/gerrit] - https://gerrit.wikimedia.org/r/331873
[19:45:53] (CR) Krinkle: [C: -1] "Yeah, because afaik DirectoryIndex is about the current directory (e.g. how your typical index.php/index.html file works). Whereas what we" [puppet] - https://gerrit.wikimedia.org/r/331558 (https://phabricator.wikimedia.org/T150727) (owner: Krinkle)
[19:46:14] (CR) Krinkle: [C: -1] "I wonder what problem the rewrite rule caused in Apache 2.4? Afaik that should work just fine in either Apache version."
[puppet] - https://gerrit.wikimedia.org/r/331558 (https://phabricator.wikimedia.org/T150727) (owner: Krinkle)
[20:06:22] (PS21) Paladox: Fix debian's lintian test [debs/gerrit] - https://gerrit.wikimedia.org/r/331873
[20:18:00] (PS22) Paladox: Fix debian's lintian test [debs/gerrit] - https://gerrit.wikimedia.org/r/331873
[20:18:39] (CR) Paladox: "recheck" [debs/gerrit] - https://gerrit.wikimedia.org/r/331873 (owner: Paladox)
[20:20:10] (PS23) Paladox: Fix debian's lintian test [debs/gerrit] - https://gerrit.wikimedia.org/r/331873
[20:26:19] (PS24) Paladox: Fix debian's lintian test [debs/gerrit] - https://gerrit.wikimedia.org/r/331873
[20:36:13] (PS25) Paladox: Fix debian's lintian test [debs/gerrit] - https://gerrit.wikimedia.org/r/331873
[20:47:08] (PS26) Paladox: Fix debian's lintian test [debs/gerrit] - https://gerrit.wikimedia.org/r/331873
[21:02:06] (PS27) Paladox: Fix debian's lintian test [debs/gerrit] - https://gerrit.wikimedia.org/r/331873
[21:02:56] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0 xe-3/2/3: down - Core: cr2-codfw:xe-5/0/1 (Zayo, OGYX/120003//ZYO) 36ms {#2909} [10Gbps wave]
[21:09:17] (PS1) Filippo Giunchedi: cassandra: add jmx_exporter to Cassandra in deployment-prep [puppet] - https://gerrit.wikimedia.org/r/331911 (https://phabricator.wikimedia.org/T155120)
[21:09:48] (PS28) Paladox: Fix debian's lintian test [debs/gerrit] - https://gerrit.wikimedia.org/r/331873
[21:10:22] (PS29) Paladox: Fix debian's lintian test [debs/gerrit] - https://gerrit.wikimedia.org/r/331873
[21:10:50] (CR) Paladox: "recheck" [debs/gerrit] - https://gerrit.wikimedia.org/r/331873 (owner: Paladox)
[21:15:03] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/5086/" [puppet] - https://gerrit.wikimedia.org/r/331911 (https://phabricator.wikimedia.org/T155120)
(owner: Filippo Giunchedi)
[21:30:47] (PS30) Paladox: Fix debian's lintian test [debs/gerrit] - https://gerrit.wikimedia.org/r/331873
[21:36:26] PROBLEM - puppet last run on mw1242 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:39:29] Operations, Commons, TimedMediaHandler-Transcode, Wikimedia-Video, and 3 others: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2936925 (matmarex)
[21:46:13] (CR) Paladox: "@Chad i think this is ready. Dosent fix all lintian failures, but fixes some of them :)" [debs/gerrit] - https://gerrit.wikimedia.org/r/331873 (owner: Paladox)
[22:04:26] RECOVERY - puppet last run on mw1242 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[22:08:31] (CR) Hashar: [C: -1] "When invoking "rake spec:all", if one of the module fail it does not run the others modules spec. And in parallel mode "rake -m", that is" [puppet] - https://gerrit.wikimedia.org/r/307223 (owner: Hashar)
[22:08:50] !log maxsem@tin Synchronized php-1.29.0-wmf.7/extensions/Graph/includes/ApiGraph.php: Debug for T155057 (duration: 00m 38s)
[22:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:08:54] T155057: Graph: First parameter must either be an object or the name of an existing class - https://phabricator.wikimedia.org/T155057
[22:29:34] (PS1) Chad: Removing old presenation files from wmfwiki docroot [mediawiki-config] - https://gerrit.wikimedia.org/r/331918
[22:29:51] (CR) Reedy: [C: 1] "RIP" [mediawiki-config] - https://gerrit.wikimedia.org/r/331918 (owner: Chad)
[22:31:46] (CR) Chad: [C: 2] Removing old presenation files from wmfwiki docroot [mediawiki-config] - https://gerrit.wikimedia.org/r/331918 (owner: Chad)
[22:33:11] (PS2) Chad: Removing old presentation files from wmfwiki docroot [mediawiki-config] -
https://gerrit.wikimedia.org/r/331918
[22:33:20] apergos: FINE ^
[22:35:08] (CR) Chad: [C: 2] Removing old presentation files from wmfwiki docroot [mediawiki-config] - https://gerrit.wikimedia.org/r/331918 (owner: Chad)
[22:36:39] (Merged) jenkins-bot: Removing old presentation files from wmfwiki docroot [mediawiki-config] - https://gerrit.wikimedia.org/r/331918 (owner: Chad)
[22:38:13] !log demon@tin Synchronized docroot/foundation/presentations: removing some of these powerpoints (duration: 00m 38s)
[22:38:16] (CR) jenkins-bot: Removing old presentation files from wmfwiki docroot [mediawiki-config] - https://gerrit.wikimedia.org/r/331918 (owner: Chad)
[22:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:38:21] (PS1) BearND: admin: update my production ssh key [puppet] - https://gerrit.wikimedia.org/r/331920
[22:39:10] (CR) jerkins-bot: [V: -1] admin: update my production ssh key [puppet] - https://gerrit.wikimedia.org/r/331920 (owner: BearND)
[22:42:41] (PS2) BearND: admin: update my production ssh key [puppet] - https://gerrit.wikimedia.org/r/331920
[22:44:21] (PS1) Chad: Remove last of these powerpoints [mediawiki-config] - https://gerrit.wikimedia.org/r/331922
[22:44:31] (CR) Dzahn: [C: 2] "this key was created while i was sitting next to Bernd at allhands :)" [puppet] - https://gerrit.wikimedia.org/r/331920 (owner: BearND)
[22:46:02] (CR) Chad: [C: 2] Remove last of these powerpoints [mediawiki-config] - https://gerrit.wikimedia.org/r/331922 (owner: Chad)
[22:46:07] (PS1) Papaul: DHCP: Add DHCP entries for mw2251-mw2260 Bug:T155180 [puppet] - https://gerrit.wikimedia.org/r/331923
[22:47:27] (Merged) jenkins-bot: Remove last of these powerpoints [mediawiki-config] - https://gerrit.wikimedia.org/r/331922 (owner: Chad)
[22:47:39] (CR) jenkins-bot: Remove last of these powerpoints [mediawiki-config] -
https://gerrit.wikimedia.org/r/331922 (owner: Chad)
[22:48:06] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:48:56] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy
[22:49:15] !log demon@tin Synchronized docroot/foundation: Yay no more powerpoints (duration: 00m 38s)
[22:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:50:03] wmfwiki docroot almost sane!
[22:52:24] \o/
[22:54:23] (CR) Reedy: "Nearly 500 commits... https://github.com/mysql/mysql-connector-j/compare/5.1.21...5.1.40" [debs/gerrit] - https://gerrit.wikimedia.org/r/331863 (owner: Paladox)
[22:54:26] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:55:08] (CR) Paladox: "Not to mention 6.x lol. That would probably break gerrit." [debs/gerrit] - https://gerrit.wikimedia.org/r/331863 (owner: Paladox)
[22:55:25] (CR) Paladox: "break gerrit as in the 6.x would break it." [debs/gerrit] - https://gerrit.wikimedia.org/r/331863 (owner: Paladox)
[22:56:59] Operations, Ops-Access-Requests: Requesting access to hive/webrequest data for demon - https://phabricator.wikimedia.org/T155198#2937117 (demon)
[22:58:20] (PS1) ArielGlenn: rsync for Erik Zachte from stat* hosts to dataset1001 other/media [puppet] - https://gerrit.wikimedia.org/r/331924
[22:58:35] (PS1) Chad: Grant access to analytics-privatedata-users to demon [puppet] - https://gerrit.wikimedia.org/r/331925 (https://phabricator.wikimedia.org/T155198)
[23:05:32] !log demon@tin Synchronized README: no-op for force co-master sync (duration: 00m 40s)
[23:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:16:15] (CR) Chad: "So, I'm not entirely sure if the copyright file is all correct or not, the licensing here is kinda unclear. Other files look fine."
[debs/gerrit] - https://gerrit.wikimedia.org/r/331873 (owner: Paladox)
[23:16:19] PROBLEM - MariaDB disk space on db1026 is CRITICAL: DISK CRITICAL - free space: /srv 73649 MB (4% inode=99%)
[23:17:16] PROBLEM - Disk space on db1026 is CRITICAL: DISK CRITICAL - free space: /srv 57155 MB (3% inode=99%)
[23:18:05] (CR) Paladox: "I think the files in * are apache version 2 as gerrit is licensed under apache 2.x but debian/* is licensed under gpl 2.0+" [debs/gerrit] - https://gerrit.wikimedia.org/r/331873 (owner: Paladox)
[23:18:09] marostegui: is this worrying ^
[23:18:54] s5 slave
[23:18:54] 'db1026' => 1, # 1.4TB 64GB, watchlist, recentchanges, contributions, logpager
[23:19:16] Yeah, /srv is filling up fast, lost another 1% in like a minute?
[23:19:36] What's in /srv on the db hosts?
[23:20:08] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&c=MySQL+eqiad&h=db1026.eqiad.wmnet&tab=m&vn=&hide-hf=false&m=cpu_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name
[23:20:51] I'm taking a look too
[23:20:59] ta
[23:20:59] marostegui is looking too
[23:22:04] madhuvishy: nice, where are you?
[23:22:35] godog: ventana - middle rows, left side
[23:22:48] but yeah load spiked a few minutes ago
[23:23:26] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[23:28:24] ACKNOWLEDGEMENT - MariaDB disk space on db1026 is CRITICAL: DISK CRITICAL - free space: /srv 74907 MB (5% inode=99%): Marostegui long running query sorting on a temp table
[23:29:37] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0]
[23:30:36] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0]
[23:32:16] RECOVERY - Disk space on db1026 is OK: DISK OK
[23:33:37] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0]
[23:35:57] the fatals seem to be from the jobqueue machines with "Could not wait for replica DBs to catch up to db1049"
[23:36:08] db1049?
[23:36:17] that is the master?
[23:36:20] for db1026
[23:36:29] yeah it is its master
[23:36:36] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0]
[23:37:19] RECOVERY - MariaDB disk space on db1026 is OK: DISK OK
[23:38:09] seems to have subsided now though
[23:38:33] yes, the queries are gone now
[23:38:38] godog: Yeah, all the errors seem to be lag related in MW
[23:38:39] https://tendril.wikimedia.org/host/view/db1026.eqiad.wmnet/3306 show spike in replag - and a drop
[23:38:43] I was looking at https://logstash.wikimedia.org/goto/8b3389188a01d6a60453e1145f08ce15 fwiw
[23:38:43] we'll need to investigate where are they coming from
[23:38:53] (generally speaking, a lot of our errors recently have been lag related :()
[23:39:00] I am going to compress a few tables to get more extra disk space
[23:39:33] Have you got a big enough press?
[23:39:43] looks like it is not the first time these queries appear: https://phabricator.wikimedia.org/T147747
[23:39:57] Reedy: I might need help if am not strong enough!
[23:39:59] ApiQueryContributions is a *terrible* query
[23:40:05] Frequent offender.
[23:44:01] I have thrown 100G to the lv
[23:44:10] And I will start the compression in a bit
[23:45:53] (PS2) Dzahn: DHCP: Add DHCP entries for mw2251-mw2260 Bug:T155180 [puppet] - https://gerrit.wikimedia.org/r/331923 (owner: Papaul)
[23:47:08] (CR) Dzahn: [C: 2] DHCP: Add DHCP entries for mw2251-mw2260 Bug:T155180 [puppet] - https://gerrit.wikimedia.org/r/331923 (owner: Papaul)
[23:50:58] (CR) Dzahn: [C: 2] DNS: Add mgmt and prodcution DNS entres for mw2251-mw2260 Fix: Putting server in alphabetical order Bug:T155180 [dns] - https://gerrit.wikimedia.org/r/331903 (owner: Papaul)
[23:52:35] (CR) Dzahn: "Ok, thanks. i'm hitting "abandon" then." [dns] - https://gerrit.wikimedia.org/r/325856 (owner: Papaul)
[23:52:59] btw: https://phabricator.wikimedia.org/T154929 godog madhuvishy (just in case)
[23:53:13] it shouldn't be needed after the 100G I gave it
[23:53:18] but just in case
[23:53:32] (PS1) Chad: MWMultiversion: Move CLI entry point to class and out of MWVersion [mediawiki-config] - https://gerrit.wikimedia.org/r/331930
[23:55:52] (PS1) Chad: MWMultiVersion: Use proper (new) cli entry point [puppet] - https://gerrit.wikimedia.org/r/331931
[23:56:45] ACKNOWLEDGEMENT - Keyholder SSH agent on mira is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. daniel_zahn currently not the deployment server
[23:57:46] mutante: Shouldn't it be armed though?
[23:58:03] mira's not the current default, but it's a legit master, no reason for it not to run hot & ready
[23:59:28] ostriches: we are talking about that right now
[23:59:42] * ostriches nods
[23:59:42] yes