[00:02:54] Operations, Discovery, Discovery-Search, Elasticsearch: Setup backups of elasticsearch indices - https://phabricator.wikimedia.org/T91404#2651038 (Dzahn)
[00:17:31] Operations, OTRS: clean up unworking otrs email addresses - https://phabricator.wikimedia.org/T84044#2651056 (Dzahn)
[00:17:45] Operations, OTRS, WMF-NDA: clean up unworking otrs email addresses - https://phabricator.wikimedia.org/T84044#2651059 (Krenair)
[00:18:04] Operations, OTRS: clean up unworking otrs email addresses - https://phabricator.wikimedia.org/T84044#921922 (Krenair)
[00:18:34] Operations, OTRS: clean up unworking otrs email addresses - https://phabricator.wikimedia.org/T84044#921922 (Dzahn) This ticket was imported from RT, that's why it got the restriction by default. Removed that.
[00:19:07] Operations, OTRS: clean up unworking otrs email addresses - https://phabricator.wikimedia.org/T84044#2651064 (Dzahn) stalled>Open
[00:19:57] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active, AS1299/IPv6: Active
[00:20:25] hmm.. failing acme-setup on carbon since ~ 2 days
[00:20:38] interesting. that was ok before
[00:23:34] Operations, Domains, Traffic, WMF-Legal: register .wiki gTLD domains - https://phabricator.wikimedia.org/T88873#2651081 (BBlack) I see some refs to this ticket flying around, and recheck on some old DNS commits to add language/project domains under .wiki to our DNS, which were abandoned long ago....
[00:24:58] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 348, down: 2, shutdown: 0
[00:25:10] mutante: is it failing organically, or did some ip/hostname move and precipitate it?
[00:27:16] bblack: organically, it's for apt.wm.org, which was there for a while
[00:27:26] it tries to renew, and eventually this:
[00:27:28] AttributeError: 'module' object has no attribute 'create_default_context'
[00:27:39] started on weekend..
[00:28:26] stack trace?
[00:28:44] though I have a feeling I know where that's from
[00:29:25] https://phabricator.wikimedia.org/P4073
[00:29:31] nginx specific?
[00:30:04] this is from code I added
[00:30:16] it imports the python ssl module and does ctx = ssl.create_default_context()
[00:30:41] Operations, Analytics, Performance-Team, Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2651084 (Neil_P._Quinn_WMF) >>! In T135762#2506770, @ellery wrote: > In order to do statistical testing, you would need to compare the fraction of users who clicked the button...
[00:30:43] mutante, what version of python does carbon have?
[00:31:02] ooold.0 .. hold on
[00:31:10] 2.7.3
[00:31:22] hopefully that won't be a problem much longer
[00:31:52] it's to be replaced with installXXXXX
[00:32:05] but still needs some puppet role work
[00:32:19] looks like we may need python 2.7.9+ for my addition
[00:32:44] can we add a "if > precise" around it for now?
[00:32:55] and a reminder to remove it again when the carbon ticket is resolved
[00:33:40] the code I added was to allow for HTTPS redirects during the acme-challenge check
[00:34:11] specifically redirects to HTTPS using an invalid cert
[00:34:20] aha
[00:41:04] but we could still leave it as it was before, on carbon-only ?
[00:41:51] mutante, yes, I've done some digging and I think earlier versions of python will default to the behaviour we want
[00:41:58] cool!
[00:42:00] I'll upload a patch
[00:42:17] thank you for that redirect fix!:)
[00:42:18] (It's not good behaviour in most circumstances, just in this particular one)
[00:42:23] heh, ok
[00:43:15] !log wtp2019 - down again, powercycled, probably damaged RAM
[00:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:43:43] Release Date: 2014-12-10
[00:43:44] Python 2.7.9 is a bugfix version for the Python 2.7 release series. Python 2.7.9 includes several significant changes unprecedented in a "bugfix" release:
[00:43:44] The entirety of Python 3.4's ssl module has been backported for Python 2.7.9. See PEP 466 for justification.
[00:43:44] HTTPS certificate validation using the system's certificate store is now enabled by default. See PEP 476 for details.
[00:43:50] yeah, should be fine
[00:44:25] ah
[00:45:11] RECOVERY - Host wtp2019 is UP: PING OK - Packet loss = 0%, RTA = 36.51 ms
[00:47:34] Operations, ops-codfw: wtp2019 - hardware (RAM) check - https://phabricator.wikimedia.org/T146113#2651101 (Dzahn)
[00:48:00] (PS1) Alex Monk: letsencrypt: acme_tiny: Skip my ignore-HTTPS-validation code for old versions of python [puppet] - https://gerrit.wikimedia.org/r/311639
[00:48:17] should get brandon to review
[00:48:25] ok, yes
[00:49:38] (PS2) Alex Monk: letsencrypt: acme_tiny: Skip my ignore-HTTPS-validation code for old versions of python [puppet] - https://gerrit.wikimedia.org/r/311639
[00:50:39] Operations, ops-codfw: wtp2019 - hardware (RAM) check - https://phabricator.wikimedia.org/T146113#2651139 (Dzahn) in syslog it just ends in the middle of normal operation and then starts again when it was powered up: 523 Sep 19 04:26:38 wtp2019 puppet-agent[39540]: Retrieving plugin 524 Sep 19 04:26:3...
[00:56:24] ACKNOWLEDGEMENT - puppet last run on carbon is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 21 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[acme-setup-acme-apt] daniel_zahn https://gerrit.wikimedia.org/r/311639
[00:57:31] Operations, Wikimedia-General-or-Unknown, Wikisource: Upgrade Ghostscript to 9.15 or later - https://phabricator.wikimedia.org/T110849#1588033 (Dzahn) appservers are being upgraded to Debian jessie currently (tracking task T143536), but that will mean: 9.06~dfsg-2+deb8u1 so actually before 9.10...
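The version guard discussed between 00:27 and 00:48 could look roughly like the sketch below. It is not the content of Gerrit change 311639, which is not shown in this log; the helper name `unverified_context_or_none` is hypothetical, and the sketch assumes the goal is only to tolerate an invalid certificate on an HTTPS redirect during the acme-challenge check. On interpreters older than 2.7.9, `ssl.create_default_context` does not exist, so the helper returns `None` and the caller keeps the interpreter's default behaviour.

```python
import ssl


def unverified_context_or_none():
    """Return an SSLContext that skips certificate checks, or None.

    Hypothetical helper for illustration only. Python older than 2.7.9
    has no ssl.create_default_context, so probe for the attribute
    instead of calling it unconditionally.
    """
    if not hasattr(ssl, "create_default_context"):
        # Old interpreter: fall back to its default handling.
        return None
    ctx = ssl.create_default_context()
    # check_hostname must be disabled before verify_mode is relaxed,
    # otherwise setting CERT_NONE raises ValueError.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


ctx = unverified_context_or_none()
```

Probing with `hasattr` rather than comparing version numbers keeps the check tied to the feature actually needed, which also covers distributions that backport the newer ssl module.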
[01:02:06] (PS1) RobH: mail.wikimedia.org cert expires on Thursday 2016-09-22 [puppet] - https://gerrit.wikimedia.org/r/311641 (https://phabricator.wikimedia.org/T144568)
[01:05:06] Operations, Mail, Patch-For-Review: mx1001/2001 - Exim SMTP - Certificate expires Sep 22 2016 - https://phabricator.wikimedia.org/T144568#2651167 (RobH) a:faidon The certificate file is in patchset https://gerrit.wikimedia.org/r/311641 and the new.mail.wikimedia.org.key file in the private repo....
[01:12:49] Operations, Deployment-Systems, Patch-For-Review: Make l10nupdate user a system user - https://phabricator.wikimedia.org/T120585#2651178 (Dzahn) The last comment "Definitely better than hardcoding uids in the puppet tree." sounds like this ticket might be rejected?
[01:14:58] Operations, Mail: status of wikigroup@ alias - https://phabricator.wikimedia.org/T127551#2651179 (Dzahn) @bbogaert wanna try this again and add wikigroup@ to Doreen in Google?
[01:15:40] Operations, Mail: status of fdcsupport@ ? - https://phabricator.wikimedia.org/T127548#2651180 (Dzahn) @bbogaert maybe we can get back to to this if you have the time?
[01:17:39] Operations, Discovery, Discovery-Search, Elasticsearch: Setup backups of elasticsearch indices - https://phabricator.wikimedia.org/T91404#1082103 (EBernhardson) It's not quite exactly the same, but we have exports of the elasticsearch indices now at dumps.wikimedia.org. We can restore from the du...
[01:19:03] PROBLEM - puppet last run on achernar is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:31:13] (PS1) Yuvipanda: labs: Add config for the puppet enc [puppet] - https://gerrit.wikimedia.org/r/311643
[01:32:44] (CR) Yuvipanda: [C: 2] labs: Add config for the puppet enc [puppet] - https://gerrit.wikimedia.org/r/311643 (owner: Yuvipanda)
[01:37:05] (PS1) Yuvipanda: labs: Make puppet-enc read from the config file [puppet] - https://gerrit.wikimedia.org/r/311644
[01:38:55] (CR) Yuvipanda: [C: 2] labs: Make puppet-enc read from the config file [puppet] - https://gerrit.wikimedia.org/r/311644 (owner: Yuvipanda)
[01:39:35] PROBLEM - puppet last run on dbstore2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:44:17] RECOVERY - puppet last run on achernar is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[02:04:40] RECOVERY - puppet last run on dbstore2001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[02:28:04] PROBLEM - puppet last run on mw2209 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apache2/conf-available/50-server-status.conf]
[02:33:25] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.18) (duration: 10m 35s)
[02:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:34:12] Operations, ops-requests: Apache RewriteRule for Extension:ShortUrl deployment - https://phabricator.wikimedia.org/T80309#2651245 (Dzahn)
[02:40:17] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Sep 20 02:40:17 UTC 2016 (duration 6m 52s)
[02:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:51:27] PROBLEM - puppet last run on mw2110 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:52:59] RECOVERY - puppet last run on mw2209 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[02:59:02] Operations: Patch bugzilla to restrict editbug permission - https://phabricator.wikimedia.org/T80310#2651258 (Dzahn)
[02:59:58] (PS1) Alex Monk: Follow-up Ifa2cc187: Add ShortUrl support on wikimedia.org docroot sites [puppet] - https://gerrit.wikimedia.org/r/311647 (https://phabricator.wikimedia.org/T146014)
[03:03:23] (PS1) Alex Monk: Replace repeated UseMod rewrites in apache config with existing include [puppet] - https://gerrit.wikimedia.org/r/311648
[03:04:50] PROBLEM - puppet last run on db2016 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ferm/conf.d/00_main]
[03:16:34] RECOVERY - puppet last run on mw2110 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[03:29:49] RECOVERY - puppet last run on db2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:39:00] PROBLEM - puppet last run on ms-be1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:42:34] Operations, ops-requests: Aggregate nginx/squid/apache logs from payments cluster to silicon - https://phabricator.wikimedia.org/T80312#2651312 (Dzahn)
[03:42:57] Operations, Phabricator: phabricator: can't search for RT tickets (reference field) anymore - https://phabricator.wikimedia.org/T146116#2651315 (Dzahn)
[03:44:17] Operations: Set up prototype Wikimedia blog in Labs - https://phabricator.wikimedia.org/T80313#2651316 (Dzahn)
[03:44:38] Operations, Phabricator: phabricator: can't search for RT tickets (reference field) anymore - https://phabricator.wikimedia.org/T146116#2651265 (Peachey88) Was this something we added as a custom, or is it a #phabricator-upstream issue?
[03:44:49] Operations: process monitoring for aluminium/grosley - https://phabricator.wikimedia.org/T80314#2651321 (Dzahn)
[03:48:02] exit
[04:04:05] RECOVERY - puppet last run on ms-be1011 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[05:05:28] Operations, Phabricator: phabricator: can't search for RT tickets (reference field) anymore - https://phabricator.wikimedia.org/T146116#2651265 (mmodell) It's a custom field but it's configured to be searchable, so I'm not sure why it isn't showing up in advanced search...
[05:06:51] Operations, Mail: vfowler@wikimedia.org sending bounceback - https://phabricator.wikimedia.org/T146036#2651359 (Peachey88)
[06:04:57] (PS1) Marostegui: db-eqiad.php: Repool db1062 [mediawiki-config] - https://gerrit.wikimedia.org/r/311655 (https://phabricator.wikimedia.org/T141951)
[06:10:01] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:14:00] (PS1) MarcoAurelio: Enable Extension:ShortURL on bd.wikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/311656 (https://phabricator.wikimedia.org/T146014)
[06:37:51] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:39:14] Operations, Phabricator: phabricator: can't search for RT tickets (reference field) anymore - https://phabricator.wikimedia.org/T146116#2651265 (Paladox) I guess we may want to update the extensions repo to extend search to re add that field.
[06:59:41] PROBLEM - puppet last run on elastic2005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata]
[07:02:20] Operations, Mail: vfowler@wikimedia.org sending bounceback - https://phabricator.wikimedia.org/T146036#2651427 (akosiaris) With T146036 resolved, vfowler@wikimedia.org is now populated correctly at dubnium. @JGulingan, could you please test that everything works fine and resolve if yes?
[07:04:34] Operations, Analytics, Performance-Team, Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2651429 (ellery) @Neil_P._Quinn_WMF I'm saying that for any online AB test you to be able to group the experimental data by user. The proposed framework does not provide a mec...
[07:06:02] tzdata ?
[07:06:13] (PS1) Giuseppe Lavagetto: puppetmaster: add proxy-initial-not-pooled to frontends [puppet] - https://gerrit.wikimedia.org/r/311659
[07:06:40] <_joe_> akosiaris: ^^
[07:08:23] E: Problem renaming the file /var/cache/apt/pkgcache.bin.2eYdGE to /var/cache/apt/pkgcache.bin - rename (2: No such file or directory)
[07:08:45] Operations, Mail: Add yubikey attribute to production ldap - https://phabricator.wikimedia.org/T146102#2651430 (MoritzMuehlenhoff) We shouldn't extend the LDAP schema that way; the core schemas (like organizationalPerson) are standards which are not modified, if we want to use an additional attribute we...
[07:08:54] <_joe_> akosiaris: ignorance is bliss, sometimes
[07:10:53] _joe_: you can say that again...
[07:11:02] I have no idea what happened there...
[07:14:50] RECOVERY - puppet last run on elastic2005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:16:52] Operations, Analytics, Performance-Team, Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2651433 (ellery) @Nuria I certainly don't disagree that segmentation must be done at the user level. I'm saying that the test statistics (or metrics as you are calling them)...
[07:19:03] (CR) Alexandros Kosiaris: [C: 1] puppetmaster: add proxy-initial-not-pooled to frontends [puppet] - https://gerrit.wikimedia.org/r/311659 (owner: Giuseppe Lavagetto)
[07:20:07] (CR) Giuseppe Lavagetto: [C: 2] puppetmaster: add proxy-initial-not-pooled to frontends [puppet] - https://gerrit.wikimedia.org/r/311659 (owner: Giuseppe Lavagetto)
[07:25:54] PROBLEM - puppet last run on rcs1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:27:01] PROBLEM - puppet last run on db1066 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:27:51] PROBLEM - puppet last run on helium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:30:40] PROBLEM - puppet last run on ms-be1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:30:50] RECOVERY - puppet last run on ms-be3003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:32:22] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:33:11] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnish]
[07:33:43] RECOVERY - puppet last run on rcs1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:34:42] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:34:51] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:35:10] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:35:41] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:36:02] PROBLEM - puppet last run on labsdb1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:36:15] !log restart cassandra on aqs100[456] for T130861 - only aqs1004 is taking live traffic
[07:36:17] T130861: Investigate and implement possible simplification of Cassandra Logstash filtering - https://phabricator.wikimedia.org/T130861
[07:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:37:12] Operations, Mail, LDAP: Add yubikey attribute to production ldap - https://phabricator.wikimedia.org/T146102#2651456 (Peachey88)
[07:37:52] PROBLEM - puppet last run on labstore1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:39:02] PROBLEM - puppet last run on potassium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:39:41] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:40:01] PROBLEM - puppet last run on fluorine is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:40:10] RECOVERY - puppet last run on labsdb1006 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[07:40:23] RECOVERY - puppet last run on labstore1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:44:02] RECOVERY - puppet last run on potassium is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[07:44:50] RECOVERY - puppet last run on db1066 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[07:44:52] RECOVERY - puppet last run on fluorine is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:45:40] RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:47:12] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[07:48:31] RECOVERY - puppet last run on labsdb1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:53:03] RECOVERY - puppet last run on ms-be1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:53:11] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[07:53:21] PROBLEM - puppet last run on ms-be3003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[parted-/dev/sdl]
[07:56:30] ema: Sep 20 07:26:51 cp3036 varnishd[82167]: Error: (-sfile) allocation error: No space left on device
[07:56:53] that was taken from sudo journalctl -ru varnish
[07:56:55] akosiaris: thanks, looking
[07:57:18] -sfile does indeed seem to have issues -spersistent did not
[07:58:16] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:00:09] this looks like the varnish-backend-restart race condition with puppet we noticed yesterday
[08:00:20] 26 7 * * * root /usr/local/sbin/varnish-backend-restart > /dev/null
[08:00:24] 26,56 * * * * root /usr/local/sbin/puppet-run > /dev/null 2>&1
[08:02:41] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:07:59] ema: I might be able to find some time in the afternoon for it
[08:08:51] volans: great, I'll start working on it now and keep you posted
[08:09:05] ok
[08:13:41] Operations, Monitoring, Release-Engineering-Team, Wikimedia-Incident: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#2651497 (hashar)
[08:15:56] (PS1) Alexandros Kosiaris: elasticsearch: Remove id_hash_mod.groovy [puppet] - https://gerrit.wikimedia.org/r/311665
[08:18:55] (CR) DCausse: [C: 1] elasticsearch: Remove id_hash_mod.groovy [puppet] - https://gerrit.wikimedia.org/r/311665 (owner: Alexandros Kosiaris)
[08:24:45] PROBLEM - puppet last run on mw2150 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:26:24] (CR) ArielGlenn: More error logging/ sanity checks for dumpwikidata (2 comments) [puppet] - https://gerrit.wikimedia.org/r/311551 (owner: Hoo man)
[08:27:14] Operations, Editing-Department, Monitoring, Release-Engineering-Team, Wikimedia-Incident: High failure rate of account creation should trigger an alarm / page people - https://phabricator.wikimedia.org/T146090#2651523 (hashar) The graph from https://grafana.wikimedia.org/dashboard/db/authenti...
[08:28:09] Operations, Performance-Team, Thumbor: Make the 100MB+ test files downloaded from their source instead of being in the git repo - https://phabricator.wikimedia.org/T145785#2651524 (Gilles)
[08:28:12] Operations, Performance-Team, Thumbor: Thumbor can't load source files bigger than 100MB - https://phabricator.wikimedia.org/T145768#2651525 (Gilles)
[08:33:31] Operations, Monitoring, Performance-Team, Release-Engineering-Team, Wikimedia-Incident: MediaWiki load time regression should trigger an alarm / page people - https://phabricator.wikimedia.org/T146125#2651529 (hashar)
[08:34:32] Operations, Monitoring, Release-Engineering-Team, Tracking, Wikimedia-Incident: Tracking: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942#2481685 (hashar)
[08:35:05] (CR) ArielGlenn: [C: 1] "all the bits look right." [puppet] - https://gerrit.wikimedia.org/r/311473 (https://phabricator.wikimedia.org/T145788) (owner: Dzahn)
[08:35:50] (CR) ArielGlenn: [C: 1] admin: add samwalton9 to researchers [puppet] - https://gerrit.wikimedia.org/r/311480 (https://phabricator.wikimedia.org/T145788) (owner: Dzahn)
[08:36:56] Operations, Monitoring, Performance-Team, Release-Engineering-Team, Wikimedia-Incident: MediaWiki load time regression should trigger an alarm / page people - https://phabricator.wikimedia.org/T146125#2651529 (hashar)
[08:38:04] Operations, Performance-Team, Thumbor, Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2651580 (Gilles)
[08:38:07] Operations, Performance-Team, Thumbor: Report more metrics with statsd - https://phabricator.wikimedia.org/T145784#2651578 (Gilles) Open>Invalid Duh, I'm already reporting processing time and utime.
[08:40:11] (PS1) Gilles: Upgrade to 0.1.21 [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/311668
[08:41:15] Operations: Reboot snapshot servers - https://phabricator.wikimedia.org/T146127#2651587 (MoritzMuehlenhoff)
[08:42:18] (PS2) Gilles: Upgrade to 0.1.21 [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/311668
[08:43:17] (CR) ArielGlenn: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/311482 (https://phabricator.wikimedia.org/T145914) (owner: Dzahn)
[08:43:29] Operations, Editing-Department, Monitoring, Release-Engineering-Team, Wikimedia-Incident: High failure rate of account creation should trigger an alarm / page people - https://phabricator.wikimedia.org/T146090#2651615 (Tgr) centrallogin is not interesting, it can be added to web or just ignor...
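The two crontab entries pasted at 08:00:20 and 08:00:24 can be checked for a shared firing time with a short script. This is a sketch only: `parse_field` and `fire_times` are hypothetical helpers that understand just `*` and comma-separated values in the minute and hour fields, which is all these two entries use, and they ignore the day, month and weekday fields.

```python
def parse_field(field, lo, hi):
    """Expand one crontab field into the set of values it matches.

    Only '*' and comma-separated integers are handled (an assumption
    that is sufficient for the two entries quoted in the log).
    """
    if field == "*":
        return set(range(lo, hi + 1))
    return {int(part) for part in field.split(",")}


def fire_times(entry):
    """Return the (hour, minute) pairs at which a crontab entry fires."""
    minute, hour = entry.split()[:2]
    return {
        (h, m)
        for h in parse_field(hour, 0, 23)
        for m in parse_field(minute, 0, 59)
    }


restart = "26 7 * * * root /usr/local/sbin/varnish-backend-restart > /dev/null"
puppet = "26,56 * * * * root /usr/local/sbin/puppet-run > /dev/null 2>&1"

overlap = sorted(fire_times(restart) & fire_times(puppet))
print(overlap)  # [(7, 26)]
```

Both jobs start at 07:26, which is consistent with the varnishd error on cp3036 timestamped 07:26:51 in the journal line quoted at 07:56:30.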
[08:46:55] (CR) Gilles: [C: -1] "the new test_pdf2 fails during the debian build" [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/311668 (owner: Gilles)
[08:48:44] Operations, Datasets-General-or-Unknown: Reboot snapshot servers - https://phabricator.wikimedia.org/T146127#2651636 (ArielGlenn) p:Triage>Normal
[08:50:58] RECOVERY - puppet last run on mw2150 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:54:54] (PS3) Gilles: Upgrade to 0.1.21 [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/311668
[08:55:58] !log reimaging mw1243-mw1245 to jessie
[08:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:57:20] (PS1) Gilles: Add Thumbor config values moved out of package [puppet] - https://gerrit.wikimedia.org/r/311670
[08:59:28] (PS2) Gilles: Add Thumbor config values moved out of package [puppet] - https://gerrit.wikimedia.org/r/311670
[09:09:27] (CR) ArielGlenn: "excerpt from novaenv.sh on labcontrol1001:" [puppet] - https://gerrit.wikimedia.org/r/309709 (https://phabricator.wikimedia.org/T123607) (owner: Alex Monk)
[09:09:49] (PS1) Ema: base: add run-no-puppet [puppet] - https://gerrit.wikimedia.org/r/311671
[09:10:45] Operations, DBA, Patch-For-Review: Decomission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#2266762 (Marostegui) As per our chat, db1019 can be decommissioned
[09:12:02] (CR) Ema: "The name sucks and I'm open for suggestions on that front as well. I wanted to call it pudo but couldn't find a justification for the 'u'." [puppet] - https://gerrit.wikimedia.org/r/311671 (owner: Ema)
[09:13:34] !log reimaging API servers mw1192/mw1193 to jessie
[09:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:23:11] PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 6 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[09:23:41] PROBLEM - puppet last run on mw2101 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:26:08] Operations, Beta-Cluster-Infrastructure, HHVM, Patch-For-Review: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2651679 (elukey) As @AlexMonk-WMF reported, we caused an issue when dealing with restbase configs: https://phabricator.wikimedia.org/T146053 As follo...
[09:28:26] (PS2) Ema: varnish: add varnish-fe restart script [puppet] - https://gerrit.wikimedia.org/r/311387
[09:29:15] (CR) Ema: "> Should depool the nginx service, too" [puppet] - https://gerrit.wikimedia.org/r/311387 (owner: Ema)
[09:35:48] (PS2) Ema: base: add run-no-puppet [puppet] - https://gerrit.wikimedia.org/r/311671
[09:38:32] Operations, Phabricator: phabricator: can't search for RT tickets (reference field) anymore - https://phabricator.wikimedia.org/T146116#2651697 (Aklapper) `maniphest.custom-field-definitions` and `maniphest.fields` in the config have `external_reference`. @Paladox: Can you please elaborate where exactly...
[09:39:10] RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 4 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[09:41:39] Operations, Ops-Access-Requests, Patch-For-Review: access request for debt on stat1003, stat1002, and fluorine for Deborah Tankersley - https://phabricator.wikimedia.org/T145914#2644948 (elukey) Hi @debt, would you mind to add a bit more details about your use case? This will help us a lot narrowing...
[09:46:47] (CR) Elukey: [C: 1] admin: add samwalton9 to researchers [puppet] - https://gerrit.wikimedia.org/r/311480 (https://phabricator.wikimedia.org/T145788) (owner: Dzahn)
[09:48:59] RECOVERY - puppet last run on mw2101 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:50:41] Operations, Phabricator: phabricator: can't search for RT tickets (reference field) anymore - https://phabricator.wikimedia.org/T146116#2651722 (Paladox) @Aklapper hi I essentially sure how they added this in the first place so I thought by extending the advanced search on the manifest we could we add this.
[09:53:13] (PS2) Giuseppe Lavagetto: puppetmaster: Ping before sending requests to backend [puppet] - https://gerrit.wikimedia.org/r/311457 (owner: Alexandros Kosiaris)
[09:57:03] !log upgrading ganeti2002 to Linux 4.4
[09:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:02:14] (CR) Giuseppe Lavagetto: [C: 2] puppetmaster: Ping before sending requests to backend [puppet] - https://gerrit.wikimedia.org/r/311457 (owner: Alexandros Kosiaris)
[10:04:42] (CR) Giuseppe Lavagetto: [C: 2] Enable flake8 on Python 3 [software/service-checker] - https://gerrit.wikimedia.org/r/307895 (owner: Legoktm)
[10:05:21] (CR) Giuseppe Lavagetto: [C: 2] Run tests on Python 3.4 and 3.5 [software/service-checker] - https://gerrit.wikimedia.org/r/307896 (owner: Legoktm)
[10:08:29] (CR) Giuseppe Lavagetto: [C: 2] Properly support 'basePath' [software/service-checker] - https://gerrit.wikimedia.org/r/307910 (owner: Legoktm)
[10:09:13] (CR) Giuseppe Lavagetto: [C: 2] Fix documentation in README [software/service-checker] - https://gerrit.wikimedia.org/r/307911 (owner: Legoktm)
[10:09:30] (Merged) jenkins-bot: Enable flake8 on Python 3 [software/service-checker] - https://gerrit.wikimedia.org/r/307895 (owner: Legoktm)
[10:09:33] (Merged) jenkins-bot: Run tests on Python 3.4 and 3.5 [software/service-checker] - https://gerrit.wikimedia.org/r/307896 (owner: Legoktm)
[10:09:34] (Merged) jenkins-bot: Properly support 'basePath' [software/service-checker] - https://gerrit.wikimedia.org/r/307910 (owner: Legoktm)
[10:10:01] (Merged) jenkins-bot: Fix documentation in README [software/service-checker] - https://gerrit.wikimedia.org/r/307911 (owner: Legoktm)
[10:12:18] !log change-prop deploying e1ef51e
[10:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:13:52] (CR) Giuseppe Lavagetto: [C: -1] "While logging is sorely needed, I would tweak this patch a bit." (4 comments) [software/service-checker] - https://gerrit.wikimedia.org/r/308019 (owner: Legoktm)
[10:15:32] !log upgrading ganeti2003 to Linux 4.4
[10:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:16:09] (CR) Faidon Liambotis: [C: -1] "The current certificate has SANs (DNS:mail.wikimedia.org, DNS:mx1001.wikimedia.org, DNS:mx1002.wikimedia.org, DNS:mx2001.wikimedia.org, DN" [puppet] - https://gerrit.wikimedia.org/r/311641 (https://phabricator.wikimedia.org/T144568) (owner: RobH)
[10:16:47] Operations, Mail, Patch-For-Review: mx1001/2001 - Exim SMTP - Certificate expires Sep 22 2016 - https://phabricator.wikimedia.org/T144568#2651779 (faidon) a:faidon>RobH This was issued incorrectly, see my comments on Gerrit.
[10:20:34] (CR) Giuseppe Lavagetto: [C: 2] Allow the output to be in YAML format [software/conftool] - https://gerrit.wikimedia.org/r/288632 (owner: Mobrovac)
[10:21:56] (PS3) Giuseppe Lavagetto: Allow service-checker to read YAML-formatted specs [software/service-checker] - https://gerrit.wikimedia.org/r/306707 (https://phabricator.wikimedia.org/T136839) (owner: Mobrovac)
[10:22:24] (CR) Giuseppe Lavagetto: [C: 2 V: 2] Allow service-checker to read YAML-formatted specs [software/service-checker] - https://gerrit.wikimedia.org/r/306707 (https://phabricator.wikimedia.org/T136839) (owner: Mobrovac)
[10:23:17] (Merged) jenkins-bot: Allow service-checker to read YAML-formatted specs [software/service-checker] - https://gerrit.wikimedia.org/r/306707 (https://phabricator.wikimedia.org/T136839) (owner: Mobrovac)
[10:25:11] !log deploying schema change on s1 hosts T139090
[10:25:12] T139090: Deploy I2b042685 to all databases - https://phabricator.wikimedia.org/T139090
[10:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:25:56] ^marostegui
[10:26:55] it will likely complain of lag on dbstores/labs (toku)
[10:27:10] but not until tomorrow
[10:28:26] !log force mw2232 to use palladium for report handler testing
[10:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:32:04] (PS2) Giuseppe Lavagetto: Add schema support [software/conftool] - https://gerrit.wikimedia.org/r/288881
[10:32:06] (PS2) Giuseppe Lavagetto: Generalize entities definitions [software/conftool] - https://gerrit.wikimedia.org/r/288609
[10:33:02] (CR) jenkins-bot: [V: -1] Generalize entities definitions [software/conftool] - https://gerrit.wikimedia.org/r/288609 (owner: Giuseppe Lavagetto)
[10:33:19] (CR) jenkins-bot: [V: -1] Add schema support [software/conftool] - https://gerrit.wikimedia.org/r/288881 (owner: Giuseppe Lavagetto)
[10:33:26] Operations: Migrate Graphana dashboard "labs-project-board" from prod to labs Graphana instance - https://phabricator.wikimedia.org/T146136#2651804 (hashar)
[10:33:28] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:34:12] <_joe_> hashar: see https://integration.wikimedia.org/ci/job/tox-jessie/11577/console - conftool flake8 going crazy
[10:36:23] !log upgrading ganeti2004 to Linux 4.4
[10:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:37:05] (PS2) Marostegui: db-eqiad.php: Repool db1062 [mediawiki-config] - https://gerrit.wikimedia.org/r/311655 (https://phabricator.wikimedia.org/T141951)
[10:37:54] (PS1) Giuseppe Lavagetto: Have confctl exit with status code 1 if one action fails. [software/conftool] - https://gerrit.wikimedia.org/r/311678
[10:38:13] (CR) Jcrespo: [C: 1] db-eqiad.php: Repool db1062 [mediawiki-config] - https://gerrit.wikimedia.org/r/311655 (https://phabricator.wikimedia.org/T141951) (owner: Marostegui)
[10:38:52] (CR) jenkins-bot: [V: -1] Have confctl exit with status code 1 if one action fails. [software/conftool] - https://gerrit.wikimedia.org/r/311678 (owner: Giuseppe Lavagetto)
[10:39:40] PROBLEM - puppet last run on restbase-test2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:39:58] PROBLEM - puppet last run on mw2104 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:40:24] (PS1) Jcrespo: packages_wmf: force custom handling of mysqld_safe [puppet/mariadb] - https://gerrit.wikimedia.org/r/311679 (https://phabricator.wikimedia.org/T145378)
[10:41:20] (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1062 [mediawiki-config] - https://gerrit.wikimedia.org/r/311655 (https://phabricator.wikimedia.org/T141951) (owner: Marostegui)
[10:41:53] (Merged) jenkins-bot: db-eqiad.php: Repool db1062 [mediawiki-config] - https://gerrit.wikimedia.org/r/311655 (https://phabricator.wikimedia.org/T141951) (owner: Marostegui)
[10:43:31] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: (no message) (duration: 00m 48s)
[10:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:44:31] !log reimaging app servers mw1240-mw1242 and API servers mw1194/mw1195 to jessie
[10:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:49:11] !log upgrading ganeti2005 to Linux 4.4
[10:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:54:15] Operations, Mail: Delivery failed to eng-admin - https://phabricator.wikimedia.org/T145800#2651842 (faidon) @MoritzMuehlenhoff / @bbogaert, what's the OID that you plan on using for the Yubikey attribute? Is there one publically available already or should we create one? If so, I believe that we do have...
[10:56:51] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[10:57:52] PROBLEM - puppet last run on mw2231 is CRITICAL: CRITICAL: Puppet has 13 failures. Last run 6 minutes ago with 13 failures.
Failed resources (up to 3 shown): File[/usr/share/GeoIP],File[/usr/local/bin/etcd-manage],File[/usr/local/bin/furl],File[/etc/rsyslog.d] [11:00:01] 06Operations, 10Mail: Delivery failed to eng-admin - https://phabricator.wikimedia.org/T145800#2651843 (10MoritzMuehlenhoff) See my comments at https://phabricator.wikimedia.org/T146102#2651430 The yubikey attribute is not from a generic yubico schema, but was picked by Byron when adding yubico-pam for the VP... [11:03:24] RECOVERY - puppet last run on restbase-test2001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [11:05:21] !log upgrading ganeti2006 to Linux 4.4 [11:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:06:37] (03PS6) 10Elukey: Improve resilience during varnish (re)starts [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/311415 (https://phabricator.wikimedia.org/T138747) [11:06:44] RECOVERY - puppet last run on mw2104 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:09:05] (03PS1) 10Marostegui: db-eqiad.php: Temporarily depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311680 (https://phabricator.wikimedia.org/T141951) [11:11:26] 06Operations: Cronspam from terbium - https://phabricator.wikimedia.org/T145360#2651863 (10jcrespo) @Andrew @dpatrick do you think it is wise that labtestweb2001 had production passwords? 
[11:12:52] (03PS1) 10Elukey: Add jobrunner02 to deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/311681 (https://phabricator.wikimedia.org/T144006) [11:14:32] (03CR) 10Hashar: [C: 031] Add jobrunner02 to deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/311681 (https://phabricator.wikimedia.org/T144006) (owner: 10Elukey) [11:16:02] (03CR) 10Elukey: [C: 032] Add jobrunner02 to deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/311681 (https://phabricator.wikimedia.org/T144006) (owner: 10Elukey) [11:18:16] (03CR) 10Jcrespo: db-eqiad.php: Temporarily depool db1079 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311680 (https://phabricator.wikimedia.org/T141951) (owner: 10Marostegui) [11:20:41] (03CR) 10Jcrespo: [C: 032] packages_wmf: force custom handling of mysqld_safe [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/311679 (https://phabricator.wikimedia.org/T145378) (owner: 10Jcrespo) [11:22:03] PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 6 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [11:22:21] (03PS2) 10Marostegui: db-eqiad.php: Temporarily depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311680 (https://phabricator.wikimedia.org/T141951) [11:22:24] RECOVERY - puppet last run on mw2231 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:25:57] (03PS1) 10Jcrespo: mariadb: implement custom mysqld_safe script to all servers [puppet] - 10https://gerrit.wikimedia.org/r/311682 (https://phabricator.wikimedia.org/T145378) [11:27:13] RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 4 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [11:28:19] (03PS2) 10Jcrespo: mariadb: implement custom mysqld_safe script to all servers [puppet] - 10https://gerrit.wikimedia.org/r/311682 (https://phabricator.wikimedia.org/T145378) [11:33:47] (03CR) 10Jcrespo: db-eqiad.php: Temporarily depool db1079 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311680 (https://phabricator.wikimedia.org/T141951) (owner: 10Marostegui) [11:34:49] (03PS3) 10Marostegui: db-eqiad.php: Temporarily depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311680 (https://phabricator.wikimedia.org/T141951) [11:38:04] (03CR) 10Jcrespo: [C: 031] db-eqiad.php: Temporarily depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311680 (https://phabricator.wikimedia.org/T141951) (owner: 10Marostegui) [11:38:27] !log upgrading ganeti2001 to Linux 4.4 (ganeti2006 has been promoted to new master node) [11:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:38:48] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Temporarily depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311680 (https://phabricator.wikimedia.org/T141951) (owner: 10Marostegui) [11:39:14] (03Merged) 10jenkins-bot: db-eqiad.php: Temporarily depool db1079 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311680 (https://phabricator.wikimedia.org/T141951) (owner: 10Marostegui) [11:39:49] (03PS2) 10Giuseppe Lavagetto: Have confctl exit with status code 1 if one action fails. [software/conftool] - 10https://gerrit.wikimedia.org/r/311678 [11:40:11] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/4127/" [puppet] - 10https://gerrit.wikimedia.org/r/311682 (https://phabricator.wikimedia.org/T145378) (owner: 10Jcrespo) [11:40:46] (03CR) 10jenkins-bot: [V: 04-1] Have confctl exit with status code 1 if one action fails. 
[software/conftool] - 10https://gerrit.wikimedia.org/r/311678 (owner: 10Giuseppe Lavagetto) [11:41:13] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: (no message) (duration: 00m 46s) [11:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:48:12] <_joe_> hashar: again, flake8 testing all files in site-packages [11:48:40] <_joe_> https://integration.wikimedia.org/ci/job/tox-jessie/11579/console [11:49:35] 06Operations: Cronspam from terbium - https://phabricator.wikimedia.org/T145360#2651883 (10Krenair) Which production passwords are you referring to? [11:57:07] (03CR) 10Alex Monk: "Ah, keystone v3." [puppet] - 10https://gerrit.wikimedia.org/r/309709 (https://phabricator.wikimedia.org/T123607) (owner: 10Alex Monk) [11:57:35] (03CR) 10Marostegui: [C: 031] "So far all the tests we have done in a couple of servers looked good, so I am happy with this change!" [puppet] - 10https://gerrit.wikimedia.org/r/311682 (https://phabricator.wikimedia.org/T145378) (owner: 10Jcrespo) [11:59:09] (03PS2) 10Alex Monk: openstack: Update monitor_labs_salt_keys.py for new Nova API version [puppet] - 10https://gerrit.wikimedia.org/r/309709 (https://phabricator.wikimedia.org/T123607) [12:00:40] 06Operations, 10Beta-Cluster-Infrastructure, 07HHVM, 13Patch-For-Review: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2651889 (10AlexMonk-WMF) >>! In T144006#2651679, @elukey wrote: > As @AlexMonk-WMF reported, we caused an issue when dealing with restbase configs: http... [12:01:50] (03PS2) 10Bmansurov: Blacklist minerva from showing Related Articles in the footer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311197 (https://phabricator.wikimedia.org/T144912) [12:03:28] 06Operations, 10Beta-Cluster-Infrastructure, 07HHVM, 13Patch-For-Review: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2651892 (10elukey) >>! In T144006#2651889, @AlexMonk-WMF wrote: >>>! 
In T144006#2651679, @elukey wrote: >> As @AlexMonk-WMF reported, we caused an issue... [12:06:08] (03PS1) 10Marostegui: db-eqiad.php: Temporarily depool db1086. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/311685 (https://phabricator.wikimedia.org/T141951) [12:07:44] (03PS1) 10Hashar: package_builder: doc that gbp must be passed -sa [puppet] - 10https://gerrit.wikimedia.org/r/311686 (https://phabricator.wikimedia.org/T145797) [12:10:15] (03CR) 10Hashar: "That is from last week when both Alexandros and Elukey noticed that my .changes were missing the original tarball. Got it fixed definitel" [puppet] - 10https://gerrit.wikimedia.org/r/311686 (https://phabricator.wikimedia.org/T145797) (owner: 10Hashar) [12:13:30] 06Operations, 10Continuous-Integration-Infrastructure, 06Labs, 07Nodepool: Upgrade Nodepool to 0.1.1-wmf5 to reduce requests made to OpenStack API - https://phabricator.wikimedia.org/T145142#2651917 (10hashar) I have refreshed the package on https://people.wikimedia.org/~hashar/debs/nodepool_0.1.1-wmf5/ .... [12:21:13] PROBLEM - puppet last run on mw2113 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:27:58] (03CR) 10Jcrespo: [C: 032] mariadb: implement custom mysqld_safe script to all servers [puppet] - 10https://gerrit.wikimedia.org/r/311682 (https://phabricator.wikimedia.org/T145378) (owner: 10Jcrespo) [12:32:26] 06Operations, 06Performance-Team, 10Thumbor: thumbor imagemagick filling up /tmp on thumbor1002 - https://phabricator.wikimedia.org/T145878#2651946 (10fgiunchedi) @gilles no I don't know if the processes were going on for more than a minute, though I'm not sure those ffmpeg are the root cause. Judging by th... 
[12:32:40] (03PS2) 10Muehlenhoff: Always refresh Package/Release/Translation files [puppet] - 10https://gerrit.wikimedia.org/r/311392 [12:34:54] (03PS1) 10BBlack: nginx (1.11.4-1+wmf1) experimental; urgency=medium [software/nginx] (wmf-1.11.4) - 10https://gerrit.wikimedia.org/r/311689 [12:40:15] 06Operations, 10DBA: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2651952 (10Marostegui) This server has been running for 24h with no issues so far, reported. I would like to pool it in tomorrow with some weight (not much) to see how it starts coping with production traffic. Any though... [12:40:48] 06Operations, 10DBA: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2651953 (10jcrespo) +1 [12:45:40] PROBLEM - puppet last run on db1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:45:48] 06Operations: Cronspam from terbium - https://phabricator.wikimedia.org/T145360#2651955 (10elukey) >>! In T145360#2651883, @Krenair wrote: > Which production passwords are you referring to? >>! In T145360#2627669, @jcrespo wrote: > I am not sure wikitech should be reachable by terbium maintenance, and less by... 
[12:46:10] (03PS3) 10BBlack: letsencrypt: acme_tiny: Skip my ignore-HTTPS-validation code for old versions of python [puppet] - 10https://gerrit.wikimedia.org/r/311639 (owner: 10Alex Monk) [12:46:19] (03CR) 10BBlack: [C: 032 V: 032] letsencrypt: acme_tiny: Skip my ignore-HTTPS-validation code for old versions of python [puppet] - 10https://gerrit.wikimedia.org/r/311639 (owner: 10Alex Monk) [12:47:10] RECOVERY - puppet last run on mw2113 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:47:37] 06Operations: Cronspam from terbium - https://phabricator.wikimedia.org/T145360#2651958 (10Krenair) Oh, so you want to change the puppet manifests around so silver can run those jobs for itself, and then change the mysql password and have it use a different one of those too? [12:48:38] 06Operations: Cronspam from terbium - https://phabricator.wikimedia.org/T145360#2651959 (10jcrespo) >>>! In T145360#2627669, @jcrespo wrote: >> I am not sure wikitech should be reachable by terbium maintenance, and less by a production credential user like wikiadmin. Labswiki is not a production-core wiki, and i... [12:50:15] 06Operations: Cronspam from terbium - https://phabricator.wikimedia.org/T145360#2651960 (10Krenair) You know those credentials are pushed to every (wmnet) MW server by scap, right? [12:52:13] hashar: only one config change from Urbanecm for eu swat today, should I deploy it? [12:52:19] 06Operations: Cronspam from terbium - https://phabricator.wikimedia.org/T145360#2651961 (10elukey) For the 'why now' part: could it be due to the fact that the script stopped before reaching this point? It started to occur IIRC right after https://gerrit.wikimedia.org/r/#/c/309616/ was merged. [12:52:22] zeljkof: yeah I guess :] [12:52:53] I'm available so if it is possible, it can be deployed before SWAT time. 
[12:53:21] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There are 3 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [12:54:04] 06Operations: Cronspam from terbium - https://phabricator.wikimedia.org/T145360#2651963 (10jcrespo) > Oh, so you want to change the puppet manifests around so silver can run those jobs for itself, and then change the mysql password and have it use a different one of those too? I do not have a solid proposal, bu... [12:54:14] back in a while, running an errand [12:54:27] 06Operations: Cronspam from terbium - https://phabricator.wikimedia.org/T145360#2651964 (10Krenair) The script used to break before reaching this point, https://gerrit.wikimedia.org/r/#/c/309616/ fixed that. Now it's just broken by something that we already got fixed for labswiki, just not (yet?) labtestwiki [12:54:48] (03CR) 10Filippo Giunchedi: [C: 032] Upgrade to 0.1.21 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/311668 (owner: 10Gilles) [12:55:41] PROBLEM - Unmerged changes on repository puppet on puppetmaster1002 is CRITICAL: There are 3 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [12:55:49] PROBLEM - Unmerged changes on repository puppet on puppetmaster2001 is CRITICAL: There are 3 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [12:55:49] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 3 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [12:56:01] PROBLEM - Unmerged changes on repository puppet on puppetmaster2002 is CRITICAL: There are 3 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [12:56:29] hashar, Urbanecm: looking at the patch, it will take a few minutes for CI, so the deployment can be at the start of swat window sharp [12:56:53] Okay. 
[12:57:40] Urbanecm: can you test the change at mw1099? [12:59:22] If there is any way to force reload of Special:Statistics, then yes. [12:59:45] 06Operations: Cronspam from terbium - https://phabricator.wikimedia.org/T145360#2651978 (10jcrespo) >>! In T145360#2651964, @Krenair wrote: > The script used to break before reaching this point, https://gerrit.wikimedia.org/r/#/c/309616/ fixed that. Now it's just broken by something that we already got fixed for... [13:00:04] hashar, Dereckson, addshore, and aude: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160920T1300). Please do the needful. [13:00:04] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:16] Thanks, jouncebot, we've already started :) [13:00:41] (03PS3) 10Zfilipin: Change $wgArticleCountMethod in Wikidata from default ('link') to 'any' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308430 (https://phabricator.wikimedia.org/T144687) (owner: 10Urbanecm) [13:01:06] I can SWAT today! :D [13:01:10] Urbanecm: zeljkof: just sync that change to the whole cluster [13:01:23] hashar: ok, will be done [13:01:26] pretty sure the stats will have to be manually refreshed [13:01:30] hashar: What about the regenerating? [13:01:40] Do we have a script for it or something? 
[13:01:43] dig in the other similar tasks https://gerrit.wikimedia.org/r/#/c/308430/3/wmf-config/InitialiseSettings.php [13:01:53] and hopefully you would get the script to run :D [13:02:19] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308430 (https://phabricator.wikimedia.org/T144687) (owner: 10Urbanecm) [13:02:44] (03Merged) 10jenkins-bot: Change $wgArticleCountMethod in Wikidata from default ('link') to 'any' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308430 (https://phabricator.wikimedia.org/T144687) (owner: 10Urbanecm) [13:03:20] dry run is : mwscript maintenance/updateArticleCount.php --wiki=wikidatawiki [13:03:30] if happy, pass --update [13:04:09] PROBLEM - puppet last run on elastic2010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:05:33] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:308430|Change $wgArticleCountMethod in Wikidata from default (link) to any (T144687)]] (duration: 00m 47s) [13:05:34] T144687: Change $wgArticleCountMethod in Wikidata from default ('link') to 'any' - https://phabricator.wikimedia.org/T144687 [13:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:06:01] Urbanecm: deployed [13:06:07] And what about the refresh? [13:06:29] Urbanecm: is that a question for me? [13:06:39] I don't know, how is the refresh done? [13:06:48] dry run is : mwscript maintenance/updateArticleCount.php --wiki=wikidatawiki [13:06:53] if happy, pass --update [13:07:04] Source: hashar's message a few lines up :D [13:07:06] zeljkof: run it in terbium on a tmux/screen, as it's a large wiki [13:07:10] Hi. [13:07:20] Urbanecm: thanks, just saw it [13:07:27] You're welcome. [13:07:44] Dereckson: do we have docs for that? 
:D [13:07:57] (03PS1) 10Alexandros Kosiaris: puppet-merge: submodule diff should honor QUIET [puppet] - 10https://gerrit.wikimedia.org/r/311691 [13:08:11] Run screen; it'll start a new bash session. In that session, please run the commands mentioned above. [13:08:42] Urbanecm: I see, thanks [13:08:58] yeah, but use "tmux" if you don't already use one of them; it's a more modern application [13:09:00] If the SSH connection fails for any reason (unless the server is turned off or something like that), the screen session will keep running :) [13:09:17] then you can reattach with "tmux attach" or "screen -x" [13:09:20] Yes. [13:09:20] RECOVERY - puppet last run on db1019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:09:50] So it'll safely complete and there is lower risk. [13:10:00] What is the php version number in prod nodes? 5.4 or 5.5? [13:10:11] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] puppet-merge: submodule diff should honor QUIET [puppet] - 10https://gerrit.wikimedia.org/r/311691 (owner: 10Alexandros Kosiaris) [13:10:13] https://en.wikipedia.org/wiki/Special:Version doesn't return anything useful [13:11:08] Dereckson, Urbanecm: thanks, I did use tmux some time ago, but I do not use it regularly [13:11:22] Amir1: we use hhvm [13:11:33] just reading the docs, it is actually documented how to run a maintenance script [13:11:34] https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#SSH_Connections.2FCommands [13:11:40] https://wikitech.wikimedia.org/wiki/Terbium [13:11:57] Amir1: a former version, HHVM 3.12, currently 3.12.7 [13:12:13] https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment#Run_a_maintenance_script_on_a_wiki [13:12:24] Urbanecm, check related proposal https://phabricator.wikimedia.org/T144661 [13:12:45] goodmorninnngngng [13:12:52] Yeah thanks, but I want the php version number. It seems my patch works on my localhost (php7) but not in jenkins [13:13:12] godog: you around today? 
[13:13:13] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 602 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4309080 keys - replication_delay is 602 [13:13:44] jynus: I don't think this is useful for me because I have no shell access to any production server... [13:14:02] ottomata: hey, yes I am [13:14:10] well, you were teaching someone that probably has :-) [13:14:15] jynus: updateArticleCount is < 1 hour [13:14:21] Dereckson, ok [13:14:28] I thought it was longer [13:15:13] Urbanecm: the script is stuck at "Counting articles..." for a while now, I guess it will take some time [13:15:30] 06Operations: Cronspam from terbium - https://phabricator.wikimedia.org/T145360#2652011 (10Krenair) Off the top of my head, the difference between this server and silver, beyond that it lives in codfw and has 'test' in it's name, is pretty much just that it also runs services that would live on californium, and... [13:15:33] some dozens of minutes yes [13:15:48] 22000000 items isn't small amount... [13:17:00] PROBLEM - puppet last run on mw2116 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:17:03] hey bblack [13:17:11] the acme_tiny patch is merged but puppet still fails on carbon, according to icinga [13:17:12] godog: cool, let's merge https://gerrit.wikimedia.org/r/#/c/300548/ [13:17:14] ja? [13:17:18] any idea what error it's running into now? [13:17:21] (03PS5) 10Ottomata: Finish adding --until param to check_graphite script [puppet] - 10https://gerrit.wikimedia.org/r/300548 (https://phabricator.wikimedia.org/T116035) [13:17:32] Krenair: it might not really be merged? are you sure the code updated on carbon? 
[13:17:58] I honestly don't know, with no access to carbon or the puppetmaster the only thing I can know is that it's merged in gerrit [13:18:27] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4304250 keys - replication_delay is 0 [13:19:20] RECOVERY - Unmerged changes on repository puppet on puppetmaster1002 is OK: No changes to merge. [13:19:31] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [13:19:31] RECOVERY - Unmerged changes on repository puppet on puppetmaster2001 is OK: No changes to merge. [13:19:39] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge. [13:19:40] RECOVERY - Unmerged changes on repository puppet on puppetmaster2002 is OK: No changes to merge. [13:20:01] The patch has been merged in gerrit for about half an hour now, usually at this stage it would already have been deployed to the servers [13:20:18] although, those unmerged changes recoveries... [13:20:52] (03CR) 10Filippo Giunchedi: [C: 04-1] "LGTM, though check_graphite.cfg needs --until not -until" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/300548 (https://phabricator.wikimedia.org/T116035) (owner: 10Ottomata) [13:20:59] ottomata: yeah, ^ last change and gtg [13:21:16] yikes nice catch [13:21:26] urandom: o/ let me know when you have a min :) [13:21:44] Krenair: yeah there was a puppet-merge issue that's resolved now [13:22:08] ah, so that might have been holding it from getting deployed? [13:22:12] running agent on carbon now [13:23:26] Notice: /Stage[main]/Install_server::Web_server/Letsencrypt::Cert::Integrated[apt]/Exec[acme-setup-acme-apt]/returns: TypeError: urlopen() got an unexpected keyword argument 'context' [13:23:35] Amir1: so it supports most of 5.6 features, but not all [13:23:36] Krenair: ^ yeah so needs slightly more workaround [13:23:57] okay, followup patch coming up [13:23:57] zeljkof: What about the script? 
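Note on the acme_tiny failures on carbon: both errors quoted in this log (the AttributeError for ssl.create_default_context and the TypeError for urlopen()'s 'context' keyword) come from features that, per the discussion, need Python 2.7.9 or later, while carbon runs 2.7.3. What follows is only a minimal sketch of a feature-detection guard, not the merged patch. It is written for Python 3 so that it runs as-is (on 2.7 the import would be urllib2), and the function names are hypothetical.

```python
import ssl
from urllib.request import urlopen


def insecure_urlopen_kwargs():
    """Return extra keyword arguments for urlopen(); {} on old Pythons.

    Hypothetical helper, not taken from acme_tiny.
    """
    kwargs = {}
    # Feature detection instead of a version comparison: the attribute is
    # simply missing on interpreters older than 2.7.9.
    if hasattr(ssl, "create_default_context"):
        ctx = ssl.create_default_context()
        # check_hostname has to be disabled before verify_mode may be
        # set to CERT_NONE, otherwise ssl raises ValueError.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
        # Only pass 'context' when the interpreter can build one; older
        # urlopen() implementations reject the keyword with a TypeError.
        kwargs["context"] = ctx
    return kwargs


def fetch_challenge(url):
    """Fetch a URL, tolerating redirects to HTTPS with an invalid cert."""
    response = urlopen(url, **insecure_urlopen_kwargs())
    try:
        return response.read()
    finally:
        response.close()
```

Skipping certificate validation is, as noted in the channel, not good behaviour in most circumstances; it is only tolerable here because the check merely confirms that the challenge file is reachable.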
[13:24:06] Urbanecm: still running... [13:24:13] Dereckson: thanks. I go with 5.5, I think that's a safe bet [13:24:25] (03PS6) 10Ottomata: Finish adding --until param to check_graphite script [puppet] - 10https://gerrit.wikimedia.org/r/300548 (https://phabricator.wikimedia.org/T116035) [13:24:30] zeljkof: Okay. Is there any estimated time? [13:24:46] Urbanecm: this is the first time I am running a script... [13:24:59] (during deployment) [13:25:02] I have no clue [13:25:09] Okay, in another words. Does the scripts print any estimated time? [13:25:26] Amir1: MediaWiki itself is a PHP 5.5.9+ compatible application [13:25:28] Urbanecm: no, literally only "Counting articles..." [13:25:40] zeljkof: Okay, thanks. [13:26:03] (03CR) 10Alexandros Kosiaris: [C: 032] elasticsearch: Remove id_hash_mod.groovy [puppet] - 10https://gerrit.wikimedia.org/r/311665 (owner: 10Alexandros Kosiaris) [13:26:07] (03PS2) 10Alexandros Kosiaris: elasticsearch: Remove id_hash_mod.groovy [puppet] - 10https://gerrit.wikimedia.org/r/311665 [13:26:12] (03CR) 10Alexandros Kosiaris: [V: 032] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/311665 (owner: 10Alexandros Kosiaris) [13:26:22] Amir1: so if your code should be reused by other MediaWiki users, 5.5 is best, yes [13:26:36] yes [13:26:40] that's correct [13:27:14] (03PS1) 10Muehlenhoff: Create a new LDAP schema extension for custom user attributes [puppet] - 10https://gerrit.wikimedia.org/r/311694 (https://phabricator.wikimedia.org/T146102) [13:27:19] ok godog, merging [13:27:19] (03PS1) 10Alex Monk: Follow-up I55032bf7: urlopen context argument was added in Python 2.7.9 too [puppet] - 10https://gerrit.wikimedia.org/r/311695 [13:27:21] bblack, ^ [13:27:39] Urbanecm: ok, finished [13:27:49] zeljkof: Thanks. [13:28:00] said: Counting articles... found 23840309. [13:28:07] To update the site statistics table, run the script with the --update option. 
[13:28:07] ottomata: the submodule change is back btw :) [13:28:16] BAH [13:28:18] !log restarting for elasticsearch and kernel upgrade - eqiad cluster - T145404 / T146123 [13:28:19] 06Operations, 06Performance-Team, 10Thumbor: Figure out a way to live-debug running production thumbor processes - https://phabricator.wikimedia.org/T146143#2652031 (10Gilles) [13:28:20] T145404: Upgrade elasticsearch and plugins to 2.3.5 - https://phabricator.wikimedia.org/T145404 [13:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:28:27] I think you can pass --update... [13:28:27] haa [13:28:27] Urbanecm: so now approximately 10-20 minutes for --update to finish [13:28:35] Okay. [13:29:36] Urbanecm: running: mwscript maintenance/updateArticleCount.php --wiki=wikidatawiki --update [13:29:43] Okay. ) [13:29:52] please stand by, your patch is important to us... [13:29:56] (03PS7) 10Ottomata: Finish adding --until param to check_graphite script [puppet] - 10https://gerrit.wikimedia.org/r/300548 (https://phabricator.wikimedia.org/T116035) [13:30:06] I'm not going anywhere :) [13:30:21] RECOVERY - puppet last run on elastic2010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:31:09] (03CR) 10Filippo Giunchedi: [C: 031] Finish adding --until param to check_graphite script [puppet] - 10https://gerrit.wikimedia.org/r/300548 (https://phabricator.wikimedia.org/T116035) (owner: 10Ottomata) [13:31:43] Urbanecm: that was quick, finished already [13:31:55] can you check if the update worked fine? [13:31:56] (03PS3) 10Filippo Giunchedi: Add Thumbor config values moved out of package [puppet] - 10https://gerrit.wikimedia.org/r/311670 (owner: 10Gilles) [13:31:58] Maybe due to cache, I don't know. 
[13:32:01] ACKNOWLEDGEMENT - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.011 second response time Filippo Giunchedi thumbor missing config [13:32:09] Sure, working on it... [13:32:20] ah, thumbor [13:32:21] sorry about the page, that's me [13:32:27] maintenance, I assume? [13:32:30] I didn't expect the ACK to page heh [13:32:32] zeljkof: What number was written by the dry one? [13:32:34] (03CR) 10Ottomata: [C: 032] Finish adding --until param to check_graphite script [puppet] - 10https://gerrit.wikimedia.org/r/300548 (https://phabricator.wikimedia.org/T116035) (owner: 10Ottomata) [13:32:42] godog, happened to me once [13:32:43] *by the dry run I mean [13:32:50] Urbanecm: this one? "found 23840309" [13:33:02] Yes, thanks. [13:33:05] jynus: yeah very surprising [13:33:11] -- update said: Counting articles...found 23840312. Updating site statistics table... done. [13:33:16] (03PS2) 10BBlack: Follow-up I55032bf7: urlopen context argument was added in Python 2.7.9 too [puppet] - 10https://gerrit.wikimedia.org/r/311695 (owner: 10Alex Monk) [13:33:20] (03CR) 10BBlack: [C: 032 V: 032] Follow-up I55032bf7: urlopen context argument was added in Python 2.7.9 too [puppet] - 10https://gerrit.wikimedia.org/r/311695 (owner: 10Alex Monk) [13:33:29] Okay, then it is okay. [13:33:30] PROBLEM - PyBal backends health check on lvs1012 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1002.eqiad.wmnet because of too many down! [13:33:40] Special:Statistics says 23 840 315 [13:33:52] Urbanecm: ok, in that case we are done [13:33:56] (03PS4) 10Filippo Giunchedi: Add Thumbor config values moved out of package [puppet] - 10https://gerrit.wikimedia.org/r/311670 (owner: 10Gilles) [13:34:02] Thanks for your deploy zeljkof ! 
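Note on the article counts: the three figures quoted above (dry run, --update run, Special:Statistics) differ by three pages each, which is presumably just pages created on the live wiki between checks. A small sketch of that sanity check, assuming the output format quoted in the log; the helper names and the tolerance are hypothetical.

```python
import re

# Hypothetical tolerance for pages created between two checks.
MAX_DRIFT = 100


def count_from_script_output(line):
    """Extract the number after 'found' from updateArticleCount output."""
    match = re.search(r"found\s+(\d+)", line)
    if match is None:
        raise ValueError("no article count in: %r" % line)
    return int(match.group(1))


def count_from_statistics_page(text):
    """Parse a count shown with spaces as thousands separators."""
    return int(re.sub(r"\s", "", text))


# Values copied from the log above.
dry_run = count_from_script_output("Counting articles... found 23840309.")
updated = count_from_script_output(
    "Counting articles...found 23840312. Updating site statistics table... done."
)
displayed = count_from_statistics_page("23 840 315")

# The counts should only grow slightly between checks on a live wiki.
consistent = 0 <= updated - dry_run <= MAX_DRIFT and 0 <= displayed - updated <= MAX_DRIFT
```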
[13:34:18] Urbanecm: I am glad I could help :) [13:34:28] !log EU SWAT finished [13:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:35:00] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Add Thumbor config values moved out of package [puppet] - 10https://gerrit.wikimedia.org/r/311670 (owner: 10Gilles) [13:35:06] (03PS1) 10Alexandros Kosiaris: Update README.md [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/311697 [13:35:10] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1002.eqiad.wmnet because of too many down! [13:35:40] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1002.eqiad.wmnet because of too many down! [13:35:48] godog: ^^^ [13:36:00] Okay zeljkof, so good bye! [13:36:23] zeljkof: log also the script run [13:36:40] volans: yeah, should recover soon [13:36:45] Dereckson: will do, thanks, did not know I should do that [13:36:47] (03CR) 10Alexandros Kosiaris: [C: 032] Update README.md [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/311697 (owner: 10Alexandros Kosiaris) [13:37:22] !log executed script: mwscript maintenance/updateArticleCount.php --wiki=wikidatawiki --update [13:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:37:31] Dereckson: is that good? ^ [13:37:38] yep [13:37:49] Dereckson: thanks for the reminder! 
[13:37:59] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [13:38:20] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [13:38:27] Krenair: Notice: /Stage[main]/Install_server::Web_server/Letsencrypt::Cert::Integrated[apt]/Exec[acme-setup-acme-apt]/returns: executed successfully [13:38:30] \o/ [13:38:33] (03PS1) 10Alexandros Kosiaris: Update varnishkafka submodule [puppet] - 10https://gerrit.wikimedia.org/r/311698 [13:39:00] RECOVERY - PyBal backends health check on lvs1012 is OK: PYBAL OK - All pools are healthy [13:39:47] (03CR) 10Alexandros Kosiaris: [C: 032] Update varnishkafka submodule [puppet] - 10https://gerrit.wikimedia.org/r/311698 (owner: 10Alexandros Kosiaris) [13:39:56] (03PS1) 10BBlack: ciphersuite: remove chapoly draft-mode ciphers [puppet] - 10https://gerrit.wikimedia.org/r/311700 [13:40:03] (03PS1) 10Volans: Reimage: minor improvements [puppet] - 10https://gerrit.wikimedia.org/r/311701 (https://phabricator.wikimedia.org/T143536) [13:40:07] !log merged --until flag change in check_graphite script (this could affect all graphite based alerts) [13:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:40:36] RECOVERY - puppet last run on carbon is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:41:04] !log disabling puppet on labtestweb2001 [13:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:41:16] (03PS2) 10BBlack: ciphersuite: remove chapoly draft-mode ciphers [puppet] - 10https://gerrit.wikimedia.org/r/311700 [13:43:50] RECOVERY - puppet last run on mw2116 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:44:09] (03PS1) 10Hashar: jobrunner: fix rsyslog for jobchron service [puppet] - 10https://gerrit.wikimedia.org/r/311702 (https://phabricator.wikimedia.org/T146040) [13:46:49] PROBLEM - Check correctness of the icinga 
configuration on neon is CRITICAL: Icinga configuration contains errors [13:47:04] (03PS1) 10Alexandros Kosiaris: Update varnishkafka submodule [puppet] - 10https://gerrit.wikimedia.org/r/311703 [13:48:02] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Update varnishkafka submodule [puppet] - 10https://gerrit.wikimedia.org/r/311703 (owner: 10Alexandros Kosiaris) [13:49:38] 06Operations, 10DBA: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2652106 (10MoritzMuehlenhoff) Yeah, let's repool and check whether it happens again. [13:50:36] (03PS2) 10Gehel: Use 30g of heap for relforge jvms [puppet] - 10https://gerrit.wikimedia.org/r/311377 (owner: 10DCausse) [13:50:40] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/311686 (https://phabricator.wikimedia.org/T145797) (owner: 10Hashar) [13:50:45] (03PS2) 10Alexandros Kosiaris: package_builder: doc that gbp must be passed -sa [puppet] - 10https://gerrit.wikimedia.org/r/311686 (https://phabricator.wikimedia.org/T145797) (owner: 10Hashar) [13:50:47] (03CR) 10Alexandros Kosiaris: [V: 032] package_builder: doc that gbp must be passed -sa [puppet] - 10https://gerrit.wikimedia.org/r/311686 (https://phabricator.wikimedia.org/T145797) (owner: 10Hashar) [13:52:55] (03PS3) 10Gehel: Use 30g of heap for relforge jvms [puppet] - 10https://gerrit.wikimedia.org/r/311377 (owner: 10DCausse) [13:53:49] PROBLEM - puppet last run on mw2200 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:54:01] PROBLEM - puppet last run on labsdb1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/opt/wmf-mariadb10/bin/mysqld_safe] [13:55:11] (03CR) 10Gehel: [C: 032] Use 30g of heap for relforge jvms [puppet] - 10https://gerrit.wikimedia.org/r/311377 (owner: 10DCausse) [13:57:11] PROBLEM - puppet last run on conf2003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:57:40] PROBLEM - puppet last run on aqs1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:59:51] ottomata: btw did you see the neon icinga config failure? [14:01:04] 06Operations, 10Mail, 13Patch-For-Review: mx1001/2001 - Exim SMTP - Certificate expires Sep 22 2016 - https://phabricator.wikimedia.org/T144568#2652115 (10RobH) I'll fix this today! I just emailed our rep to get the cert revoked/refunded for reissue with the proper SANS. [14:02:39] (03CR) 10Volans: "LGTM, some comments inline" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/311671 (owner: 10Ema) [14:02:56] godog: no [14:03:02] i checked it [14:03:04] it said it was ok [14:03:25] * checking /usr/sbin/icinga... [ OK ] [14:04:05] ottomata: odd, I'm seeing the alert in icinga Icinga configuration contains errors [14:04:20] running verify [14:04:33] oh [14:04:33] Error: Service check command 'check_graphite_threshold_until_temp' specified in service 'Varnishkafka Delivery Errors per minute' for host 'cp3049' not defined anywhere! [14:04:43] puppet needs to run on a bunch of hosts [14:04:52] to update the stored db [14:05:14] salting... [14:05:41] thanks! [14:08:49] PROBLEM - puppet last run on mw1167 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:10:40] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [14:11:57] jynus: i'll be performing the beta cluster data migration this morning. you around just in case? [14:12:14] marxarelli, yes [14:12:25] rad. thanks! [14:14:25] PROBLEM - Zookeeper Alive Client Connections too high on conf2003 is CRITICAL: (null) [14:15:00] mmm this is weird [14:16:38] metrics are good, maybe something related to check_graphite_threshold_until_temp? [14:18:00] what is that null elukey ? 
[14:18:25] so this thing is a monitoring::graphite_threshold { 'zookeeper-client-connections': [14:18:33] and check_graphite_threshold_until_temp was just updated [14:18:55] https://gerrit.wikimedia.org/r/#/c/300548/ [14:19:06] ottomata: --^ zk alerts probably due to the patch? [14:19:53] RECOVERY - puppet last run on mw2200 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [14:21:10] elukey: probably, i think puppet needs to run there [14:21:11] RECOVERY - puppet last run on conf2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:21:15] i just ran it on a bunch of varnish hosts [14:21:17] yea ^^ [14:21:24] PROBLEM - puppet last run on cp2026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:21:26] its a icinga stored db lag thing [14:22:24] nice :) [14:23:50] !log installing tomcat security updates on Ubuntu servers [14:23:53] RECOVERY - puppet last run on aqs1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:31:18] !log restarting relforge100[12].eqiad.wmnet servers for kernel upgrade and java settings change [14:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:31:52] RECOVERY - Zookeeper Alive Client Connections too high on conf2003 is OK: OK: Less than 1.00% above the threshold [512.0] [14:33:13] (03Abandoned) 10BBlack: add VCL variable "uptime" [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/311632 (owner: 10BBlack) [14:35:04] RECOVERY - puppet last run on mw1167 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:37:31] (03PS3) 10BBlack: ciphersuite: remove chapoly draft-mode ciphers [puppet] - 10https://gerrit.wikimedia.org/r/311700 [14:37:47] (03CR) 10BBlack: [C: 032 V: 032] ciphersuite: remove chapoly draft-mode ciphers 
[puppet] - 10https://gerrit.wikimedia.org/r/311700 (owner: 10BBlack) [14:39:50] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: access request for debt on stat1003, stat1002, and fluorine for Deborah Tankersley - https://phabricator.wikimedia.org/T145914#2652206 (10debt) Hi @elukey - I probably don't need fluorine either. I'd like to be able to dig into some of the data for m... [14:40:03] (03PS1) 10Muehlenhoff: mira02 "reimaged" as deployment-mira02 [puppet] - 10https://gerrit.wikimedia.org/r/311710 (https://phabricator.wikimedia.org/T144006) [14:41:24] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 682 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4310426 keys - replication_delay is 682 [14:42:19] (03CR) 10Muehlenhoff: [C: 032] mira02 "reimaged" as deployment-mira02 [puppet] - 10https://gerrit.wikimedia.org/r/311710 (https://phabricator.wikimedia.org/T144006) (owner: 10Muehlenhoff) [14:42:26] (03PS2) 10Muehlenhoff: mira02 "reimaged" as deployment-mira02 [puppet] - 10https://gerrit.wikimedia.org/r/311710 (https://phabricator.wikimedia.org/T144006) [14:42:58] (03CR) 10Muehlenhoff: [V: 032] mira02 "reimaged" as deployment-mira02 [puppet] - 10https://gerrit.wikimedia.org/r/311710 (https://phabricator.wikimedia.org/T144006) (owner: 10Muehlenhoff) [14:44:23] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: access request for debt on stat1003, stat1002, and fluorine for Deborah Tankersley - https://phabricator.wikimedia.org/T145914#2652239 (10elukey) >>! In T145914#2652206, @debt wrote: > Hi @elukey - I probably don't need fluorine either. > > I'd like... 
[14:44:27] (03PS1) 10Rush: labstore: assign IP for secondary interface [dns] - 10https://gerrit.wikimedia.org/r/311713 (https://phabricator.wikimedia.org/T144183) [14:44:51] RECOVERY - puppet last run on cp2026 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:47:51] (03PS1) 10Jcrespo: mariadb: Disable /root/.my.cnf distribution [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/311715 (https://phabricator.wikimedia.org/T146146) [14:49:02] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4300258 keys - replication_delay is 0 [14:49:09] (03CR) 10Mobrovac: [C: 031] RESTBase: Specify the topic for transclusions. [puppet] - 10https://gerrit.wikimedia.org/r/311594 (https://phabricator.wikimedia.org/T145804) (owner: 10Ppchelko) [14:51:32] (03PS1) 10Elukey: Remove jobrunner01 from deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/311717 (https://phabricator.wikimedia.org/T144006) [14:53:24] (03PS1) 10Hashar: jobrunner: refactor rsyslog conf and let wikidev read log [puppet] - 10https://gerrit.wikimedia.org/r/311719 (https://phabricator.wikimedia.org/T146040) [14:55:13] (03CR) 10Jcrespo: [C: 032] mariadb: Disable /root/.my.cnf distribution [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/311715 (https://phabricator.wikimedia.org/T146146) (owner: 10Jcrespo) [14:55:17] 06Operations, 10MediaWiki-JobRunner, 07Beta-Cluster-reproducible, 13Patch-For-Review: wikidev people cant read /var/log/mediawiki/jobrunner.log - https://phabricator.wikimedia.org/T146040#2652251 (10hashar) [14:56:01] (03PS1) 10Jcrespo: mariadb: Disable /root/.my.cnf distribution [puppet] - 10https://gerrit.wikimedia.org/r/311720 (https://phabricator.wikimedia.org/T146146) [14:58:11] (03CR) 10Hashar: [C: 031] "jobrunner logging got broken when we moved from upstart to systemd. 
Please note the child change https://gerrit.wikimedia.org/r/#/c/31171" [puppet] - 10https://gerrit.wikimedia.org/r/311702 (https://phabricator.wikimedia.org/T146040) (owner: 10Hashar) [14:59:24] (03CR) 10Hashar: [C: 031] "jobrunner logging got broken when we moved from upstart to systemd." [puppet] - 10https://gerrit.wikimedia.org/r/311719 (https://phabricator.wikimedia.org/T146040) (owner: 10Hashar) [15:00:11] (03PS2) 10Rush: labstore: assign IP for secondary interface [dns] - 10https://gerrit.wikimedia.org/r/311713 (https://phabricator.wikimedia.org/T144183) [15:00:37] (03PS2) 10Jcrespo: mariadb: Disable /root/.my.cnf distribution [puppet] - 10https://gerrit.wikimedia.org/r/311720 (https://phabricator.wikimedia.org/T146146) [15:00:43] (03CR) 10Hashar: "(NOTE: on production one will need to run salt to change the group of all files in /var/log/mediawiki/ )." [puppet] - 10https://gerrit.wikimedia.org/r/311719 (https://phabricator.wikimedia.org/T146040) (owner: 10Hashar) [15:04:44] (03CR) 10Rush: [C: 032] labstore: assign IP for secondary interface [dns] - 10https://gerrit.wikimedia.org/r/311713 (https://phabricator.wikimedia.org/T144183) (owner: 10Rush) [15:05:10] (03PS3) 10Giuseppe Lavagetto: Have confctl exit with status code 1 if one action fails. 
[software/conftool] - 10https://gerrit.wikimedia.org/r/311678 [15:05:12] (03PS1) 10Giuseppe Lavagetto: Fix tox.ini [software/conftool] - 10https://gerrit.wikimedia.org/r/311722 [15:05:50] 06Operations, 06Labs, 07Tracking: Performance test new secondary labstore HA cluster - https://phabricator.wikimedia.org/T146153#2652274 (10madhuvishy) [15:06:10] 06Operations, 06Labs, 07Tracking: Migrate tools and misc(others) to secondary labstore HA cluster - https://phabricator.wikimedia.org/T146154#2652289 (10madhuvishy) [15:07:30] (03PS3) 10Jcrespo: mariadb: Disable /root/.my.cnf and grants distribution to all dbs [puppet] - 10https://gerrit.wikimedia.org/r/311720 (https://phabricator.wikimedia.org/T146146) [15:07:55] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix tox.ini [software/conftool] - 10https://gerrit.wikimedia.org/r/311722 (owner: 10Giuseppe Lavagetto) [15:08:58] (03PS1) 10Madhuvishy: labstore: Add monitoring script for secondary HA cluster health [puppet] - 10https://gerrit.wikimedia.org/r/311723 (https://phabricator.wikimedia.org/T144633) [15:09:37] (03CR) 10Jcrespo: [C: 032] mariadb: Disable /root/.my.cnf and grants distribution to all dbs [puppet] - 10https://gerrit.wikimedia.org/r/311720 (https://phabricator.wikimedia.org/T146146) (owner: 10Jcrespo) [15:09:53] 06Operations, 06Labs, 07Tracking: Migrate tools and misc(others) to secondary labstore HA cluster [tracking] - https://phabricator.wikimedia.org/T146154#2652316 (10madhuvishy) [15:10:56] (03PS3) 10Madhuvishy: nfsclient: Create /data/scratch symlink only if mount is present [puppet] - 10https://gerrit.wikimedia.org/r/308941 [15:11:27] (03PS3) 10Giuseppe Lavagetto: Generalize entities definitions [software/conftool] - 10https://gerrit.wikimedia.org/r/288609 [15:14:41] (03CR) 10Volans: [C: 031] "LGTM" [software/conftool] - 10https://gerrit.wikimedia.org/r/311678 (owner: 10Giuseppe Lavagetto) [15:15:01] (03CR) 10BBlack: [C: 031] Have confctl exit with status code 1 if one action fails. 
[software/conftool] - 10https://gerrit.wikimedia.org/r/311678 (owner: 10Giuseppe Lavagetto) [15:15:41] (03CR) 10Giuseppe Lavagetto: [C: 032] Have confctl exit with status code 1 if one action fails. [software/conftool] - 10https://gerrit.wikimedia.org/r/311678 (owner: 10Giuseppe Lavagetto) [15:15:54] <_joe_> so let's cut a minor release [15:16:00] (03CR) 10RobH: "GlobalSign does indeed have a regular renewal process, which doesn't change the private key. I've changed the private key each time under" [puppet] - 10https://gerrit.wikimedia.org/r/311641 (https://phabricator.wikimedia.org/T144568) (owner: 10RobH) [15:16:53] (03CR) 10Hashar: "I ran it via the puppet compiler for mw1299 and mw1161 (jobrunners) but they show as noop ..." [puppet] - 10https://gerrit.wikimedia.org/r/311702 (https://phabricator.wikimedia.org/T146040) (owner: 10Hashar) [15:16:56] (03CR) 10Hashar: "I ran it via the puppet compiler for mw1299 and mw1161 (jobrunners) but they show as noop ..." [puppet] - 10https://gerrit.wikimedia.org/r/311719 (https://phabricator.wikimedia.org/T146040) (owner: 10Hashar) [15:17:30] (03PS1) 10Rush: labs: tc-setup add 'clean' option to remove shaping [puppet] - 10https://gerrit.wikimedia.org/r/311725 [15:18:35] (03PS1) 10BBlack: N-hit-wonder: 4-hit, and improve filtering [puppet] - 10https://gerrit.wikimedia.org/r/311726 (https://phabricator.wikimedia.org/T144187) [15:18:40] 06Operations, 10MediaWiki-JobRunner, 07Beta-Cluster-reproducible, 13Patch-For-Review: wikidev people cant read /var/log/mediawiki/jobrunner.log - https://phabricator.wikimedia.org/T146040#2652356 (10hashar) p:05Triage>03High I have applied the patches to the beta cluster and that makes the log readable... [15:19:12] 06Operations, 10Traffic, 07LDAP: update ldap-[codfw|eqiad].wikimedia.org certificates (expire on 2016-09-20) - https://phabricator.wikimedia.org/T145201#2652358 (10RobH) a:05MoritzMuehlenhoff>03RobH I can try to get our money back, but it is doubtful. I'll pull this to me for now. 
[15:20:15] godog: _joe_: I have a couple patches to fix the jobrunner rsyslog configuration that I have tested on beta. Is that something I can add to puppet swat ? (I cant attend though) [15:20:39] godog: _joe_: the reason is that non root cant read the /var/log/mediawiki/job*.log files, they are not wikidev readable [15:21:11] PROBLEM - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is CRITICAL: Connection timed out [15:21:22] paged [15:21:48] yuvipanda / chase ^ ? [15:21:53] chasemp even =] [15:22:00] yeah I just saw it as well, not sure what's up yet [15:23:15] everything there shifted to unknown it seems like mostly [15:23:22] smacks of bad tests or something [15:24:03] PROBLEM - NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: Connection timed out [15:24:43] I'm going to downtime this, I think it's an issue w/ the checker itself atm but still confirming [15:25:16] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 736 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4302826 keys - replication_delay is 736 [15:26:32] (03PS1) 10Giuseppe Lavagetto: Release 0.3.1 [software/conftool] - 10https://gerrit.wikimedia.org/r/311729 [15:27:41] <_joe_> hashar: uh? since the switch to jessie I guess? 
[15:28:34] _joe_ I created the jessie jobrunner in deployment prep today [15:28:45] (03CR) 10BBlack: [C: 032] N-hit-wonder: 4-hit, and improve filtering [puppet] - 10https://gerrit.wikimedia.org/r/311726 (https://phabricator.wikimedia.org/T144187) (owner: 10BBlack) [15:29:11] _joe_: yeah [15:29:17] the old one is disabled via puppet (jobrunner/chron stopped) [15:29:29] _joe_: the rsyslog rule had a typo preventing jobchron from being directed to jobchron.log [15:29:47] and I believe rsyslog defaults to files belonging to root:adm which are 0640 [15:30:03] hardly a problem since nobody apparently needed those logs since the switch early in July [15:30:29] but it turns out I need to look at them. So I went bold and fixed the rsyslog conf (iterated / verified on beta deployment-jobrunner02) [15:30:50] (jobrunner02 == jessie one) [15:30:53] I am quite lucky elukey got a Jessie jobrunner on beta :] [15:31:11] it has been ideal timing for me to test out the rsyslog rule [15:33:06] _joe_: having access to the log, I will be able to upgrade jobrunner tonight to get some debug logs added and investigate an ongoing issue with jobs failing mysteriously [15:33:14] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4300237 keys - replication_delay is 0 [15:33:24] (03PS3) 10Ema: base: add run-no-puppet [puppet] - 10https://gerrit.wikimedia.org/r/311671 [15:33:39] <_joe_> hashar: what's the patch? 
[15:34:09] bah sorry [15:34:17] I have too many windows [15:34:47] _joe_: fix up a couple typos for jobchron https://gerrit.wikimedia.org/r/#/c/311702/1/modules/mediawiki/files/jobrunner.rsyslog.conf [15:34:52] which is imho entirely safe [15:35:04] and then a change that refactor the rsyslog logic https://gerrit.wikimedia.org/r/#/c/311719/1/modules/mediawiki/files/jobrunner.rsyslog.conf [15:35:20] based on rsyslog 8 which we have on jessie and slightly more readable [15:35:32] well not really more readable :] [15:35:34] (03PS2) 10Giuseppe Lavagetto: jobrunner: fix rsyslog for jobchron service [puppet] - 10https://gerrit.wikimedia.org/r/311702 (https://phabricator.wikimedia.org/T146040) (owner: 10Hashar) [15:35:43] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] jobrunner: fix rsyslog for jobchron service [puppet] - 10https://gerrit.wikimedia.org/r/311702 (https://phabricator.wikimedia.org/T146040) (owner: 10Hashar) [15:35:55] (03CR) 10Ema: base: add run-no-puppet (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/311671 (owner: 10Ema) [15:39:09] _joe_: gotta go sorry :( [15:39:48] <_joe_> hasharAway: the patch works fine, fwiw [15:42:04] RECOVERY - NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.024 second response time [15:43:07] RECOVERY - toolschecker service itself needs to return OK on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.032 second response time [15:45:10] (03CR) 10Volans: "Thanks for the fixes, another thing I've though about inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/311671 (owner: 10Ema) [15:49:15] (03PS2) 10Giuseppe Lavagetto: etcd::ssl: fix puppet ssldir path [puppet] - 10https://gerrit.wikimedia.org/r/309929 (https://phabricator.wikimedia.org/T144703) (owner: 10Alex Monk) [15:49:26] (03CR) 10Alexandros Kosiaris: [C: 031] "nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/311671 (owner: 10Ema) [15:51:03] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/4133/ noop in production" [puppet] - 10https://gerrit.wikimedia.org/r/309929 (https://phabricator.wikimedia.org/T144703) (owner: 10Alex Monk) [15:51:17] <_joe_> Krenair: I'm merging it now [15:51:21] ok [15:57:29] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: Configure varnish to include wdqs nodes in codfw - https://phabricator.wikimedia.org/T146158#2652420 (10Gehel) [15:57:54] Krenair: do we know what branch is being deployed this week yet? :P [15:58:11] I haven't been tracking that [15:58:16] ahh, okay! [15:58:34] the calendar still says 19->20 but of course that is wrong.. [15:58:34] (03PS2) 10Giuseppe Lavagetto: conftool: get conf from class parameters [puppet] - 10https://gerrit.wikimedia.org/r/310459 (owner: 10Alex Monk) [15:58:36] addshore: we are back on wmf18 [15:58:45] aude: indeed, but will we be going to 19 or straight to 20? [15:58:56] probably 19 [15:59:06] okay! [15:59:13] * aude isn't sure what to do with wikidata [15:59:21] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 10hardware-requests: codfw: (2) wqds200[12] systems - https://phabricator.wikimedia.org/T138637#2652436 (10Gehel) The configuration of those servers is tracked on T144380. This task can probably be closed, unless it is still used to track s... [15:59:29] i'll take a look tomorrow to see what happened nonetheless! [15:59:42] has a new branch for wikibase prepared, in case, but maybe sticking with the old stuff + backports is better [15:59:43] well aude 19mw and 19wikibase work fine together! [15:59:57] they must, since they were deployed last week [16:00:04] godog and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160920T1600). Please do the needful. 
[16:00:04] Krenair and hashar: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:13] yeh, the things to do with loadbalancer etc would have been in the 20 branch [16:00:17] yeah [16:00:27] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: Configure varnish to include wdqs nodes in codfw - https://phabricator.wikimedia.org/T146158#2652439 (10Gehel) [16:00:32] (still not quite sure how that got throguh CI)! [16:00:44] :/ [16:01:10] maybe a wmf20 branch wouldn't have [16:01:42] or certainly would have broke test.wikipedia [16:01:56] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool: get conf from class parameters [puppet] - 10https://gerrit.wikimedia.org/r/310459 (owner: 10Alex Monk) [16:03:58] (03PS2) 10RobH: mail.wikimedia.org cert expires on Thursday 2016-09-22 [puppet] - 10https://gerrit.wikimedia.org/r/311641 (https://phabricator.wikimedia.org/T144568) [16:04:43] <_joe_> jynus: Krenair {{done}} [16:04:46] <_joe_> err [16:04:51] <_joe_> -jynus [16:05:20] (03CR) 10Alexandros Kosiaris: [C: 04-1] Create a new LDAP schema extension for custom user attributes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/311694 (https://phabricator.wikimedia.org/T146102) (owner: 10Muehlenhoff) [16:06:53] _joe_, they've become non-cherry-picks, thanks [16:07:50] puppet on deployment-conf03 is happy [16:07:55] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "This needs a coordinated logrotate change, not merging for puppetSWAT, but in general LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/311719 (https://phabricator.wikimedia.org/T146040) (owner: 10Hashar) [16:08:03] <_joe_> Krenair: :)) [16:08:10] <_joe_> ok puppetSWAT done [16:08:48] hrm. So I'm not entirely sure what to do about branching since we still don't have wmf.19 in place. [16:09:37] PROBLEM - puppet last run on mw1223 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/conftool/config.yaml] [16:09:42] PROBLEM - puppet last run on mw2216 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/conftool/config.yaml] [16:09:48] PROBLEM - puppet last run on mw2213 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/conftool/config.yaml] [16:09:53] there is a performance regression issue, but that issue is present in wmf.18 which is on all wikis currently :\ [16:09:57] PROBLEM - puppet last run on mw1224 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/conftool/config.yaml] [16:10:40] PROBLEM - puppet last run on mw2239 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/conftool/config.yaml] [16:10:53] _joe_: ^^^ ? [16:11:02] <_joe_> uhm [16:11:09] <_joe_> shit [16:11:23] <_joe_> I hope it's just transient [16:11:27] PROBLEM - puppet last run on mw2185 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/conftool/config.yaml] [16:11:28] PROBLEM - puppet last run on mw1209 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/conftool/config.yaml] [16:11:30] <_joe_> so we removed a file [16:11:39] <_joe_> and the catalog was already compiled [16:11:43] Could not retrieve information from environment production source(s) puppet://///modules/conftool/production.config.yaml [16:11:46] <_joe_> yes [16:12:14] (03PS1) 10Rush: labstore: use secondary interface for DRBD replication [puppet] - 10https://gerrit.wikimedia.org/r/311732 [16:12:18] <_joe_> we removed that file [16:12:29] <_joe_> and hosts which had already compiled their catalog [16:12:35] <_joe_> were still asking for it [16:12:48] ok make sense [16:13:06] running on mw2239 [16:13:09] RECOVERY - puppet last run on mw2239 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [16:13:11] <_joe_> alreadyd i [16:13:14] <_joe_> *did [16:13:19] <_joe_> exactly on the same host [16:13:27] lol [16:13:30] (03CR) 10jenkins-bot: [V: 04-1] labstore: use secondary interface for DRBD replication [puppet] - 10https://gerrit.wikimedia.org/r/311732 (owner: 10Rush) [16:13:38] PROBLEM - puppet last run on mw1175 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:14:23] yuvipanda: hashar's quarry change would be nice to have, if you don't mind (at any rate, it has lots of time) https://gerrit.wikimedia.org/r/#/c/308313/ [16:16:01] _joe_: auch.. I've put my change to the wrong week puppet swat window... Could you pleeeease do this one to https://gerrit.wikimedia.org/r/#/c/311594/ ? [16:16:28] <_joe_> ahahaha [16:16:39] (03PS2) 10Rush: labstore: use secondary interface for DRBD replication [puppet] - 10https://gerrit.wikimedia.org/r/311732 [16:16:41] <_joe_> Pchelolo: that will cost you a drink next time we meet :P [16:16:55] _joe_: okey, no problem :) [16:17:02] PROBLEM - puppet last run on mw2201 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:17:02] (03CR) 10Giuseppe Lavagetto: [C: 032] RESTBase: Specify the topic for transclusions. [puppet] - 10https://gerrit.wikimedia.org/r/311594 (https://phabricator.wikimedia.org/T145804) (owner: 10Ppchelko) [16:17:11] (03PS2) 10Giuseppe Lavagetto: RESTBase: Specify the topic for transclusions. [puppet] - 10https://gerrit.wikimedia.org/r/311594 (https://phabricator.wikimedia.org/T145804) (owner: 10Ppchelko) [16:17:19] _joe_: thank you :) [16:17:21] (03PS3) 10Dzahn: admin: create shell account for Sam Walton [puppet] - 10https://gerrit.wikimedia.org/r/311473 (https://phabricator.wikimedia.org/T145788) [16:17:25] (03CR) 10Giuseppe Lavagetto: [V: 032] RESTBase: Specify the topic for transclusions. [puppet] - 10https://gerrit.wikimedia.org/r/311594 (https://phabricator.wikimedia.org/T145804) (owner: 10Ppchelko) [16:17:49] <_joe_> Pchelolo: want me to run puppet across the rb cluster? [16:18:08] PROBLEM - puppet last run on mw2164 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:18:09] _joe_: no, it's a preparation for future, we didn't even deploy the code yet [16:18:26] <_joe_> ok so I can just let puppet run its course [16:18:56] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Request for access to stat1003 for Sam Walton - https://phabricator.wikimedia.org/T145788#2640942 (10Dzahn) Hi @Samwalton9 this is ready to go, we'll just need the manager approval, then i'll start by merging the change above to create your user accoun... 
[16:18:58] (03PS3) 10Rush: labstore: use secondary interface for DRBD replication [puppet] - 10https://gerrit.wikimedia.org/r/311732 [16:19:09] (03PS2) 10Rush: labs: tc-setup add 'clean' option to remove shaping [puppet] - 10https://gerrit.wikimedia.org/r/311725 [16:21:02] <_joe_> Pchelolo: puppet is happy, in ~ 30 minutes the change will be applied everywhere [16:21:16] _joe_: thank you :) [16:23:11] (03CR) 10Faidon Liambotis: [C: 04-1] Create a new LDAP schema extension for custom user attributes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/311694 (https://phabricator.wikimedia.org/T146102) (owner: 10Muehlenhoff) [16:27:23] (03PS12) 10Dduvall: beta: Create and mount LVM volumes for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/305668 (https://phabricator.wikimedia.org/T138778) [16:28:55] 06Operations: Migrate Graphana dashboard "labs-project-board" from prod to labs Graphana instance - https://phabricator.wikimedia.org/T146136#2652500 (10Addshore) 05Open>03Resolved a:03Addshore Please see https://grafana-labs.wikimedia.org/dashboard/db/labs-project-board [16:29:25] (03PS2) 10Dduvall: beta: Install MariaDB 10 [puppet] - 10https://gerrit.wikimedia.org/r/310360 (https://phabricator.wikimedia.org/T138778) [16:29:51] 06Operations: Migrate Graphana dashboard "labs-project-board" from prod to labs Graphana instance - https://phabricator.wikimedia.org/T146136#2652503 (10Addshore) It is actually easier than copying the JSON. There is an export link for each dashboard, once exported the file can be imported to the labs grafana in... 
[16:29:59] 06Operations, 15User-Addshore: Migrate Graphana dashboard "labs-project-board" from prod to labs Graphana instance - https://phabricator.wikimedia.org/T146136#2652504 (10Addshore) [16:32:28] !log restarting cassandra on aqs100[56] (started the work earlier on today, stopped due to T146130) [16:32:29] T146130: Inconsistent Cassandra disk load shown in metrics and nodetool status - https://phabricator.wikimedia.org/T146130 [16:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:34:27] RECOVERY - puppet last run on mw2185 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [16:34:28] RECOVERY - puppet last run on mw1209 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [16:35:12] RECOVERY - puppet last run on mw1223 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:35:12] RECOVERY - puppet last run on mw2216 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:35:27] RECOVERY - puppet last run on mw2213 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:35:28] RECOVERY - puppet last run on mw1224 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:37:27] PROBLEM - puppet last run on mw1212 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:38:53] (03PS2) 10Dzahn: admin: create shell account for Deborah Tankersley [puppet] - 10https://gerrit.wikimedia.org/r/311482 (https://phabricator.wikimedia.org/T145914) [16:39:07] RECOVERY - puppet last run on mw1175 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:39:18] (03CR) 10Dzahn: [C: 032] admin: create shell account for Deborah Tankersley [puppet] - 10https://gerrit.wikimedia.org/r/311482 (https://phabricator.wikimedia.org/T145914) (owner: 10Dzahn) [16:40:27] 06Operations, 06Labs, 07Tracking: Migrate tools and misc(others) to secondary labstore HA cluster [tracking] - https://phabricator.wikimedia.org/T146154#2652606 (10chasemp) [16:41:18] 06Operations, 10hardware-requests: eqiad: (4) worker servers for kubernetes - https://phabricator.wikimedia.org/T141624#2652609 (10RobH) So the quotes are now all in on the procurement sub task. However, they will not be ordered in time for the ops offsite next week, and we wanted some systems in place for th... [16:42:31] RECOVERY - puppet last run on mw2201 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [16:43:39] RECOVERY - puppet last run on mw2164 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [16:44:32] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: access request for debt on stat1003, stat1002, and fluorine for Deborah Tankersley - https://phabricator.wikimedia.org/T145914#2652623 (10Dzahn) The user as such has been created, so bastion access is granted. Additional groups are pending the question... [16:47:53] marxarelli, feel free to tell me if you need me to merge to production the mariadb patches [16:48:42] jynus: ok. 
i cherry picked them for now, but i'm not sure i understand our mariadb 10 package enough to get it started [16:48:53] !log increase recovery bandwidth on elasticsearch eqiad to match codfw - T145404 [16:48:54] T145404: Upgrade elasticsearch and plugins to 2.3.5 - https://phabricator.wikimedia.org/T145404 [16:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:49:24] i.e. systemctl start mysql is still starting up mysqld from the 5.5 package [16:49:47] since we're towards the end of the window, i might just continue with 5.5 and we can upgrade another day this week [16:49:52] marxarelli, you must have both packages installed [16:49:58] yep :) [16:50:07] uninstall 5.5 [16:50:18] if it's as easy as purging 5.5, i can do that [16:50:18] and run /opt/wmf-mariadb10/install [16:50:37] ok, i'll try it [16:50:41] other than that, we're looking good data wise [16:51:12] the packages were designed to allow several of them to be installed at the same time [16:51:17] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: access request for debt on stat1003, stat1002, and fluorine for Deborah Tankersley - https://phabricator.wikimedia.org/T145914#2652671 (10Dzahn) Hi @debt You have a user now that can connect to the bastion hosts (bast1001, bast2001, bast3001, bast4001)... [16:51:50] 06Operations, 06Performance-Team, 10Thumbor: Figure out a way to live-debug running production thumbor processes - https://phabricator.wikimedia.org/T146143#2652031 (10ori) I've had good luck in the past with [[ https://pypi.python.org/pypi/pyrasite | Pyrasite ]] and [[ https://github.com/ionelmc/python-manh...
[16:52:30] (03PS1) 10Alexandros Kosiaris: WIP puppetmaster: servermon report handler [puppet] - 10https://gerrit.wikimedia.org/r/311738 [16:53:40] (03CR) 10jenkins-bot: [V: 04-1] WIP puppetmaster: servermon report handler [puppet] - 10https://gerrit.wikimedia.org/r/311738 (owner: 10Alexandros Kosiaris) [16:54:40] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: access request for debt on stat1003, stat1002, and fluorine for Deborah Tankersley - https://phabricator.wikimedia.org/T145914#2652691 (10Dzahn) @debt By the way, any shell user also has access to people.wikimedia.org (via the host called "rutherfordiu... [16:55:09] (03PS1) 10Jcrespo: mariadb: remove /root/.my.cnf from module permanently [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/311739 (https://phabricator.wikimedia.org/T146146) [16:55:12] jynus: looking good now. "Version: '10.0.23-MariaDB-log' socket: '/tmp/mysql.sock' port: 3306 MariaDB Server" [16:55:16] nice [16:55:20] sorry about that [16:55:23] did mysql_upgrade and restarted mysql.service [16:55:27] no worries [16:55:34] it is a temporary thing [16:55:43] next package version will solve it [16:55:57] (during the 5.5 -> 10 upgrade) [16:56:07] yeah, makes total sense when you have to upgrade with care :) [16:56:11] 06Operations: Cronspam from terbium - https://phabricator.wikimedia.org/T145360#2652693 (10Krenair) >>! In T145360#2651963, @jcrespo wrote: >> Oh, so you want to change the puppet manifests around so silver can run those jobs for itself, and then change the mysql password and have it use a different one of those... 
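(Editor's sketch, not part of the channel log.) The 5.5 -> 10 upgrade check at 16:55:12 comes down to reading the server's startup banner and confirming it reports a 10.x MariaDB build. A minimal Python illustration of that check follows; the banner string is the one quoted in the log, while the regular expression and the helper names `parse_mariadb_version` and `is_mariadb_10` are assumptions made for this example and are not part of any WMF package.

```python
import re

# Banner text as quoted in the log at 16:55:12.
BANNER = ("Version: '10.0.23-MariaDB-log' socket: '/tmp/mysql.sock' "
          "port: 3306 MariaDB Server")

# Hypothetical pattern: three dotted integers followed by "-MariaDB".
_VERSION_RE = re.compile(r"Version: '(\d+)\.(\d+)\.(\d+)-MariaDB")


def parse_mariadb_version(banner):
    """Return (major, minor, patch) from a startup banner, or None."""
    match = _VERSION_RE.search(banner)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def is_mariadb_10(banner):
    """True when the banner reports a 10.x MariaDB server."""
    version = parse_mariadb_version(banner)
    return version is not None and version[0] == 10


if __name__ == "__main__":
    print(parse_mariadb_version(BANNER))
    print(is_mariadb_10(BANNER))
```

A banner that does not match the pattern yields None, so a server still started from another package would simply fail the `is_mariadb_10` check rather than raise.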
[16:56:32] marxarelli, check if performance_schema is enabled [16:56:41] it requires restart, and can be configured later [16:56:45] now i'm just waiting on the uncompress on deployment-db04 and i'll setup replication [16:57:09] ok [16:58:13] 06Operations: setup wmf4747/wmf4748/wmf4749/wmf4750 for temp kubernetes testing - https://phabricator.wikimedia.org/T146171#2652695 (10RobH) [16:58:15] 07Puppet, 10Beta-Cluster-Infrastructure, 07Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2652713 (10AlexMonk-WMF) [16:58:18] 07Puppet, 10Beta-Cluster-Infrastructure, 13Patch-For-Review: Puppet failing on deployment-conf03 due to missing files - https://phabricator.wikimedia.org/T144703#2652712 (10AlexMonk-WMF) 05Open>03Resolved [16:58:23] !log adding aqs1005 to live traffic - aqs.svc.eqiad.wmnet - T144497 [16:58:24] T144497: Switch AQS to new cluster - https://phabricator.wikimedia.org/T144497 [16:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:00:03] 06Operations, 10Mail: status of wikigroup@ alias - https://phabricator.wikimedia.org/T127551#2652721 (10bbogaert) Sure. Thanks for the bump. I'll setup the google group. [17:00:04] yurik, gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160920T1700). Please do the needful. 
[17:00:13] no parsoid deploy today [17:01:53] !log adding aqs1006 to live traffic - aqs.svc.eqiad.wmnet - T144497 [17:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:02:59] RECOVERY - puppet last run on mw1212 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:03:20] 06Operations, 10ops-eqiad: get port info for wmf4747/wmf4748/wmf4749/wmf4750 - https://phabricator.wikimedia.org/T146172#2652733 (10RobH) [17:03:24] (03PS1) 10Andrew Bogott: Set CHARSET=utf8mb4 for the labspuppetbackend tables. [puppet] - 10https://gerrit.wikimedia.org/r/311742 (https://phabricator.wikimedia.org/T133412) [17:03:57] PROBLEM - puppet last run on mw1202 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:04:59] 06Operations: setup wmf4747/wmf4748/wmf4749/wmf4750 for temp kubernetes testing - https://phabricator.wikimedia.org/T146171#2652695 (10RobH) I just went to allocate the ports on these, and it turns out they aren't labeled on the switch. I've created sub-task T146172 for @Cmjohnson to pull the info for these. I... [17:05:18] 06Operations: Cronspam from terbium - https://phabricator.wikimedia.org/T145360#2652758 (10jcrespo) I honestly would accept no separation, if the host was called other than labstestweb- labswiki2, silver2, wikitech2, mw2300. Anything that does not implies "test" or "labs VM", so we could grant access to anyone b... [17:05:19] (03CR) 10BryanDavis: [C: 031] Set CHARSET=utf8mb4 for the labspuppetbackend tables. 
[puppet] - 10https://gerrit.wikimedia.org/r/311742 (https://phabricator.wikimedia.org/T133412) (owner: 10Andrew Bogott) [17:05:29] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 701 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4302646 keys - replication_delay is 701 [17:07:36] 06Operations, 06Release-Engineering-Team, 07HHVM, 13Patch-For-Review: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2652761 (10MoritzMuehlenhoff) I've rebuilt the host as deployment-mira02 with /srv/ on a separate 20 GB partition and deleted the mira02 instance... [17:08:09] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4301615 keys - replication_delay is 0 [17:08:58] (03CR) 10Jcrespo: [C: 032] mariadb: remove /root/.my.cnf from module permanently [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/311739 (https://phabricator.wikimedia.org/T146146) (owner: 10Jcrespo) [17:12:18] (03PS1) 10RobH: setting ip addresses for temp kubernetes hosts [dns] - 10https://gerrit.wikimedia.org/r/311744 [17:14:13] (03CR) 10RobH: [C: 032] setting ip addresses for temp kubernetes hosts [dns] - 10https://gerrit.wikimedia.org/r/311744 (owner: 10RobH) [17:15:07] 06Operations: setup wmf4747/wmf4748/wmf4749/wmf4750 for temp kubernetes testing - https://phabricator.wikimedia.org/T146171#2652780 (10RobH) [17:20:50] (03CR) 10Yuvipanda: [C: 031] Set CHARSET=utf8mb4 for the labspuppetbackend tables. [puppet] - 10https://gerrit.wikimedia.org/r/311742 (https://phabricator.wikimedia.org/T133412) (owner: 10Andrew Bogott) [17:22:21] (03CR) 10Andrew Bogott: [C: 032] Set CHARSET=utf8mb4 for the labspuppetbackend tables. 
[puppet] - 10https://gerrit.wikimedia.org/r/311742 (https://phabricator.wikimedia.org/T133412) (owner: 10Andrew Bogott) [17:22:31] (03CR) 10Rush: nfsclient: Create /data/scratch symlink only if mount is present (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/308941 (owner: 10Madhuvishy) [17:22:36] (03PS4) 10Rush: nfsclient: Create /data/scratch symlink only if mount is present [puppet] - 10https://gerrit.wikimedia.org/r/308941 (owner: 10Madhuvishy) [17:27:08] !log update RESTBase to 4829630f canary on restbase1007 [17:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:29:28] RECOVERY - puppet last run on mw1202 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [17:30:32] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: access request for debt on stat1003, stat1002, and fluorine for Deborah Tankersley - https://phabricator.wikimedia.org/T145914#2644948 (10mpopov) >>! In T145914#2652239, @elukey wrote: >>>! In T145914#2652206, @debt wrote: >> Hi @elukey - I probably do... [17:30:59] 06Operations: Cronspam from terbium - https://phabricator.wikimedia.org/T145360#2652826 (10Krenair) >>! In T145360#2652758, @jcrespo wrote: > I honestly would accept no separation, if the host was called other than labstestweb- labswiki2, silver2, wikitech2, mw2300. Anything that does not implies "test" or "labs... [17:31:05] marostegui: I was going to do another PageAssessments roll-out today, but I noticed that there is an abnormally high level of "replace" activity on the s3 database currently: https://tendril.wikimedia.org/host/view/db1075.eqiad.wmnet/3306 [17:31:15] Any idea what might be causing it? [17:31:34] last time it had a steady average of about 6,000 replaces per minute(?) and now it's averaging about 12,000. 
[17:33:04] kaldari: Not sure - I will check in a bit, I am in a meeting, don't know if jynus knows something about it [17:33:12] thanks [17:33:46] maybe it'll die down [17:34:31] PROBLEM - Restbase root url on restbase1007 is CRITICAL: Connection refused [17:34:38] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.223, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [17:35:08] ^^^ it's all right, cassandra schema update takes quite a long time [17:35:35] kaldari, that is "REPLACE /* DatabaseMysqlBase::replace */ INTO `module_deps`", and it is normal [17:35:49] "normal" [17:36:08] jynus: Thanks, I just wanted to check since it's twice the level it was last time [17:36:10] I do not know why we need so many, but they have been there forever [17:36:11] (03PS5) 10Madhuvishy: nfsclient: Create /data/scratch symlink only if mount is present [puppet] - 10https://gerrit.wikimedia.org/r/308941 [17:37:54] (03PS3) 10Dduvall: beta: Configure storage cluster for migrated databases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305675 (https://phabricator.wikimedia.org/T138778) [17:37:57] jynus: it looks like the rate went up significantly around Sept 8., but I won't worry about it for now. [17:38:20] (03CR) 10Rush: nfsclient: Create /data/scratch symlink only if mount is present (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/308941 (owner: 10Madhuvishy) [17:39:19] kaldari, this is a good one to check all nodes of a shard at the same time: https://grafana-admin.wikimedia.org/dashboard/db/mysql-aggregated [17:39:50] jynus: that's exactly what I need. Thanks! [17:40:06] e.g. 
https://grafana-admin.wikimedia.org/dashboard/db/mysql-aggregated?from=1474389595528&to=1474393195528&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s3&var-role=All [17:40:11] (03CR) 10Dduvall: [C: 032] beta: Configure storage cluster for migrated databases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305675 (https://phabricator.wikimedia.org/T138778) (owner: 10Dduvall) [17:40:32] (03CR) 10Madhuvishy: nfsclient: Create /data/scratch symlink only if mount is present (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/308941 (owner: 10Madhuvishy) [17:40:37] (03Merged) 10jenkins-bot: beta: Configure storage cluster for migrated databases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305675 (https://phabricator.wikimedia.org/T138778) (owner: 10Dduvall) [17:40:38] that will show increases in WPS, rows read and written, and max lag [17:41:20] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [17:42:00] RECOVERY - Restbase root url on restbase1007 is OK: HTTP OK: HTTP/1.1 200 - 15293 bytes in 0.022 second response time [17:42:18] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [17:42:37] (03PS1) 10RobH: setting install params for temp kubernetes test workers [puppet] - 10https://gerrit.wikimedia.org/r/311746 [17:43:09] (03CR) 10RobH: [C: 032] setting install params for temp kubernetes test workers [puppet] - 10https://gerrit.wikimedia.org/r/311746 (owner: 10RobH) [17:43:51] 06Operations, 06Release-Engineering-Team, 07HHVM, 13Patch-For-Review: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2652884 (10thcipriani) >>! In T144578#2652761, @MoritzMuehlenhoff wrote: > I've rebuilt the host as deployment-mira02 with /srv/ on a separate 20... 
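(Editor's sketch, not part of the channel log.) The dashboard link pasted at 17:40:06 encodes its time window as `from`/`to` epoch milliseconds and its filters as `var-*` query parameters. The Python below decodes those fields using only the standard library; the function name `describe_dashboard_link` is an assumption made for this example, and the interpretation of `from`/`to` as UTC epoch milliseconds is inferred from the values rather than stated in the log.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit

# URL exactly as pasted in the channel.
URL = ("https://grafana-admin.wikimedia.org/dashboard/db/mysql-aggregated"
       "?from=1474389595528&to=1474393195528"
       "&var-dc=eqiad%20prometheus%2Fops&var-group=core"
       "&var-shard=s3&var-role=All")

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def _ms_to_utc(ms_text):
    """Convert an epoch-milliseconds string to an aware UTC datetime."""
    return _EPOCH + timedelta(milliseconds=int(ms_text))


def describe_dashboard_link(url):
    """Return (start, end, variables) decoded from the link's query string."""
    query = parse_qs(urlsplit(url).query)
    start = _ms_to_utc(query["from"][0])
    end = _ms_to_utc(query["to"][0])
    # Strip the "var-" prefix so the keys read as dc, group, shard, role.
    variables = {
        key[len("var-"):]: values[0]
        for key, values in query.items()
        if key.startswith("var-")
    }
    return start, end, variables


if __name__ == "__main__":
    start, end, variables = describe_dashboard_link(URL)
    print(start.isoformat(), end.isoformat(), variables)
```

Decoded this way, the link covers a one-hour window ending around the time it was pasted, filtered to shard s3.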
[17:45:25] (03PS1) 10Alex Monk: Unbreak I863367b8: Fix existing check_graphite(_series)?_threshold callers [puppet] - 10https://gerrit.wikimedia.org/r/311747 [17:46:41] 06Operations: setup wmf4747/wmf4748/wmf4749/wmf4750 for temp kubernetes testing - https://phabricator.wikimedia.org/T146171#2652886 (10RobH) [17:46:43] ottomata, ^ [17:46:47] I think your change caused some issues [17:46:53] like shinken having a massive fit in -labs earlier [17:47:16] It started generating commands like /usr/lib/nagios/plugins/check_graphite -U https://graphite-labs.wikimedia.org -T 10 check_series_threshold 'deployment-prep.deployment-apertium01.diskspace.*.byte_percentfree' -W 15 -C 10 --from 10min --until 1 --perc --under --allow-undefined [17:47:28] !log update RESTBase to 4829630f [17:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:50:30] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 07Wikimedia-Incident: Deploy WDQS nodes on codfw - https://phabricator.wikimedia.org/T124862#2652916 (10RobH) [17:50:33] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 10hardware-requests: codfw: (2) wqds200[12] systems - https://phabricator.wikimedia.org/T138637#2652915 (10RobH) 05Open>03Resolved [17:51:13] (03CR) 10Alex Monk: "This broke shinken: I80040bee" [puppet] - 10https://gerrit.wikimedia.org/r/300548 (https://phabricator.wikimedia.org/T116035) (owner: 10Ottomata) [17:53:38] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3670 bytes in 0.038 second response time [17:55:07] 06Operations, 06Labs: Puppet broken on labcontrol1002 - https://phabricator.wikimedia.org/T145185#2652951 (10Andrew) 05Open>03Resolved This was fixed with a hiera change a while ago. 
[17:57:23] 06Operations, 10Phabricator: phabricator: can't search for RT tickets (reference field) anymore - https://phabricator.wikimedia.org/T146116#2652969 (10Paladox) It seems there is a bug with the text field of ^^ [18:00:05] anomie, ostriches, thcipriani, hashar, and twentyafterfour: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160920T1800). [18:01:37] * hasharAway declines [18:01:43] looks like nothing is scheduled [18:03:01] ottomata: can you look at https://gerrit.wikimedia.org/r/311747 please? It's a follow-up to a change you made to check_graphite :) [18:09:35] looking [18:10:02] oh huh [18:10:57] hasharAway: are those defined differently than the usual puppet defines? [18:11:18] monitor::graphite stuff? [18:11:56] yeah huh [18:12:06] hasharAway: why aren't those created with monitoring::graphite and monitoring::graphite_threshold? [18:12:14] i'll ask on ticket [18:12:30] (03CR) 10Ottomata: "Why aren't these created using monitoring::graphite and monitoring::graphite_threshold?" [puppet] - 10https://gerrit.wikimedia.org/r/311747 (owner: 10Alex Monk) [18:13:00] (03CR) 10Ottomata: "If they weren't hardcoded into .cfg files, these would have automatically been fixed to be compatible in the previous patch."
[puppet] - 10https://gerrit.wikimedia.org/r/311747 (owner: 10Alex Monk) [18:13:08] (03PS2) 10Hashar: jobrunner: refactor rsyslog conf and let wikidev read log [puppet] - 10https://gerrit.wikimedia.org/r/311719 (https://phabricator.wikimedia.org/T146040) [18:13:10] (03PS1) 10Hashar: jobrunner: log rotate jobchron.log [puppet] - 10https://gerrit.wikimedia.org/r/311750 (https://phabricator.wikimedia.org/T96132) [18:13:14] (03CR) 10Alex Monk: "I'll look into it after shinken has started working again" [puppet] - 10https://gerrit.wikimedia.org/r/311747 (owner: 10Alex Monk) [18:14:33] 06Operations, 10MediaWiki-JobRunner, 13Patch-For-Review: jobchron logs are not rotated - https://phabricator.wikimedia.org/T96132#2653026 (10hashar) 05Resolved>03Open jobchron was no more logged. I send a fix for the rsyslog configuration 8499f49249ebd2f4493f5d36996c7d4589c1400a . And there is no logrota... [18:14:45] 06Operations, 10EventBus, 10hardware-requests: eqiad/codfw: 1+1 Kafka broker in main clusters in eqiad and codfw - https://phabricator.wikimedia.org/T145082#2653029 (10Ottomata) These installs don't have to happen at the same time. Since there are eqiad spares, can we install one of these before we make the... 
[18:14:55] (03CR) 10Ottomata: [C: 032] Unbreak I863367b8: Fix existing check_graphite(_series)?_threshold callers [puppet] - 10https://gerrit.wikimedia.org/r/311747 (owner: 10Alex Monk) [18:14:59] (03PS2) 10Ottomata: Unbreak I863367b8: Fix existing check_graphite(_series)?_threshold callers [puppet] - 10https://gerrit.wikimedia.org/r/311747 (owner: 10Alex Monk) [18:15:11] (03CR) 10Ottomata: [V: 032] Unbreak I863367b8: Fix existing check_graphite(_series)?_threshold callers [puppet] - 10https://gerrit.wikimedia.org/r/311747 (owner: 10Alex Monk) [18:15:22] Krenair: thanks btw [18:16:49] ottomata: I have no idea sorry :( Just noticed the follow up change in my gerrit mails :D [18:17:05] (03PS1) 10Yuvipanda: labs: Setup the standalone puppetmaster to use ENC [puppet] - 10https://gerrit.wikimedia.org/r/311751 [18:17:17] (03PS1) 10Jcrespo: Refactor mariadb role to add role mariadb::grants::production [puppet] - 10https://gerrit.wikimedia.org/r/311752 (https://phabricator.wikimedia.org/T146146) [18:18:53] ottomata, thanks for merging... please try to be more careful next time :) [18:19:41] haha, i will take a little bit of the blame on that one, but really those files shouldn't exist like that [18:20:09] if you want to use the check_graphite script from puppet, you should use the puppet define that knows how to use it [18:20:35] (s/you/generic you/ :) ) [18:21:06] (03CR) 10jenkins-bot: [V: 04-1] Refactor mariadb role to add role mariadb::grants::production [puppet] - 10https://gerrit.wikimedia.org/r/311752 (https://phabricator.wikimedia.org/T146146) (owner: 10Jcrespo) [18:22:33] 06Operations, 06Editing-Department, 10Monitoring, 06Release-Engineering-Team, 07Wikimedia-Incident: High failure rate of account creation should trigger an alarm / page people - https://phabricator.wikimedia.org/T146090#2653054 (10Tgr) Note that failure means the authentication code ran successfully but...
[18:22:51] (03PS2) 10Hashar: jobrunner: log rotate jobchron.log [puppet] - 10https://gerrit.wikimedia.org/r/311750 (https://phabricator.wikimedia.org/T96132) [18:23:12] ottomata, so I'm looking at it again because shinken hasn't recovered yet [18:23:26] think I got things the wrong way around [18:23:30] oh? [18:23:48] (03PS3) 10Hashar: jobrunner: log rotate jobchron.log [puppet] - 10https://gerrit.wikimedia.org/r/311750 (https://phabricator.wikimedia.org/T96132) [18:23:50] (03PS3) 10Hashar: jobrunner: refactor rsyslog conf and let wikidev read log [puppet] - 10https://gerrit.wikimedia.org/r/311719 (https://phabricator.wikimedia.org/T146040) [18:24:07] (03PS2) 10Jcrespo: Refactor mariadb role to add role mariadb::grants::production [puppet] - 10https://gerrit.wikimedia.org/r/311752 (https://phabricator.wikimedia.org/T146146) [18:24:44] ottomata, think I may have things the wrong way around [18:24:55] lemme get a prod example of one... [18:25:24] (03CR) 10Hashar: "Rebased on top of https://gerrit.wikimedia.org/r/#/c/311750/ which adds a logrotate file for /var/log/mediawiki/jobchron.log" [puppet] - 10https://gerrit.wikimedia.org/r/311719 (https://phabricator.wikimedia.org/T146040) (owner: 10Hashar) [18:25:48] check_graphite_threshold!https://graphite.wikimedia.org!10!movingAverage(eventlogging.overall.inserted.rate, "10min")!50!10!25min!10min!20!--under [18:26:12] (03PS2) 10Yuvipanda: labs: Setup the standalone puppetmaster to use ENC [puppet] - 10https://gerrit.wikimedia.org/r/311751 [18:26:18] (03PS1) 10Alex Monk: Follow-up I80040bee: Fix my order of parameters [puppet] - 10https://gerrit.wikimedia.org/r/311754 [18:26:23] ottomata, ^ [18:26:24] until => '10min', [18:26:28] (03PS3) 10Yuvipanda: labs: Setup the standalone puppetmaster to use ENC [puppet] - 10https://gerrit.wikimedia.org/r/311751 [18:26:31] (03CR) 10Yuvipanda: [C: 032 V: 032] labs: Setup the standalone puppetmaster to use ENC [puppet] - 10https://gerrit.wikimedia.org/r/311751 (owner: 10Yuvipanda) 
[18:26:31] so yeah you might be one off [18:26:32] looking [18:26:55] looks right [18:26:57] waiting for jenkins [18:27:07] modules/monitoring/manifests/graphite_threshold.pp says: [18:27:08] # $ARG7$ --until end sampling date (negative relative time from now) [18:27:08] # $ARG8$ --perc percentage of exceeding datapoints [18:27:18] aye [18:27:33] those are so easy to mix up too! [18:27:34] ha [18:27:41] !$ARG!1!1!...! [18:27:42] hah [18:27:49] (03CR) 10Ottomata: [C: 032] Follow-up I80040bee: Fix my order of parameters [puppet] - 10https://gerrit.wikimedia.org/r/311754 (owner: 10Alex Monk) [18:27:53] (03PS2) 10Ottomata: Follow-up I80040bee: Fix my order of parameters [puppet] - 10https://gerrit.wikimedia.org/r/311754 (owner: 10Alex Monk) [18:27:55] (03CR) 10Ottomata: [V: 032] Follow-up I80040bee: Fix my order of parameters [puppet] - 10https://gerrit.wikimedia.org/r/311754 (owner: 10Alex Monk) [18:28:03] merged [18:28:08] ty [18:29:28] ottomata: looks like check graphite is fixed on labs :D [18:29:45] ottomata, that did the trick, thanks [18:29:51] thanks [18:29:53] will look into having it generated later [18:29:56] cool, thank you [18:33:28] hi hoo [18:34:56] (03PS2) 10Dzahn: admin: create shell account for Melody Kramer [puppet] - 10https://gerrit.wikimedia.org/r/311485 (https://phabricator.wikimedia.org/T145387) [18:35:20] hi [18:36:56] excess flood already? [18:39:07] (03CR) 10Dzahn: [C: 032] "also got approval from Juliet Barbara on gtalk" [puppet] - 10https://gerrit.wikimedia.org/r/311485 (https://phabricator.wikimedia.org/T145387) (owner: 10Dzahn) [18:41:59] Hi hoo [18:42:06] Is everything ok? [18:42:08] (03PS3) 10Jcrespo: Refactor mariadb role to add role mariadb::grants::production [puppet] - 10https://gerrit.wikimedia.org/r/311752 (https://phabricator.wikimedia.org/T146146) [18:42:27] ? 
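(Editor's sketch, not part of the channel log.) The parameter mix-up fixed between 18:25 and 18:28 is easy to make because a `check_graphite_threshold` check_command is a `!`-separated list of purely positional arguments. The Python below names each position and flags values that look out of place. Only `$ARG7$` (`--until`) and `$ARG8$` (`--perc`) are confirmed by the log; the remaining position names are inferred from the production example and the generated command line quoted in the channel, so treat them as assumptions. The helper names, the duration pattern, and the lack of handling for escaped `\!` are likewise simplifications made for this example.

```python
import re

# Production example quoted in the log at 18:25:48.
PROD_EXAMPLE = (
    'check_graphite_threshold!https://graphite.wikimedia.org!10!'
    'movingAverage(eventlogging.overall.inserted.rate, "10min")'
    '!50!10!25min!10min!20!--under'
)

# Assumed positional layout for $ARG1$..$ARG9$; only positions 7 and 8
# are confirmed by modules/monitoring/manifests/graphite_threshold.pp
# as quoted in the channel.
ARG_NAMES = ("url", "timeout", "metric", "warning", "critical",
             "from", "until", "perc", "modifier")

# Heuristic: Graphite-style relative times such as 10min or 1h.
_DURATION_RE = re.compile(r"^\d+(s|min|h|d|w|mon|y)$")


def parse_check_command(check_command):
    """Split a check_command into (command name, {arg name: value})."""
    command, *args = check_command.split("!")
    if len(args) != len(ARG_NAMES):
        raise ValueError(
            "expected %d arguments, got %d" % (len(ARG_NAMES), len(args)))
    return command, dict(zip(ARG_NAMES, args))


def find_misplaced_args(named):
    """Return names whose values do not look like what that slot expects."""
    problems = []
    for key in ("from", "until"):
        if not _DURATION_RE.match(named[key]):
            problems.append(key)
    if not named["perc"].isdigit():
        problems.append("perc")
    return problems


if __name__ == "__main__":
    command, named = parse_check_command(PROD_EXAMPLE)
    print(command, named["until"], named["perc"])
    print(find_misplaced_args(named))
```

Swapping the seventh and eighth fields, as in the first attempted fix, makes `find_misplaced_args` report both `until` and `perc`, which is the kind of one-off error the `!$ARG!1!1!...!` joke in the channel refers to.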
[18:42:49] hi audephone… I'm a little stressed by exams, but despite of that, I think so [18:42:51] are we still on wmf18 [18:42:57] Ok :) [18:43:12] (03PS4) 10Yuvipanda: puppet: Add option to use newer ENC [puppet] - 10https://gerrit.wikimedia.org/r/310952 (https://phabricator.wikimedia.org/T91990) [18:43:16] (03CR) 10jenkins-bot: [V: 04-1] Refactor mariadb role to add role mariadb::grants::production [puppet] - 10https://gerrit.wikimedia.org/r/311752 (https://phabricator.wikimedia.org/T146146) (owner: 10Jcrespo) [18:43:19] (03PS4) 10Jcrespo: Refactor mariadb role to add role mariadb::grants::production [puppet] - 10https://gerrit.wikimedia.org/r/311752 (https://phabricator.wikimedia.org/T146146) [18:43:34] audephone: yes, still on wmf.18, perf issue being investigated [18:43:39] Ok [18:44:05] I was hoping to get new wikidata code out this week but might just do some backports [18:44:41] audephone: only if they fix issues, no new code [18:44:41] For forward compatibility with core [18:45:09] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 680 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4312986 keys - replication_delay is 680 [18:45:19] Only preventive if new core wmf20 is deployed sometime with current wikibase [18:45:25] (03CR) 10Rush: [C: 032] nfsclient: Create /data/scratch symlink only if mount is present (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/308941 (owner: 10Madhuvishy) [18:45:30] (03PS6) 10Rush: nfsclient: Create /data/scratch symlink only if mount is present [puppet] - 10https://gerrit.wikimedia.org/r/308941 (owner: 10Madhuvishy) [18:45:45] Though I want to have a new branch of wikibase for wmf20 [18:46:03] right right [18:46:27] Probably would be during swat tomorrow [18:47:17] (03CR) 10Rush: [V: 032] nfsclient: Create /data/scratch symlink only if mount is present [puppet] - 10https://gerrit.wikimedia.org/r/308941 (owner: 10Madhuvishy) [18:49:31] (03PS5) 10Jcrespo: Refactor mariadb 
role to add role mariadb::grants::production [puppet] - 10https://gerrit.wikimedia.org/r/311752 (https://phabricator.wikimedia.org/T146146) [18:51:34] (03PS5) 10Yuvipanda: puppet: Add option to use newer ENC [puppet] - 10https://gerrit.wikimedia.org/r/310952 (https://phabricator.wikimedia.org/T91990) [18:51:39] (03CR) 10Yuvipanda: [C: 032 V: 032] puppet: Add option to use newer ENC [puppet] - 10https://gerrit.wikimedia.org/r/310952 (https://phabricator.wikimedia.org/T91990) (owner: 10Yuvipanda) [18:53:40] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:54:11] (03CR) 10Jcrespo: [C: 032] "Looks sane: https://puppet-compiler.wmflabs.org/4136/" [puppet] - 10https://gerrit.wikimedia.org/r/311752 (https://phabricator.wikimedia.org/T146146) (owner: 10Jcrespo) [18:54:17] (03PS6) 10Jcrespo: Refactor mariadb role to add role mariadb::grants::production [puppet] - 10https://gerrit.wikimedia.org/r/311752 (https://phabricator.wikimedia.org/T146146) [18:54:35] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2653227 (10Nuria) >@ellery, I talked about this with @mpopov Friday, and he told me that Discovery uses unique tokens as standard practice in their experiments. (They set an >ex... 
[18:55:47] (03CR) 10Dzahn: "oops, i didn't see this one when i made https://gerrit.wikimedia.org/r/#/c/311485/ , so the user has been created, but the groups need to " [puppet] - 10https://gerrit.wikimedia.org/r/311400 (https://phabricator.wikimedia.org/T145387) (owner: 10ArielGlenn) [18:57:12] (03PS1) 10Thcipriani: Beta: deployment-mira02 ip address updates [puppet] - 10https://gerrit.wikimedia.org/r/311760 [18:58:31] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2653269 (10Nuria) @Neil_P._Quinn_WMF , @ellery Please have in mind that in any of discovery's test there is no knowledge as to whether the user is part of another test (ex: hov... [19:00:04] To be determined: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160920T1900). Please do the needful. [19:01:11] heh [19:01:42] hey that's me [19:02:19] fixed [19:02:43] thcipriani: any update on that perf regressions? 
[19:02:44] -s [19:02:57] (03PS4) 10Hoo man: More error logging/ sanity checks for dumpwikidata [puppet] - 10https://gerrit.wikimedia.org/r/311551 [19:02:58] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4305260 keys - replication_delay is 0 [19:03:47] greg-g: nothing that I've seen https://phabricator.wikimedia.org/T146099 [19:04:11] (03PS1) 10Yuvipanda: puppet: Disable enc by default on trusty for now [puppet] - 10https://gerrit.wikimedia.org/r/311761 [19:04:42] so I suppose that means train is on hold for the time being :( [19:04:45] (03CR) 10jenkins-bot: [V: 04-1] puppet: Disable enc by default on trusty for now [puppet] - 10https://gerrit.wikimedia.org/r/311761 (owner: 10Yuvipanda) [19:05:05] yep, wanna send the email or me (I have to go take care of lunch-time child duties so I'll be delayed) [19:05:24] there's a thread that antoine started about wmf.19 rollback we could piggy back/continue [19:05:40] sure [19:05:45] (03PS1) 10Andrew Bogott: Puppet Panel: No checkboxes! [puppet] - 10https://gerrit.wikimedia.org/r/311762 (https://phabricator.wikimedia.org/T91990) [19:05:51] thanks sir, bbiab [19:06:00] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:06:02] (03PS2) 10Andrew Bogott: Puppet Panel: No checkboxes! [puppet] - 10https://gerrit.wikimedia.org/r/311762 (https://phabricator.wikimedia.org/T91990) [19:10:31] (03PS26) 1020after4: Scap swat command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) [19:12:25] (03PS27) 1020after4: Scap swat command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) [19:12:56] thcipriani, greg-g: are we re-deploying wmf.19 this week? or skipping to wmf.20? [19:13:39] (03CR) 10Andrew Bogott: [C: 032] Puppet Panel: No checkboxes! 
[puppet] - 10https://gerrit.wikimedia.org/r/311762 (https://phabricator.wikimedia.org/T91990) (owner: 10Andrew Bogott) [19:14:00] nothing large seems broken [19:14:11] I will return later to check no puppet run failed [19:14:19] to fix it [19:14:42] legoktm: the hope was to deploy wmf.20 this week, but we've decided to pause cutting wmf.20 to allow time to figure out the perf regression in wmf.18. Still tbd at this point. [19:14:50] ok [19:15:44] actually, labsdb are complaining [19:15:57] it will not break anything, anyway [19:16:15] (03PS4) 10Rush: labstore: use secondary interface for DRBD replication [puppet] - 10https://gerrit.wikimedia.org/r/311732 [19:16:49] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:17:27] actually, that is good that it failed [19:17:30] (03PS5) 10Rush: labstore: use secondary interface for DRBD replication [puppet] - 10https://gerrit.wikimedia.org/r/311732 [19:17:43] because it means something was logically wrong, there, too [19:18:19] (03PS3) 10Rush: labs: tc-setup add 'clean' option to remove shaping [puppet] - 10https://gerrit.wikimedia.org/r/311725 [19:18:49] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [19:21:38] (03CR) 10Hashar: [C: 031] "I have missed that one in my grep earlier today :(" [puppet] - 10https://gerrit.wikimedia.org/r/311760 (owner: 10Thcipriani) [19:24:13] (03PS28) 1020after4: Scap swat command [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) [19:24:56] (03CR) 10Rush: [C: 032] labs: tc-setup add 'clean' option to remove shaping [puppet] - 10https://gerrit.wikimedia.org/r/311725 (owner: 10Rush) [19:24:58] (03CR) 10Madhuvishy: [C: 031] labstore: use secondary interface for DRBD replication [puppet] - 10https://gerrit.wikimedia.org/r/311732 (owner: 10Rush) [19:25:03] (03PS2) 10Andrew Bogott: 
openstack: remove unused volume class, update default version [puppet] - 10https://gerrit.wikimedia.org/r/311304 (owner: 10Alex Monk) [19:26:26] (03PS6) 10Rush: labstore: use secondary interface for DRBD replication [puppet] - 10https://gerrit.wikimedia.org/r/311732 [19:26:36] (03CR) 10Hoo man: "I've tested the changed bits individually on my local machine (but not the whole script)." [puppet] - 10https://gerrit.wikimedia.org/r/311551 (owner: 10Hoo man) [19:27:10] (03PS1) 10Jcrespo: labsdb: stop using role::mariadb::grants, only used for production [puppet] - 10https://gerrit.wikimedia.org/r/311764 (https://phabricator.wikimedia.org/T146146) [19:27:27] (03CR) 10Andrew Bogott: [C: 032] "Thanks for the cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/311304 (owner: 10Alex Monk) [19:27:35] (03CR) 10Rush: [C: 032 V: 032] labstore: use secondary interface for DRBD replication [puppet] - 10https://gerrit.wikimedia.org/r/311732 (owner: 10Rush) [19:27:40] (03PS7) 10Rush: labstore: use secondary interface for DRBD replication [puppet] - 10https://gerrit.wikimedia.org/r/311732 [19:27:42] (03CR) 10Rush: [V: 032] labstore: use secondary interface for DRBD replication [puppet] - 10https://gerrit.wikimedia.org/r/311732 (owner: 10Rush) [19:28:06] andrewbogott: I caught Andrew Bogott: openstack: remove unused volume class, update default version (01f6ac0) [19:28:08] merge? [19:28:41] chasemp: I'm already midway through merging it [19:28:42] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:28:43] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [19:28:55] (03PS2) 10Jcrespo: labsdb: stop using role::mariadb::grants, only used for production [puppet] - 10https://gerrit.wikimedia.org/r/311764 (https://phabricator.wikimedia.org/T146146) [19:29:44] o no, raid checks again no! 
[19:30:39] (03PS3) 10Dzahn: access to stats1002/3 and fluorine for Melody Kramer [puppet] - 10https://gerrit.wikimedia.org/r/311400 (https://phabricator.wikimedia.org/T145387) (owner: 10ArielGlenn) [19:31:19] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [19:31:43] (03CR) 10jenkins-bot: [V: 04-1] access to stats1002/3 and fluorine for Melody Kramer [puppet] - 10https://gerrit.wikimedia.org/r/311400 (https://phabricator.wikimedia.org/T145387) (owner: 10ArielGlenn) [19:31:48] (03CR) 10Jcrespo: [C: 032] labsdb: stop using role::mariadb::grants, only used for production [puppet] - 10https://gerrit.wikimedia.org/r/311764 (https://phabricator.wikimedia.org/T146146) (owner: 10Jcrespo) [19:32:28] (03PS4) 10Dzahn: access to stats1002/3 and fluorine for Melody Kramer [puppet] - 10https://gerrit.wikimedia.org/r/311400 (https://phabricator.wikimedia.org/T145387) (owner: 10ArielGlenn) [19:32:34] thcipriani: I'm gonna backport a small proofreadpage change that is making it really hard to edit wikisource https://gerrit.wikimedia.org/r/#/c/311765/ [19:32:48] err, the bug it fixes makes it hard to edit wikisource. 
the patch should do the opposite ;) [19:32:58] heh [19:33:10] as long as that's the case it sounds good to me :) [19:34:28] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [19:34:51] (03CR) 10Andrew Bogott: [C: 032] remove extra cluster: labvirt hieradata [puppet] - 10https://gerrit.wikimedia.org/r/309693 (owner: 10Alex Monk) [19:35:24] (03PS1) 10Yuvipanda: Revert "puppet: Add option to use newer ENC" [puppet] - 10https://gerrit.wikimedia.org/r/311766 [19:35:31] (03PS2) 10Yuvipanda: Revert "puppet: Add option to use newer ENC" [puppet] - 10https://gerrit.wikimedia.org/r/311766 [19:35:36] (03CR) 10Yuvipanda: [C: 032 V: 032] Revert "puppet: Add option to use newer ENC" [puppet] - 10https://gerrit.wikimedia.org/r/311766 (owner: 10Yuvipanda) [19:38:48] (03PS1) 10Yuvipanda: puppet: Use newer ENC [puppet] - 10https://gerrit.wikimedia.org/r/311767 (https://phabricator.wikimedia.org/T91990) [19:40:31] (03CR) 10Yuvipanda: [C: 032] puppet: Use newer ENC [puppet] - 10https://gerrit.wikimedia.org/r/311767 (https://phabricator.wikimedia.org/T91990) (owner: 10Yuvipanda) [19:40:56] (03PS1) 10Jcrespo: labsdns: Remove mysql grants from non-dedicated services [puppet] - 10https://gerrit.wikimedia.org/r/311768 (https://phabricator.wikimedia.org/T146146) [19:42:36] (03CR) 10Jcrespo: [C: 032] labsdns: Remove mysql grants from non-dedicated services [puppet] - 10https://gerrit.wikimedia.org/r/311768 (https://phabricator.wikimedia.org/T146146) (owner: 10Jcrespo) [19:42:44] (03PS2) 10Jcrespo: labsdns: Remove mysql grants from non-dedicated services [puppet] - 10https://gerrit.wikimedia.org/r/311768 (https://phabricator.wikimedia.org/T146146) [19:43:32] (03PS2) 10Andrew Bogott: labs firstboot.sh: Add instance hostname to /etc/hosts [puppet] - 10https://gerrit.wikimedia.org/r/311212 (https://phabricator.wikimedia.org/T120830) (owner: 10Alex Monk) [19:43:55] 06Operations, 15User-Addshore: Migrate Graphana dashboard 
"labs-project-board" from prod to labs Graphana instance - https://phabricator.wikimedia.org/T146136#2653601 (10hashar) Awesome! I have fixed up the datasource for the dashboard variables. I have made the one from production to point to labs https://g... [19:45:23] (03CR) 10Andrew Bogott: [C: 032] labs firstboot.sh: Add instance hostname to /etc/hosts [puppet] - 10https://gerrit.wikimedia.org/r/311212 (https://phabricator.wikimedia.org/T120830) (owner: 10Alex Monk) [19:45:29] (03PS3) 10Andrew Bogott: labs firstboot.sh: Add instance hostname to /etc/hosts [puppet] - 10https://gerrit.wikimedia.org/r/311212 (https://phabricator.wikimedia.org/T120830) (owner: 10Alex Monk) [19:48:50] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:50:58] !log legoktm@tin Synchronized php-1.28.0-wmf.18/extensions/ProofreadPage/modules/page/ext.proofreadpage.page.edit.js: Makes sure that the zoom widget is initialized before zooming in/out - https://gerrit.wikimedia.org/r/#/c/311765/ (duration: 00m 48s) [19:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:51:12] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp2011 is CRITICAL: Connection refused [19:51:31] PROBLEM - puppet last run on cp2011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnish] [19:55:23] RECOVERY - puppet last run on labsdb1005 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [19:57:49] 06Operations, 10Cassandra, 10RESTBase-Cassandra: Efficacy of DateTieredCompactionStrategy - https://phabricator.wikimedia.org/T126221#2653634 (10Eevans) See also {T140008}, in particular the comments starting [[ https://phabricator.wikimedia.org/T140008#2520762 | here ]]. TL;DR In addition to all of the is... 
[19:58:24] 06Operations, 10Cassandra, 10RESTBase-Cassandra: Evaluate efficacy of DateTieredCompactionStrategy - https://phabricator.wikimedia.org/T126221#2653642 (10Eevans) p:05High>03Normal [20:01:12] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [20:06:16] (03CR) 10Andrew Bogott: [C: 032] openstack nova network: update private_ips of instances [puppet] - 10https://gerrit.wikimedia.org/r/311210 (owner: 10Alex Monk) [20:06:20] (03PS2) 10Andrew Bogott: openstack nova network: update private_ips of instances [puppet] - 10https://gerrit.wikimedia.org/r/311210 (owner: 10Alex Monk) [20:07:48] !log legoktm@tin Synchronized php-1.28.0-wmf.18/extensions/ProofreadPage/modules/page/ext.proofreadpage.page.edit.js: Initializes the zoom widget after page loading (duration: 00m 47s) [20:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:08:48] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 03Discovery-Wikidata-Query-Service-Sprint: publish lag and response time for wdqs codfw to graphite - https://phabricator.wikimedia.org/T146207#2653667 (10Gehel) [20:09:04] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: publish lag and response time for wdqs codfw to graphite - https://phabricator.wikimedia.org/T146207#2653686 (10Gehel) [20:11:21] 06Operations, 06Release-Engineering-Team, 07HHVM, 13Patch-For-Review: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2604309 (10hashar) Lets get a custom flavor for the deployment servers. 8 CPUs to get faster l10n rebuild 8 GB RAM: 2G for system, 6G for cache,... 
[20:13:28] !log restbase deploy ca41acd3f to staging [20:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:15:51] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 648 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4312777 keys - replication_delay is 648 [20:16:31] RECOVERY - Varnish HTTP upload-backend - port 3128 on cp2011 is OK: HTTP OK: HTTP/1.1 200 OK - 177 bytes in 0.076 second response time [20:16:52] RECOVERY - puppet last run on cp2011 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:20:22] !log restbase deploy ca41acd3f canary on restbase1007 [20:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:26:09] !log restbase deploy ca41acd3f [20:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:28:09] 06Operations, 10Phabricator: phabricator: can't search for RT tickets (reference field) anymore - https://phabricator.wikimedia.org/T146116#2653809 (10Paladox) This was working before June. I just tested all the releases we did and I found it broke on one of the releases in June. [20:33:13] 06Operations, 10Cassandra: Address abnormally wide partitions - https://phabricator.wikimedia.org/T143056#2653836 (10Eevans) [20:33:15] 06Operations: setup wmf4747/wmf4748/wmf4749/wmf4750 for temp kubernetes testing - https://phabricator.wikimedia.org/T146171#2653839 (10RobH) I went with the asset tags as hostname, since allocating our diminishing list of element names for systems that won't be around for more than a month seemed more trouble th... 
[20:33:47] 06Operations, 06Release-Engineering-Team, 07HHVM, 13Patch-For-Review: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2653842 (10Andrew) [20:36:07] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4308940 keys - replication_delay is 0 [20:56:48] 06Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, 13Patch-For-Review: Automated invocation of Cassandra repair jobs - https://phabricator.wikimedia.org/T92355#2653926 (10Eevans) [20:58:58] PROBLEM - Disk space on thumbor1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=83%) [20:59:36] (03PS2) 10Hashar: Beta: change deployment-mira02 to deployment-mira [puppet] - 10https://gerrit.wikimedia.org/r/311760 (https://phabricator.wikimedia.org/T144578) (owner: 10Thcipriani) [21:00:50] PROBLEM - Disk space on thumbor1002 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=83%) [21:07:15] Request from [xxxx] via cp1054 cp1054, Varnish XID 2206978776 [21:07:15] Error: 503, Service Unavailable at Tue, 20 Sep 2016 21:06:29 GMT [21:11:43] 06Operations, 10Cassandra, 06Services: restbase2004.codfw.wmnet data corruption - https://phabricator.wikimedia.org/T144826#2653992 (10Eevans) 05Open>03Resolved a:03Eevans Given the rate at which corruption //was// occurring, it would seem that at this time, it has stopped. That a reboot would "fix" t... [21:11:50] quiddity, just once? [21:11:54] or multiple times? [21:12:22] Krenair, I'm trying to reproduce, I think this might just be my test account that has a known problem. [21:13:40] ok [21:13:49] Yeah, just the one account. Nvm, sorry for the distraction. 
[21:17:04] !log rsync initial transfer of others on labstore1001 to labstore1004 [21:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:26:07] 06Operations, 10Mail: status of wikigroup@ alias - https://phabricator.wikimedia.org/T127551#2654010 (10bbogaert) Hi Daniel, The wikigroup name currently exists in LDAP. We can move from Mailing List. Thanks, Byron [21:26:56] !log starting branch cut for wmf.20 [21:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:29:45] (03PS3) 10Hashar: Beta: change deployment-mira02 to deployment-mira [puppet] - 10https://gerrit.wikimedia.org/r/311760 (https://phabricator.wikimedia.org/T144578) (owner: 10Thcipriani) [21:30:24] (03CR) 10Hashar: [C: 031] "Hijacked / repurposed to switch to deployment-mira (host with larger disk)." [puppet] - 10https://gerrit.wikimedia.org/r/311760 (https://phabricator.wikimedia.org/T144578) (owner: 10Thcipriani) [21:31:01] 06Operations, 10Mail: status of fdcsupport@ ? - https://phabricator.wikimedia.org/T127548#2654022 (10bbogaert) Hi Daniel, I created the group in LDAP and created the corresponding Google Group. Katy and Winifred have been added to this group. We can move this away from a Mailing List now. Thanks, Byron [21:33:56] PROBLEM - Host mw1294 is DOWN: PING CRITICAL - Packet loss = 100% [21:40:27] PROBLEM - puppet last run on mw2104 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:41:31] !log powercycled mw1294 (down, frozen console) [21:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:42:47] RECOVERY - Host mw1294 is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms [21:48:28] (03CR) 10Hashar: "deployment-mira is passing :]" [puppet] - 10https://gerrit.wikimedia.org/r/311760 (https://phabricator.wikimedia.org/T144578) (owner: 10Thcipriani) [21:53:38] 06Operations, 06Release-Engineering-Team, 07HHVM, 13Patch-For-Review: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2654077 (10hashar) I did a sprint tonight: * Got a new flavor in openstack with larger disk T146209, huge thanks to Andrew to have created it up... [22:02:24] 06Operations, 10Mail: status of wikigroup@ alias - https://phabricator.wikimedia.org/T127551#2654150 (10Dzahn) @bbogaert Cool, thank you! I removed wikigroup@ on our side just now. [22:03:54] 06Operations, 10Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#2654155 (10Dzahn) [22:03:56] 06Operations, 10Mail: status of wikigroup@ alias - https://phabricator.wikimedia.org/T127551#2654152 (10Dzahn) 05Open>03Resolved a:03Dzahn [mx1001:~] $ sudo exim4 -bt wikigroup@wikimedia.org wikigroup@wikimedia.org router = ldap_group, transport = remote_smtp host aspmx.l.google.com [173.194.204.26] [22:05:28] RECOVERY - puppet last run on mw2104 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [22:18:28] 06Operations, 10Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#2654234 (10Dzahn) [22:18:30] 06Operations, 10Mail: status of fdcsupport@ ? - https://phabricator.wikimedia.org/T127548#2654231 (10Dzahn) 05Open>03Resolved a:03Dzahn Hi Byron, thanks! removed on our side just now. 
before: [mx1001:~] $ sudo exim4 -bt fdcsupport@wikimedia.org dmenard@wikimedia.org <-- fdcsupport@wikimedia.org r... [22:19:09] PROBLEM - puppet last run on wtp2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:19:19] 06Operations, 10DBA, 10MediaWiki-Maintenance-scripts, 06Release-Engineering-Team, and 2 others: Add section for long-running tasks on the Deployment page (specially for database maintenance) - https://phabricator.wikimedia.org/T144661#2654239 (10greg) For the task at hand, I've added https://wikitech.wikim... [22:21:47] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1003, stat1002, and fluorine for MelodyKramer - https://phabricator.wikimedia.org/T145387#2628342 (10Dzahn) The user has been created and exists on bast1001 now. [bast1001:~] $ id melodykramer uid=15457(melodykramer) gid=500(w... [22:26:12] !log change-prop deploying 4417255 [22:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:28:30] (03PS1) 10EBernhardson: Monitor usage of in-memory elasticsearch datastructures [puppet] - 10https://gerrit.wikimedia.org/r/311848 (https://phabricator.wikimedia.org/T144387) [22:30:28] 06Operations, 10DBA, 10MediaWiki-Maintenance-scripts, 06Release-Engineering-Team, and 2 others: Add section for long-running tasks on the Deployment page (specially for database maintenance) - https://phabricator.wikimedia.org/T144661#2654312 (10greg) Ok, emailed. Resolving. Thanks @jcrespo for the sugges... 
[22:30:37] 06Operations, 10DBA, 10MediaWiki-Maintenance-scripts, 06Release-Engineering-Team, and 2 others: Add section for long-running tasks on the Deployment page (specially for database maintenance) - https://phabricator.wikimedia.org/T144661#2654313 (10greg) a:03greg [22:31:15] 06Operations, 10DBA, 10MediaWiki-Maintenance-scripts, 06Release-Engineering-Team, and 2 others: Add section for long-running tasks on the Deployment page (specially for database maintenance) - https://phabricator.wikimedia.org/T144661#2606542 (10greg) 05Open>03Resolved p:05Triage>03Normal [22:35:06] 06Operations, 10Phabricator: phabricator: can't search for RT tickets (reference field) anymore - https://phabricator.wikimedia.org/T146116#2654334 (10Paladox) release/2015-11-18/1 release shows the reference field. [22:40:47] 06Operations: Icinga access for Zhouz and Slaporte - https://phabricator.wikimedia.org/T146227#2654360 (10Dzahn) ..and the existing LDAP wikitech users must match the Icinga contact (double check which is the right LDAP field and capitalization) [22:40:49] 06Operations: Icinga access for Zhouz and Slaporte - https://phabricator.wikimedia.org/T146227#2654387 (10Dzahn) [22:41:10] 06Operations: Icinga access for Zhouz and Slaporte - https://phabricator.wikimedia.org/T146227#2654360 (10Dzahn) a:05ZhouZ>03Dzahn [22:43:37] 06Operations, 10Mail: status of fdcsupport@ ? 
- https://phabricator.wikimedia.org/T127548#2654397 (10bbogaert) Updated Google Group to have members: dmenard and wolliff -Byron [22:44:18] RECOVERY - puppet last run on wtp2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:44:33] 06Operations: Icinga access for Zhouz and Slaporte - https://phabricator.wikimedia.org/T146227#2654400 (10Dzahn) The full title of the services are: Ensure legal html en.m.wp Ensure legal html en.wb Ensure legal html en.wp overview link: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=legal [22:46:28] 06Operations: Icinga access for Zhouz and Slaporte - https://phabricator.wikimedia.org/T146227#2654408 (10ZhouZ) Hi @Dzahn, just to confirm my Wikitech username is actually this: ZZhou (WMF) So no underscore - perhaps that was the issue before. Stephen's should still be: Slaporte [22:55:46] 06Operations: Icinga access for Zhouz and Slaporte - https://phabricator.wikimedia.org/T146227#2654444 (10Slaporte) That's right, my wikitech username is Slaporte [23:00:04] RoanKattouw, ostriches, MaxSem, and Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160920T2300). Please do the needful. [23:00:25] is it swat time? 
:D [23:00:33] * brion wonders if he snuck in fast enough [23:01:00] brion: Not fast enough for the bot, but fast enough for the deployer hopefully [23:01:25] Which I guess is me [23:01:32] \o/ [23:02:36] thx :) [23:02:44] RoanKattouw: remember https://wikitech.wikimedia.org/wiki/Deployments/Holding_the_train#What_happens_in_SWAT_while_the_train_is_on_hold.3F [23:02:48] :) [23:03:05] Yes :) [23:03:12] I reviewed Brion's change before proceeding :) [23:03:18] system('rm -rf /') [23:03:20] :) :) [23:03:41] Also I'm SWATting a cherry-pick of my own, but it's to an undeployed branch (wmf.19) so I believe that should be OK [23:11:59] BTW I can't wait until "sync to mw1099 first" is automated [23:12:18] I just Ctrl+C'ed a sync-dir realizing I should do that first, and it's not the first time that happens [23:13:01] brion: Please test your TMH change on mw1099 [23:13:06] (using the XWD extension etc) [23:13:16] woot [23:14:01] RoanKattouw: got a link for that ext? don't have it installed oops [23:14:28] brion: https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Browser_extensions [23:14:33] tx [23:14:39] Also the default setting is mw1017 so you have to tweak that to be mw1099 [23:15:51] excellent [23:15:54] RoanKattouw: confirmed fixed \o/ [23:16:35] yay [23:16:38] OK taking that to prod then [23:17:32] !log catrope@tin Synchronized php-1.28.0-wmf.18/extensions/TimedMediaHandler: SWAT (duration: 00m 50s) [23:17:37] thanks ! [23:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:17:39] RoanKattouw: https://phabricator.wikimedia.org/T142880 (not sure if you saw it yet) [23:17:57] RoanKattouw: After you're done - late addition for SWAT https://gerrit.wikimedia.org/r/#/c/311855/1 - potential fix for the .18 regression blocker [23:18:22] greg-g: I hadn't seen the task but I had heard you talk about this last week [23:18:34] oh right [23:19:19] PROBLEM - puppet last run on mw2114 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:19:23] Krinkle: Yup doing that next [23:20:20] !log catrope@tin Synchronized php-1.28.0-wmf.19/extensions/TimedMediaHandler: SWAT (duration: 00m 50s) [23:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:22:06] Krinkle: Also please add that to [[wikitech:Deployments]] if you haven't already [23:22:42] !log catrope@tin Synchronized php-1.28.0-wmf.19/extensions/Echo/: SWAT (duration: 00m 55s) [23:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:30:10] (03CR) 10Krinkle: [C: 04-1] Scap swat command (035 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) (owner: 1020after4) [23:30:45] Krinkle: OK, your requestIdleCallback thing is ready for testing on mw1099 [23:31:35] RoanKattouw: OK. Checking [23:33:43] RoanKattouw: Works as expected [23:34:17] Cool, deploying sitewide [23:35:40] !log catrope@tin Synchronized php-1.28.0-wmf.18/resources/src/mediawiki/mediawiki.js: Always use requestIdleCallback polyfill for batchEval (T146099) (duration: 00m 46s) [23:35:41] T146099: mw-1.28.0-wmf.18 load-time regression - https://phabricator.wikimedia.org/T146099 [23:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:44:40] RECOVERY - puppet last run on mw2114 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [23:57:00] (03CR) 1020after4: Scap swat command (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306259 (https://phabricator.wikimedia.org/T142880) (owner: 1020after4)