[00:00:11] PROBLEM - check_ipn_redir on mintaka is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently
[00:00:38] Dereckson: Left a comment on T143070.
[00:00:38] T143070: Undefined index in Score::generateHTML - https://phabricator.wikimedia.org/T143070
[00:01:00] (PS2) Dereckson: Add autopatrolled and rollbacker user groups to it.wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/304940 (https://phabricator.wikimedia.org/T142571)
[00:01:01] (CR) BryanDavis: [C: 1] "Should probably wait for jerkins to approve too" [puppet] - https://gerrit.wikimedia.org/r/304947 (owner: Yuvipanda)
[00:02:05] ostriches: seen it
[00:02:48] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/304940 (https://phabricator.wikimedia.org/T142571) (owner: Dereckson)
[00:03:13] (Merged) jenkins-bot: Add autopatrolled and rollbacker user groups to it.wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/304940 (https://phabricator.wikimedia.org/T142571) (owner: Dereckson)
[00:04:04] live on mw1099
[00:05:10] PROBLEM - check_ipn_redir on mintaka is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently
[00:05:29] Works.
[00:06:25] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Add autopatrolled and rollbacker user groups to it.wikinews (T142571) (duration: 00m 52s)
[00:06:26] T142571: Addition of "Autoverificati" (autopatrolled) and "Rollbacker" user groups in it.wikinews - https://phabricator.wikimedia.org/T142571
[00:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:08:22] (PS2) Dereckson: Set timezone to Europe/Ljubljana on sl. projects [mediawiki-config] - https://gerrit.wikimedia.org/r/304423 (https://phabricator.wikimedia.org/T142701)
[00:08:35] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/304423 (https://phabricator.wikimedia.org/T142701) (owner: Dereckson)
[00:09:01] (Merged) jenkins-bot: Set timezone to Europe/Ljubljana on sl. projects [mediawiki-config] - https://gerrit.wikimedia.org/r/304423 (https://phabricator.wikimedia.org/T142701) (owner: Dereckson)
[00:10:10] PROBLEM - check_ipn_redir on mintaka is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently
[00:10:11] live on mw1099
[00:12:28] (PS1) Chad: Remove UTC timezone settings [mediawiki-config] - https://gerrit.wikimedia.org/r/304953
[00:13:01] (CR) Chad: "Considering it's, you know, the default." [mediawiki-config] - https://gerrit.wikimedia.org/r/304953 (owner: Chad)
[00:13:39] Works.
[00:14:35] Dereckson, I did some digging, my issue happened around 2015-06-25 17:45 - https://phabricator.wikimedia.org/T103886 was created for it
[00:14:55] It did take out the wikis briefly but no incident report AFAICT
[00:15:10] PROBLEM - check_ipn_redir on mintaka is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently
[00:15:52] Krenair: count how many were on mw1099, fatalmonitor reported 10 of them (down at 4 currently)
[00:17:01] (CR) Dereckson: [C: 1] Remove UTC timezone settings [mediawiki-config] - https://gerrit.wikimedia.org/r/304953 (owner: Chad)
[00:17:05] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Set timezone to Europe/Ljubljana on sl. projects (T142701) (duration: 00m 49s)
[00:17:06] T142701: System time on sl projects - https://phabricator.wikimedia.org/T142701
[00:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:18:41] (CR) Chad: [C: 2] Remove UTC timezone settings [mediawiki-config] - https://gerrit.wikimedia.org/r/304953 (owner: Chad)
[00:19:08] (Merged) jenkins-bot: Remove UTC timezone settings [mediawiki-config] - https://gerrit.wikimedia.org/r/304953 (owner: Chad)
[00:20:10] PROBLEM - check_ipn_redir on mintaka is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently
[00:21:44] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: remove utc timezone overrides (duration: 00m 48s)
[00:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:21:53] RoanKattouw: by the way, Echo first edit works
[00:21:59] :)
[00:23:38] Krenair: don't forget to add your fix to https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160815T2300
[00:23:53] Dereckson, it wasn't a scheduled fix, I don't think we need to?
[00:24:11] we generally add them
[00:24:31] (with a +[[gerrit:]] to the first entry)
[00:24:56] I imagine that could be a reference of what has been deployed, in addition to a schedule of what to deploy.
[00:25:10] PROBLEM - check_ipn_redir on mintaka is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently
[00:28:37] Operations, Reading-Infrastructure-Team, Services, Services-next, Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2555945 (GWicke) One idea @aaron brought up is to do something like this: - For...
[00:28:40] stashbot seems pretty cool
[00:30:11] PROBLEM - check_ipn_redir on mintaka is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently
[00:30:26] Uploading file to commons:commons via API...
[00:30:26] Reading file https://farm8.staticflickr.com/7590/28587591155_744d55ae18_o.jpg
[00:30:26] WARNING: Non-JSON response received from server commons:commons; the server may be down.
[00:32:04] Operations, Ops-Access-Requests: Requesting access to stat1003, stat1002 and bast1001 for ovasileva - https://phabricator.wikimedia.org/T142502#2555949 (ovasileva) @RobH wikitech username: OVAsileva access groups: I think researchers, statistics-users and analytics-users makes sense. If they don't, wo...
[00:35:10] PROBLEM - check_ipn_redir on mintaka is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently
[00:37:48] Josve05a: if you can repro, debug and check what the response is
[00:39:06] Dereckson: still happenng...I'm using flickrripper.py on pywikibot...not sure how to debug something like that...I have two instances working on two different flickr feeds. One of the windows is working the other gets "WARNING: Non-JSON response received from server commons:commons; the server may be down. WARNING: Waiting 120 seconds before retrying."
[00:39:09] happening*
[00:40:10] PROBLEM - check_ipn_redir on mintaka is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently
[00:40:13] been going on for 112 min
[00:40:18] 11-12*
[00:43:15] you could ask on #pywikibot for assistance to get this NON-JSON response
[00:44:15] I'll do a restart and see if I still get the same response first...then I'll brew a batch of coffee and start investigating...thanks
[00:45:10] PROBLEM - check_ipn_redir on mintaka is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently
[00:48:00] Operations, Mail: Emails dropping from Greenhouse to Alan - https://phabricator.wikimedia.org/T142427#2555982 (bbogaert) Hi Alex, The emails were sent from no-reply@greenhouse.io to alau@wikimedia.org on 8/8/2016 between 11:30 and 12:20 PST. Thanks, Byron
[00:48:21] nope still getting it..gah
[00:48:51] Operations, Wikimedia-General-or-Unknown: Backup systems - https://phabricator.wikimedia.org/T20255#2555983 (Danny_B)
[00:49:07] What's the actual response?
[00:49:21] (PS1) Yuvipanda: diamond: Sort out some confusion between 'true' & True [puppet] - https://gerrit.wikimedia.org/r/304958
[00:50:10] PROBLEM - check_ipn_redir on mintaka is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently
[00:50:39] Reedy: No idea...doesn't say...
[00:53:45] no one really online in #pywikibot :/
[00:54:19] (CR) Yuvipanda: [C: 2] diamond: Sort out some confusion between 'true' & True [puppet] - https://gerrit.wikimedia.org/r/304958 (owner: Yuvipanda)
[00:54:27] (CR) Yuvipanda: [C: 2] spaces for the pep8 gods [puppet] - https://gerrit.wikimedia.org/r/304947 (owner: Yuvipanda)
[00:55:10] PROBLEM - check_ipn_redir on mintaka is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently
[00:57:01] https://www.irccloud.com/pastebin/R8eA2osm/
[00:57:09] Reedy, Dereckson ^ Something?
[01:00:10] PROBLEM - check_ipn_redir on mintaka is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently
[01:05:10] PROBLEM - check_ipn_redir on mintaka is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently
[01:10:10] PROBLEM - check_ipn_redir on mintaka is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently
[01:15:11] PROBLEM - check_ipn_redir on mintaka is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently
[01:20:10] PROBLEM - check_ipn_redir on mintaka is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently
[01:20:51] PROBLEM - puppet last run on labvirt1008 is CRITICAL: CRITICAL: puppet fail
[01:21:30] PROBLEM - puppet last run on labvirt1001 is CRITICAL: CRITICAL: puppet fail
[01:21:51] PROBLEM - puppet last run on labvirt1003 is CRITICAL: CRITICAL: puppet fail
[01:21:52] PROBLEM - puppet last run on labvirt1004 is CRITICAL: CRITICAL: puppet fail
[01:22:11] PROBLEM - puppet last run on labvirt1002 is CRITICAL: CRITICAL: puppet fail
[01:22:39] that's my aborted puppet run
[01:22:41] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: puppet fail
[01:22:49] I'll run it explicitly now
[01:22:51] PROBLEM - puppet last run on labvirt1006 is CRITICAL: CRITICAL: puppet fail
[01:23:50] RECOVERY - puppet last run on labvirt1003 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[01:23:51] RECOVERY - puppet last run on labvirt1004 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[01:24:10] RECOVERY - puppet last run on labvirt1002 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[01:24:40] RECOVERY - puppet last run on labvirt1014 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[01:24:41] RECOVERY - puppet last run on labvirt1008 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[01:24:50] RECOVERY - puppet last run on labvirt1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:25:10] PROBLEM - check_ipn_redir on mintaka is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently
[01:25:22] RECOVERY - puppet last run on labvirt1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:25:42] ACKNOWLEDGEMENT - check_ipn_redir on mintaka is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently Jeff_Green will fix tomorrow, not urgent
[02:25:28] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.14) (duration: 11m 31s)
[02:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:26:13] jouncebot: help
[02:26:47] jouncebot: next
[02:26:48] In 12 hour(s) and 33 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160816T1500)
[02:28:12] (PS3) Dzahn: Gerrit: Simplify rewrites to avoid mentioning the host unless needed [puppet] - https://gerrit.wikimedia.org/r/302980 (owner: Chad)
[02:31:11] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Aug 16 02:31:10 UTC 2016 (duration 5m 42s)
[02:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:34:25] (CR) Dzahn: [C: 2] Gerrit: Simplify rewrites to avoid mentioning the host unless needed [puppet] - https://gerrit.wikimedia.org/r/302980 (owner: Chad)
[02:36:28] (CR) Dzahn: "before/after:" [puppet] - https://gerrit.wikimedia.org/r/302980 (owner: Chad)
[02:42:04] (CR) Dzahn: "but https://gerrit.wikimedia.org/tools/hooks/commit-msg is 404 before and after" [puppet] - https://gerrit.wikimedia.org/r/302980 (owner: Chad)
[02:44:16] (CR) Dzahn: "also https://gerrit.wikimedia.org/r/gerrit_ui/undefined.cache.js that does not get that redirect to the hexcode for broken browser detect" [puppet] - https://gerrit.wikimedia.org/r/302980 (owner: Chad)
[02:58:19] (PS1) Dzahn: Revert "Gerrit: Simplify rewrites to avoid mentioning the host unless needed" [puppet] - https://gerrit.wikimedia.org/r/304962
[02:59:12] (CR) Dzahn: [C: 2] "the first 2 rules work but the 2 special cases further down are broken by this. before: 302 after: 404" [puppet] - https://gerrit.wikimedia.org/r/304962 (owner: Dzahn)
[03:02:30] (CR) Dzahn: "[tin:~] $ curl -I https://gerrit.wikimedia.org/tools/hooks/commit-msg" [puppet] - https://gerrit.wikimedia.org/r/304962 (owner: Dzahn)
[03:06:58] (CR) Dzahn: "the RedirectMatch lines were fine, the RewriteRule lines were not. we really need to test everything. also i feel like it's easier to over" [puppet] - https://gerrit.wikimedia.org/r/302980 (owner: Chad)
[03:12:36] (PS2) Dzahn: Gerrit: Default $db_host to localhost [puppet] - https://gerrit.wikimedia.org/r/304840 (owner: Chad)
[03:14:01] (CR) Dzahn: [C: 2] "no-op http://puppet-compiler.wmflabs.org/3710/" [puppet] - https://gerrit.wikimedia.org/r/304840 (owner: Chad)
[03:14:20] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.24
[03:16:01] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms
[03:19:34] (PS2) Dzahn: Gerrit: Minor config tidying to avoid puppet/init inconsistencies [puppet] - https://gerrit.wikimedia.org/r/304838 (owner: Chad)
[03:20:22] (CR) Dzahn: [C: 2] Gerrit: Minor config tidying to avoid puppet/init inconsistencies [puppet] - https://gerrit.wikimedia.org/r/304838 (owner: Chad)
[03:21:40] !log lead - restarted apache
[03:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:21:49] !log gerrit restarting to apply config change 304838
[03:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:42:07] !log restarted grrrit-wm
[03:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:45:08] (CR) Dzahn: "no-op on lead" [puppet] - https://gerrit.wikimedia.org/r/302949 (owner: Chad)
[03:46:16] (CR) Dzahn: "20after4: ack?" [puppet] - https://gerrit.wikimedia.org/r/301847 (owner: Chad)
[03:52:20] (CR) Dzahn: "can you fix "The change could not be rebased due to a conflict during merge."" [puppet] - https://gerrit.wikimedia.org/r/302229 (https://phabricator.wikimedia.org/T76459) (owner: Paladox)
[03:55:13] (PS15) Dzahn: Gerrit: Support having phab commits as links [puppet] - https://gerrit.wikimedia.org/r/302229 (https://phabricator.wikimedia.org/T76459) (owner: Paladox)
[03:56:32] (PS16) Paladox: Gerrit: Support linking to phabricator comments [puppet] - https://gerrit.wikimedia.org/r/302229 (https://phabricator.wikimedia.org/T76459)
[03:57:16] (CR) Dzahn: [C: 2] Gerrit: Support linking to phabricator comments [puppet] - https://gerrit.wikimedia.org/r/302229 (https://phabricator.wikimedia.org/T76459) (owner: Paladox)
[03:58:44] !log gerrit restarting to apply config change 302229
[03:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:02:29] !log restarted grrrit-wm
[04:03:21] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:03:52] (CR) Dzahn: "T75997#2518198" [puppet] - https://gerrit.wikimedia.org/r/302229 (https://phabricator.wikimedia.org/T76459) (owner: Paladox)
[04:07:51] (PS1) Dzahn: gerrit: simply RedirectMatch rules [puppet] - https://gerrit.wikimedia.org/r/304964
[04:08:41] (PS2) Dzahn: gerrit: simplify RedirectMatch rules, remove hostname [puppet] - https://gerrit.wikimedia.org/r/304964
[04:09:41] (CR) Dzahn: "already tested with apache-fast-test from tin on https://gerrit.wikimedia.org/r/#/c/302980/" [puppet] - https://gerrit.wikimedia.org/r/304964 (owner: Dzahn)
[04:10:02] (PS3) Dzahn: gerrit: simplify RedirectMatch rules, remove hostname [puppet] - https://gerrit.wikimedia.org/r/304964
[04:11:56] (CR) Dzahn: [C: 2] gerrit: simplify RedirectMatch rules, remove hostname [puppet] - https://gerrit.wikimedia.org/r/304964 (owner: Dzahn)
[04:19:23] (CR) Dzahn: [C: 2] "the puppet compiler link was unfortunately already deleted, but here's a new one http://puppet-compiler.wmflabs.org/3711/fluorine.eqiad.wm" [puppet] - https://gerrit.wikimedia.org/r/299672 (https://phabricator.wikimedia.org/T140313) (owner: Thcipriani)
[04:19:32] (PS3) Dzahn: Use hiera for udp2log-mw logrotate count [puppet] - https://gerrit.wikimedia.org/r/299672 (https://phabricator.wikimedia.org/T140313) (owner: Thcipriani)
[04:20:58] (CR) MZMcBride: "https://phabricator.wikimedia.org/T75997#2518198" [puppet] - https://gerrit.wikimedia.org/r/302229 (https://phabricator.wikimedia.org/T76459) (owner: Paladox)
[04:23:20] (CR) Dzahn: "ok on fluorine. just:" [puppet] - https://gerrit.wikimedia.org/r/299672 (https://phabricator.wikimedia.org/T140313) (owner: Thcipriani)
[04:30:20] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[04:43:53] (PS5) Dzahn: icinga,tendril: remove duplicate NameVirtualHost *:80 [puppet] - https://gerrit.wikimedia.org/r/297727 (https://phabricator.wikimedia.org/T132661)
[04:48:09] (CR) Dzahn: [C: 2] icinga,tendril: remove duplicate NameVirtualHost *:80 [puppet] - https://gerrit.wikimedia.org/r/297727 (https://phabricator.wikimedia.org/T132661) (owner: Dzahn)
[04:48:18] (PS6) Dzahn: icinga,tendril: remove duplicate NameVirtualHost *:80 [puppet] - https://gerrit.wikimedia.org/r/297727 (https://phabricator.wikimedia.org/T132661)
[05:00:16] (PS1) Dzahn: tendril: remove NameVirtualHost: *:443 [puppet] - https://gerrit.wikimedia.org/r/304968 (https://phabricator.wikimedia.org/T132661)
[05:08:24] (CR) Dzahn: [C: 2] tendril: remove NameVirtualHost: *:443 [puppet] - https://gerrit.wikimedia.org/r/304968 (https://phabricator.wikimedia.org/T132661) (owner: Dzahn)
[05:08:31] (PS2) Dzahn: tendril: remove NameVirtualHost: *:443 [puppet] - https://gerrit.wikimedia.org/r/304968 (https://phabricator.wikimedia.org/T132661)
[05:14:12] Operations, Patch-For-Review: ytterbium, neon and strontium daily cronspam - https://phabricator.wikimedia.org/T132661#2556142 (Dzahn) Alright, an Apache graceful on neon is now quiet after those 2 merges. And ytterbium and strontium are dead. So that should be resolved now.
[05:14:26] Operations: ytterbium, neon and strontium daily cronspam - https://phabricator.wikimedia.org/T132661#2556143 (Dzahn)
[05:14:34] Operations, Patch-For-Review: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2556146 (Dzahn)
[05:14:36] Operations: ytterbium, neon and strontium daily cronspam - https://phabricator.wikimedia.org/T132661#2206046 (Dzahn) Open>Resolved a:Dzahn
[05:23:16] https://gerrit.wikimedia.org/r/#/c/304678
[05:23:23] I have this in operations/puppet
[05:23:33] It would be great if one of ops merge it
[05:23:38] (it's for ores)
[05:23:54] I cherry-picked it in beta and worked just fine
[06:27:15] (PS1) Ladsgroup: Enable ORES review tool in plwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/304975 (https://phabricator.wikimedia.org/T140005)
[06:29:41] PROBLEM - puppet last run on mw1260 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:36] (PS1) Jdrewniak: Bumping portals to master New wikipedia.org layout. [mediawiki-config] - https://gerrit.wikimedia.org/r/304976 (https://phabricator.wikimedia.org/T140153)
[06:35:11] RECOVERY - HHVM jobrunner on mw1162 is OK: HTTP OK: HTTP/1.1 200 OK - 222 bytes in 0.006 second response time
[06:35:41] !log restarted hhvm on jobrunner mw1162 (deadlocked)
[06:35:43] Operations, ContentTranslation-CXserver, ContentTranslation-Deployments, MediaWiki-extensions-ContentTranslation, and 5 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2410847 (KartikMistry) Except few language pairs (which has their own tasks), every pack...
[06:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:36:02] Operations, ContentTranslation-CXserver, ContentTranslation-Deployments, MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2556211 (KartikMistry)
[06:36:31] Operations, ContentTranslation-CXserver, ContentTranslation-Deployments, MediaWiki-extensions-ContentTranslation, and 4 others: Package apertium (and dependencies) for Jessie - https://phabricator.wikimedia.org/T107306#2410849 (KartikMistry)
[06:36:56] Operations, ContentTranslation-CXserver, ContentTranslation-Deployments, MediaWiki-extensions-ContentTranslation, and 4 others: Package apertium (and dependencies) for Jessie - https://phabricator.wikimedia.org/T107306#2426133 (KartikMistry)
[06:38:29] (CR) KartikMistry: "recheck" [debs/contenttranslation/apertium-en-es] - https://gerrit.wikimedia.org/r/294314 (https://phabricator.wikimedia.org/T107306) (owner: KartikMistry)
[06:46:01] (CR) Paladox: "Your welcome" [puppet] - https://gerrit.wikimedia.org/r/302229 (https://phabricator.wikimedia.org/T76459) (owner: Paladox)
[06:48:42] (Draft2) Paladox: Bring back ostriches (Chad) change with no "" [puppet] - https://gerrit.wikimedia.org/r/304977
[06:48:46] (Draft1) Paladox: Bring back ostriches (Chad) change with no "" [puppet] - https://gerrit.wikimedia.org/r/304977
[06:50:27] (CR) Paladox: "This was tested on the gerrit-test install." [puppet] - https://gerrit.wikimedia.org/r/302980 (owner: Chad)
[06:55:45] Operations, Services: Migrate SCA cluster to SCB (Jessie and Node 4.2) - https://phabricator.wikimedia.org/T96017#2556249 (Arrbee)
[06:56:00] Operations, ContentTranslation-CXserver, ContentTranslation-Deployments, MediaWiki-extensions-ContentTranslation, and 4 others: Package apertium (and dependencies) for Jessie - https://phabricator.wikimedia.org/T107306#2556247 (Arrbee) Open>Resolved
[06:56:41] RECOVERY - puppet last run on mw1260 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:24:17] (PS1) Giuseppe Lavagetto: puppetmaster: add puppetdbquery module [puppet] - https://gerrit.wikimedia.org/r/304981
[07:24:19] (PS1) Giuseppe Lavagetto: ssh::client: allow using puppetdb [puppet] - https://gerrit.wikimedia.org/r/304982
[07:24:21] (PS1) Giuseppe Lavagetto: prometheus::ops: allow using puppetdb [puppet] - https://gerrit.wikimedia.org/r/304983
[07:25:36] (CR) jenkins-bot: [V: -1] puppetmaster: add puppetdbquery module [puppet] - https://gerrit.wikimedia.org/r/304981 (owner: Giuseppe Lavagetto)
[07:25:54] (CR) jenkins-bot: [V: -1] ssh::client: allow using puppetdb [puppet] - https://gerrit.wikimedia.org/r/304982 (owner: Giuseppe Lavagetto)
[07:26:52] (CR) jenkins-bot: [V: -1] prometheus::ops: allow using puppetdb [puppet] - https://gerrit.wikimedia.org/r/304983 (owner: Giuseppe Lavagetto)
[07:50:24] (PS2) Giuseppe Lavagetto: puppetmaster: add puppetdbquery module [puppet] - https://gerrit.wikimedia.org/r/304981
[07:51:39] (CR) Jcrespo: "Could you coordinate with Ariel to make sure most slow reports and dumps are not performed at the same time, as they share resources right" [puppet] - https://gerrit.wikimedia.org/r/304696 (https://phabricator.wikimedia.org/T142936) (owner: Nemo bis)
[07:55:06] (CR) ArielGlenn: "These dates are fine for now. The first dump run of the month does its heavy work during the first few days of the month, and the second r" [puppet] - https://gerrit.wikimedia.org/r/304696 (https://phabricator.wikimedia.org/T142936) (owner: Nemo bis)
[07:56:44] Operations, Ops-Access-Requests, Analytics: Add analytics team members to group aqs-admins to be able to deploy pageview APi - https://phabricator.wikimedia.org/T142101#2556320 (akosiaris) > We won't gain any extra sudo permissions, but this group will be used to grant access to the deploymenet ssh k...
[08:03:03] (PS3) Nemo bis: Monthly update of the "slowest" querypages on the English Wikipedia [puppet] - https://gerrit.wikimedia.org/r/304696 (https://phabricator.wikimedia.org/T142936)
[08:03:18] (CR) Nemo bis: "Thanks for confirming!" [puppet] - https://gerrit.wikimedia.org/r/304696 (https://phabricator.wikimedia.org/T142936) (owner: Nemo bis)
[08:21:37] (PS1) ArielGlenn: reduce cronspam: filter out dataset rsync error messages about vanishing files [puppet] - https://gerrit.wikimedia.org/r/304988
[08:22:59] Operations, ArticlePlaceholder, Traffic, Wikidata: Performance and caching considerations for article placeholders accesses - https://phabricator.wikimedia.org/T142944#2556345 (Joe)
[08:23:43] (CR) ArielGlenn: [C: 2] reduce cronspam: filter out dataset rsync error messages about vanishing files [puppet] - https://gerrit.wikimedia.org/r/304988 (owner: ArielGlenn)
[08:36:56] Operations, Pybal, Traffic: Unhandled pybal ValueError: need more than 1 value to unpack - https://phabricator.wikimedia.org/T143078#2556355 (ema)
[08:40:39] akosiaris: Hey, do you have some time to check this? https://gerrit.wikimedia.org/r/#/c/304678/2
[08:45:59] (PS2) Giuseppe Lavagetto: redis::instance: use specific aof/rdb file names by default [puppet] - https://gerrit.wikimedia.org/r/301789 (https://phabricator.wikimedia.org/T134400)
[08:52:00] (CR) Alexandros Kosiaris: [C: 2] Citoid: increase the number of redirects to 10 [puppet] - https://gerrit.wikimedia.org/r/304845 (https://phabricator.wikimedia.org/T115108) (owner: Mobrovac)
[08:52:04] (PS2) Alexandros Kosiaris: Citoid: increase the number of redirects to 10 [puppet] - https://gerrit.wikimedia.org/r/304845 (https://phabricator.wikimedia.org/T115108) (owner: Mobrovac)
[08:52:28] (PS1) Gilles: Update gallery image bounding box on svwiki to 150x150 [mediawiki-config] - https://gerrit.wikimedia.org/r/304991 (https://phabricator.wikimedia.org/T113877)
[08:56:56] (CR) Alexandros Kosiaris: [C: 2] ores: Enable uwsgi-specific statsd setup [puppet] - https://gerrit.wikimedia.org/r/304678 (https://phabricator.wikimedia.org/T141543) (owner: Ladsgroup)
[08:57:00] (PS3) Alexandros Kosiaris: ores: Enable uwsgi-specific statsd setup [puppet] - https://gerrit.wikimedia.org/r/304678 (https://phabricator.wikimedia.org/T141543) (owner: Ladsgroup)
[08:57:23] (CR) Alexandros Kosiaris: [V: 2] ores: Enable uwsgi-specific statsd setup [puppet] - https://gerrit.wikimedia.org/r/304678 (https://phabricator.wikimedia.org/T141543) (owner: Ladsgroup)
[08:57:24] thanks
[08:57:28] (PS4) Alexandros Kosiaris: ores: Enable uwsgi-specific statsd setup [puppet] - https://gerrit.wikimedia.org/r/304678 (https://phabricator.wikimedia.org/T141543) (owner: Ladsgroup)
[08:57:30] (CR) Alexandros Kosiaris: [V: 2] ores: Enable uwsgi-specific statsd setup [puppet] - https://gerrit.wikimedia.org/r/304678 (https://phabricator.wikimedia.org/T141543) (owner: Ladsgroup)
[08:57:50] Amir1: done. pretty nice
[08:58:03] Thanks!
[08:59:15] (PS4) Ema: common VCL: use FQDN for backend naming [puppet] - https://gerrit.wikimedia.org/r/276529 (https://phabricator.wikimedia.org/T138546) (owner: BBlack)
[08:59:25] (CR) Ema: [C: 2 V: 2] common VCL: use FQDN for backend naming [puppet] - https://gerrit.wikimedia.org/r/276529 (https://phabricator.wikimedia.org/T138546) (owner: BBlack)
[09:07:43] PROBLEM - puppet last run on cp2008 is CRITICAL: CRITICAL: Puppet has 2 failures
[09:09:43] RECOVERY - puppet last run on cp2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:09:58] (CR) Giuseppe Lavagetto: [C: -1] "This patch will be fine in production, as all modules that use a redis::instance and set "appendonly" to true then go on to define an 'app" [puppet] - https://gerrit.wikimedia.org/r/301789 (https://phabricator.wikimedia.org/T134400) (owner: Giuseppe Lavagetto)
[09:10:42] Operations, Mail: Emails dropping from Greenhouse to Alan - https://phabricator.wikimedia.org/T142427#2556423 (akosiaris) Hello @bbogaert, I 've looked at the logs. There have been no emails (failed, successful or otherwise) for alau@wikimedia.org from greenhouse in that timeframe (or any other timefram...
[09:10:51] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0]
[09:12:50] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0]
[09:13:03] (PS1) Phedenskog: Enable PerformanceInspector extension for labs [mediawiki-config] - https://gerrit.wikimedia.org/r/304992
[09:15:10] (CR) Giuseppe Lavagetto: "I just found out that at least on tools-proxy-01 the appendfilename was declared in the exact form we're adopting here, unpuppetized, in /" [puppet] - https://gerrit.wikimedia.org/r/301789 (https://phabricator.wikimedia.org/T134400) (owner: Giuseppe Lavagetto)
[09:16:21] PROBLEM - Varnishkafka log producer on cp1058 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka
[09:17:02] that's me ^
[09:18:20] RECOVERY - Varnishkafka log producer on cp1058 is OK: PROCS OK: 1 process with command name varnishkafka
[09:18:41] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [1000.0]
[09:20:25] (CR) Alexandros Kosiaris: [C: -1] Maps - ensure Osmosis is only installed after JRE is available (3 comments) [puppet] - https://gerrit.wikimedia.org/r/304806 (https://phabricator.wikimedia.org/T142977) (owner: Gehel)
[09:20:32] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: Puppet has 2 failures
[09:20:40] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0]
[09:22:21] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 2 failures
[09:22:32] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:22:40] (Abandoned) Gehel: Maps - ensure Osmosis is only installed after JRE is available [puppet] - https://gerrit.wikimedia.org/r/304806 (https://phabricator.wikimedia.org/T142977) (owner: Gehel)
[09:24:20] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:24:32] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Puppet has 2 failures
[09:26:31] PROBLEM - Varnishkafka log producer on cp3007 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka
[09:26:31] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 8.33% of data above the critical threshold [1000.0]
[09:26:31] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Puppet has 2 failures
[09:26:32] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:27:03] (PS1) Giuseppe Lavagetto: dynamicproxy: puppetize appendfilename setting [puppet] - https://gerrit.wikimedia.org/r/304994
[09:27:10] PROBLEM - Varnish HTTP misc-backend - port 3128 on cp3007 is CRITICAL: Connection refused
[09:27:11] PROBLEM - Varnish HTTP misc-frontend - port 3127 on cp3007 is CRITICAL: Connection refused
[09:27:41] PROBLEM - Varnish HTTP misc-frontend - port 80 on cp3007 is CRITICAL: Connection refused
[09:28:04] (CR) jenkins-bot: [V: -1] dynamicproxy: puppetize appendfilename setting [puppet] - https://gerrit.wikimedia.org/r/304994 (owner: Giuseppe Lavagetto)
[09:28:31] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Puppet has 2 failures
[09:28:31] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:29:11] RECOVERY - Varnish HTTP misc-backend - port 3128 on cp3007 is OK: HTTP OK: HTTP/1.1 200 OK - 178 bytes in 0.174 second response time
[09:29:11] RECOVERY - Varnish HTTP misc-frontend - port 3127 on cp3007 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.173 second response time
[09:29:51] RECOVERY - Varnish HTTP misc-frontend - port 80 on cp3007 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.168 second response time
[09:30:12] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: Puppet has 2 failures
[09:30:40] RECOVERY - Varnishkafka log producer on cp3007 is OK: PROCS OK: 1 process with command name varnishkafka
[09:30:41] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[09:30:52] sorry for the spam
[09:32:12] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:34:11] PROBLEM - puppet last run on cp3037 is CRITICAL: CRITICAL: Puppet has 2 failures
[09:35:11] PROBLEM - puppet last run on cp2011 is CRITICAL: CRITICAL: Puppet has 2 failures
[09:36:02] PROBLEM - Varnishkafka log producer on cp2009 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka
[09:36:11] RECOVERY - puppet last run on cp3037 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:36:31] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Puppet has 2 failures
[09:37:12] RECOVERY - puppet last run on cp2011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:38:02] RECOVERY - Varnishkafka log producer on cp2009 is OK: PROCS OK: 1 process with command name varnishkafka
[09:38:20] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: Puppet has 2 failures
[09:38:31] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:40:11] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:42:11] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: Puppet has 2 failures
[09:44:11] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:52:22] PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: Puppet has 2 failures
[09:54:01] PROBLEM - puppet last run on cp1050 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:54:22]
RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:55:51] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 2 failures [09:56:01] RECOVERY - puppet last run on cp1050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:57:50] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:00:21] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Puppet has 2 failures [10:01:40] PROBLEM - puppet last run on cp2017 is CRITICAL: CRITICAL: Puppet has 2 failures [10:02:05] <_joe_> uh, what is happening? [10:02:20] _joe_: that's me merging https://gerrit.wikimedia.org/r/#/c/276529/ [10:02:22] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:02:48] I've tried reducing puppetfails spam by doing a 2x puppet agent run with salt, but it didn't really work [10:03:40] RECOVERY - puppet last run on cp2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:04:38] <_joe_> no I mean cp3007 [10:04:45] <_joe_> it crashed I think [10:05:36] really? 
[10:05:39] looks up to me [10:05:55] <_joe_> it crashed and recovered according to icinga [10:06:01] PROBLEM - puppet last run on cp2025 is CRITICAL: CRITICAL: Puppet has 2 failures [10:06:34] <_joe_> systemctl status varnish confirms [10:07:00] <_joe_> it's half an hour ago [10:07:08] <_joe_> I wasn't looking at IRC atm :) [10:07:19] oh yeah similar problem on cp3003 now [10:07:52] I'm going to depool cp3003 so we can look into what's going on [10:08:00] RECOVERY - puppet last run on cp2025 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:08:11] PROBLEM - Varnishkafka log producer on cp3003 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [10:08:42] PROBLEM - Varnish HTTP maps-frontend - port 3127 on cp3003 is CRITICAL: Connection refused [10:09:01] PROBLEM - puppet last run on cp1065 is CRITICAL: CRITICAL: Puppet has 1 failures [10:09:20] PROBLEM - Varnish HTTP maps-frontend - port 80 on cp3003 is CRITICAL: Connection refused [10:10:11] PROBLEM - Varnish HTTP misc-frontend - port 3127 on cp2025 is CRITICAL: Connection refused [10:10:30] PROBLEM - Varnish HTTP misc-frontend - port 80 on cp2025 is CRITICAL: Connection refused [10:10:32] PROBLEM - Varnish HTTP misc-backend - port 3128 on cp2025 is CRITICAL: Connection refused [10:11:00] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:11:01] PROBLEM - Varnishkafka log producer on cp2025 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [10:11:42] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Puppet has 2 failures [10:12:01] RECOVERY - Varnish HTTP misc-frontend - port 3127 on cp2025 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.075 second response time [10:12:26] PROBLEM - LVS HTTP IPv6 on maps-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused [10:12:28] RECOVERY - Varnish HTTP misc-frontend - port 80 on cp2025 is OK: HTTP OK: HTTP/1.1 200 OK - 
320 bytes in 0.073 second response time [10:12:40] RECOVERY - Varnish HTTP misc-backend - port 3128 on cp2025 is OK: HTTP OK: HTTP/1.1 200 OK - 174 bytes in 0.075 second response time [10:12:42] PROBLEM - Varnish HTTP maps-frontend - port 3127 on cp3005 is CRITICAL: Connection refused [10:12:42] PROBLEM - Varnish HTTP maps-frontend - port 3127 on cp3006 is CRITICAL: Connection refused [10:12:42] PROBLEM - Varnish HTTP maps-frontend - port 80 on cp3005 is CRITICAL: Connection refused [10:13:02] RECOVERY - Varnishkafka log producer on cp2025 is OK: PROCS OK: 1 process with command name varnishkafka [10:13:11] <_joe_> ema: what is happening with maps varnishes? they're crashing one after another [10:13:19] <_joe_> maybe something caused by your change, maybe not [10:13:20] PROBLEM - Varnish HTTP maps-frontend - port 80 on cp3006 is CRITICAL: Connection refused [10:13:25] _joe_: I'm trying to find out [10:13:34] <_joe_> maybe stop applying it for now [10:13:34] in the meantime I've stopped the salted puppet run [10:13:41] <_joe_> heh, yes [10:13:50] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:13:50] PROBLEM - PyBal backends health check on lvs3004 is CRITICAL: PYBAL CRITICAL - mapslb6_80 - Could not depool server cp3006.esams.wmnet because of too many down!: mapslb_443 - Could not depool server cp3006.esams.wmnet because of too many down!: mapslb6_443 - Could not depool server cp3004.esams.wmnet because of too many down!: mapslb_80 - Could not depool server cp3004.esams.wmnet because of too many down! 
[10:14:00] PROBLEM - PyBal backends health check on lvs3002 is CRITICAL: PYBAL CRITICAL - mapslb6_80 - Could not depool server cp3006.esams.wmnet because of too many down!: mapslb_443 - Could not depool server cp3004.esams.wmnet because of too many down!: mapslb6_443 - Could not depool server cp3004.esams.wmnet because of too many down!: mapslb_80 - Could not depool server cp3006.esams.wmnet because of too many down! [10:14:05] er, maybe revert ? on all but one ? [10:14:06] PROBLEM - LVS HTTPS IPv6 on maps-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.368 second response time [10:14:07] PROBLEM - Varnishkafka log producer on cp3005 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [10:14:08] PROBLEM - puppet last run on cp1060 is CRITICAL: CRITICAL: Puppet has 1 failures [10:14:08] I'll fix them manually, it looks like this is enough: [10:14:09] systemctl stop varnish ; systemctl start varnish ; systemctl stop varnish-frontend.service ; systemctl start varnish-frontend [10:14:10] PROBLEM - Varnishkafka log producer on cp3006 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [10:15:33] <_joe_> ema: doing on cp3005 [10:15:36] <_joe_> to confirm [10:15:53] <_joe_> seems like the reload kills varnish [10:16:06] _joe_: OK, I've left cp3003 depooled in broken state to investigate the actual cause later [10:16:07] RECOVERY - LVS HTTPS IPv6 on maps-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 2751 bytes in 0.353 second response time [10:16:10] RECOVERY - puppet last run on cp1060 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:16:30] <_joe_> ema: ok [10:16:51] RECOVERY - Varnish HTTP maps-frontend - port 3127 on cp3005 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.168 second response time [10:16:51] RECOVERY - Varnish HTTP maps-frontend - port 80 on cp3005 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.174 second response 
time [10:16:51] PROBLEM - Varnish HTTP misc-frontend - port 3127 on cp1061 is CRITICAL: Connection refused [10:17:11] PROBLEM - Varnish HTTP misc-frontend - port 80 on cp1061 is CRITICAL: Connection refused [10:17:12] PROBLEM - Varnishkafka log producer on cp1061 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [10:17:44] <_joe_> ema: I'll fix 3006 as well [10:18:01] _joe_: thanks, I've done cp1060 and cp1061 just now [10:18:17] <_joe_> it's pretty strange, 3006 [10:18:22] <_joe_> is up according to systemd [10:18:27] I do not see production impact; the good news is that paging works after the maintenance [10:18:36] RECOVERY - LVS HTTP IPv6 on maps-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 390 bytes in 0.172 second response time [10:18:39] _joe_: that's the case everywhere I think [10:18:57] <_joe_> ema: yeah but on 3006 only the frontend crashed [10:19:01] RECOVERY - Varnish HTTP misc-frontend - port 3127 on cp1061 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.002 second response time [10:19:13] RECOVERY - Varnish HTTP misc-frontend - port 80 on cp1061 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.006 second response time [10:19:28] keeps on crashing in fact [10:19:30] RECOVERY - Varnish HTTP maps-frontend - port 80 on cp3006 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.171 second response time [10:19:39] <_joe_> akosiaris: where does it keep crashing? [10:19:40] PROBLEM - Varnishkafka log producer on cp1060 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [10:19:43] _joe_: probably it's similar to https://phabricator.wikimedia.org/T142810#2547422 [10:19:55] _joe_: cp3006.. at least I assume. 
it has a 26s age now [10:20:01] RECOVERY - PyBal backends health check on lvs3004 is OK: PYBAL OK - All pools are healthy [10:20:03] <_joe_> akosiaris: just restarted it [10:20:22] RECOVERY - Varnishkafka log producer on cp3006 is OK: PROCS OK: 1 process with command name varnishkafka [10:20:24] ok then [10:20:43] <_joe_> so esams is now ok, apart from 3003 [10:20:51] RECOVERY - Varnish HTTP maps-frontend - port 3127 on cp3006 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.168 second response time [10:21:04] ah. journal has a stack trace [10:21:20] Message: [10:21:20] Loading VMOD header from /usr/lib/x86_64-linux-gnu/varnish/vmods/libvmod_header.so: [10:21:20] This is no longer the same file seen by the VCL-compiler. [10:21:34] akosiaris: yes, that's T142810 [10:21:34] T142810: varnishd: Assert error in smp_oc_getobj(), storage/storage_persistent_silo.c line 417 - https://phabricator.wikimedia.org/T142810 [10:21:45] ok [10:21:59] I think I know what's going on, I did a rolling restart of maps backends on Friday, but not the frontends [10:22:10] RECOVERY - PyBal backends health check on lvs3002 is OK: PYBAL OK - All pools are healthy [10:22:15] that would explain why _joe_ found the frontend down on cp3006 but not the backend [10:24:02] ok yes. Only misc and maps seem to be affected, so very likely this is due to T142810 [10:24:03] T142810: varnishd: Assert error in smp_oc_getobj(), storage/storage_persistent_silo.c line 417 - https://phabricator.wikimedia.org/T142810 [10:25:19] (03PS2) 10Gehel: remove nobelium from puppet and install-server [puppet] - 10https://gerrit.wikimedia.org/r/304112 (https://phabricator.wikimedia.org/T142581) (owner: 10Dzahn) [10:28:04] confirmed, this is due to the vmod problem. Fixing cp3003 now. 
[10:28:25] ah, cool [10:28:32] thanks _joe_, akosiaris and jynus [10:28:41] RECOVERY - Varnish HTTP maps-frontend - port 3127 on cp3003 is OK: HTTP OK: HTTP/1.1 200 OK - 319 bytes in 0.174 second response time [10:29:21] RECOVERY - Varnish HTTP maps-frontend - port 80 on cp3003 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.169 second response time [10:30:22] RECOVERY - Varnishkafka log producer on cp3003 is OK: PROCS OK: 1 process with command name varnishkafka [10:30:22] (03CR) 10Gehel: [C: 032] remove nobelium from puppet and install-server [puppet] - 10https://gerrit.wikimedia.org/r/304112 (https://phabricator.wikimedia.org/T142581) (owner: 10Dzahn) [10:30:58] 06Operations, 10netops: Network ACL rules to allow traffic from Analytics to Production for port 9160 - https://phabricator.wikimedia.org/T138609#2556618 (10akosiaris) Any updates on this one ? [10:31:31] 06Operations, 05codfw-rollout: Scale up and out our puppetmaster infrastructure - https://phabricator.wikimedia.org/T98128#2556627 (10akosiaris) [10:31:32] RECOVERY - Varnishkafka log producer on cp1060 is OK: PROCS OK: 1 process with command name varnishkafka [10:31:33] 06Operations, 13Patch-For-Review: Investigate the compatibility of our puppet tree with ruby2.1 and create a plan to upgrade - https://phabricator.wikimedia.org/T98129#2556625 (10akosiaris) 05Open>03Resolved This is done. 
We 've even upgraded this quarter to puppet 3.8 and ruby 2.1 [10:33:02] (03PS2) 10Alexandros Kosiaris: Point eqiad url-downloader to codfw [dns] - 10https://gerrit.wikimedia.org/r/304210 (https://phabricator.wikimedia.org/T134496) [10:33:07] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Point eqiad url-downloader to codfw [dns] - 10https://gerrit.wikimedia.org/r/304210 (https://phabricator.wikimedia.org/T134496) (owner: 10Alexandros Kosiaris) [10:34:11] RECOVERY - Varnishkafka log producer on cp3005 is OK: PROCS OK: 1 process with command name varnishkafka [10:35:11] RECOVERY - Varnishkafka log producer on cp1061 is OK: PROCS OK: 1 process with command name varnishkafka [10:35:13] (03PS2) 10Alexandros Kosiaris: Introduce aluminium.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/304211 (https://phabricator.wikimedia.org/T134496) [10:35:20] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Introduce aluminium.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/304211 (https://phabricator.wikimedia.org/T134496) (owner: 10Alexandros Kosiaris) [10:36:13] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [10:48:03] I'm gonna go for a rolling restart of v4 varnishes to fix the vmod issue [10:52:27] !log rolling restart of v4 varnishes (T142810) [10:52:27] T142810: varnishd: Assert error in smp_oc_getobj(), storage/storage_persistent_silo.c line 417 - https://phabricator.wikimedia.org/T142810 [10:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:57:21] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: puppet fail [11:13:21] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Puppet has 2 failures [11:13:29] re-enabled puppet on cp4019 ^ [11:15:21] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:26:01] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, 
last run 50 seconds ago with 0 failures [11:30:18] (03PS1) 10Muehlenhoff: Update to 4.4.18 [debs/linux44] - 10https://gerrit.wikimedia.org/r/305008 [11:35:33] (03PS2) 10Muehlenhoff: Update to 4.4.18 [debs/linux44] - 10https://gerrit.wikimedia.org/r/305008 [11:40:49] (03CR) 10Muehlenhoff: [C: 032] Update to 4.4.18 [debs/linux44] - 10https://gerrit.wikimedia.org/r/305008 (owner: 10Muehlenhoff) [11:42:04] (03PS1) 10Giuseppe Lavagetto: admin: add some dotfiles for myself [puppet] - 10https://gerrit.wikimedia.org/r/305011 [11:44:29] (03PS2) 10Giuseppe Lavagetto: admin: add some dotfiles for myself [puppet] - 10https://gerrit.wikimedia.org/r/305011 [11:44:35] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] admin: add some dotfiles for myself [puppet] - 10https://gerrit.wikimedia.org/r/305011 (owner: 10Giuseppe Lavagetto) [11:49:31] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [11:55:22] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [12:13:11] !log re-enabling puppet on cache hosts (T138546) [12:13:12] T138546: Backend naming in VCL needs to use fqdn+port - https://phabricator.wikimedia.org/T138546 [12:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:16:25] (03PS1) 10Addshore: Add Collection render note for articles & rdf2latex [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305012 (https://phabricator.wikimedia.org/T135613) [12:17:24] (03CR) 10Alexandros Kosiaris: [C: 031] ssh::client: allow using puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/304982 (owner: 10Giuseppe Lavagetto) [12:19:52] (03Draft1) 10Addshore: Enable mention status notifications on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304607 [12:19:54] (03Draft1) 10Addshore: Enable mention status notifications everywhere [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/304608 [12:20:38] (03CR) 10Alexandros Kosiaris: [C: 031] puppetmaster: add puppetdbquery module [puppet] - 10https://gerrit.wikimedia.org/r/304981 (owner: 10Giuseppe Lavagetto) [12:21:28] (03PS2) 10Addshore: wmgEchoMentionStatusNotifications true for test/test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302898 (https://phabricator.wikimedia.org/T141995) [12:21:34] (03PS2) 10Addshore: Enable mention status notifications on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304607 [12:21:41] (03PS2) 10Addshore: Enable mention status notifications everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304608 [12:25:57] (03CR) 10Alexandros Kosiaris: [C: 04-1] sshknowngen: add puppetdb-compatible version (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/304485 (https://phabricator.wikimedia.org/T142846) (owner: 10Giuseppe Lavagetto) [12:26:23] (03PS8) 10Addshore: Add simple-json-datasource plugin to labs grafana [puppet] - 10https://gerrit.wikimedia.org/r/302119 (https://phabricator.wikimedia.org/T141636) [12:26:31] 06Operations, 06Discovery, 10Elasticsearch, 03Discovery-Search-Sprint: Reclaim nobelium - https://phabricator.wikimedia.org/T142581#2556875 (10Gehel) a:03Gehel [12:27:17] (03CR) 10Alexandros Kosiaris: "when should we merge this ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/296687 (https://phabricator.wikimedia.org/T139008) (owner: 10Ladsgroup) [12:31:45] <_joe_> akosiaris: heh sorry, I forgot to abandon that change [12:35:51] _joe_: no worries [12:38:31] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "Merging this" [puppet] - 10https://gerrit.wikimedia.org/r/304471 (https://phabricator.wikimedia.org/T120103) (owner: 10Mobrovac) [12:38:36] (03PS2) 10Alexandros Kosiaris: Admin: Add Parsoid deployers to the deploy-service group [puppet] - 10https://gerrit.wikimedia.org/r/304471 (https://phabricator.wikimedia.org/T120103) (owner: 10Mobrovac) [12:38:40] (03CR) 10Alexandros Kosiaris: [V: 032] Admin: Add Parsoid deployers to the deploy-service group [puppet] - 10https://gerrit.wikimedia.org/r/304471 (https://phabricator.wikimedia.org/T120103) (owner: 10Mobrovac) [12:40:19] (03CR) 10Mobrovac: [C: 04-1] "Not ready to go until T142990 is fixed" [puppet] - 10https://gerrit.wikimedia.org/r/304470 (https://phabricator.wikimedia.org/T120103) (owner: 10Mobrovac) [12:40:25] thnx akosiaris! [12:40:31] now i can test in beta :) [12:41:12] (03PS1) 10Alexandros Kosiaris: url-downloader: Remove the role from chromium/hydrogen [puppet] - 10https://gerrit.wikimedia.org/r/305013 (https://phabricator.wikimedia.org/T134496) [12:41:14] (03PS1) 10Alexandros Kosiaris: Introduce aluminium.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/305014 (https://phabricator.wikimedia.org/T134496) [12:41:33] we should enable the group on wtp boxes though [12:41:50] but that should happen on its own [12:45:02] i'm getting: Server („https://wikimedia.org/api/rest_“) hlásí: „Cannot get mml. Server problem.“ for [12:45:12] hlasi == says [12:45:20] ema: Just to check... T138546 also changed the related graphite metrics names? [12:45:21] T138546: Backend naming in VCL needs to use fqdn+port - https://phabricator.wikimedia.org/T138546 [12:46:58] Danny_B: hmm, which page ? 
[12:47:20] akosiaris: https://cs.wikibooks.org/wiki/Integrov%C3%A1n%C3%AD/V%C3%BDpo%C4%8Det_re%C3%A1ln%C3%BDch_integr%C3%A1l%C5%AF_pomoc%C3%AD_reziduov%C3%A9_v%C4%9Bty [12:47:57] gehel: that commit changes the varnish backend names in VCL, I don't think it should change anything on the graphite side [12:48:05] ema: I see new metrics in graphite that appear at the same time as the merge of https://gerrit.wikimedia.org/r/#/c/276529 [12:48:13] akosiaris: i am even surprised by totally indescriptive error message, consider the weird url as well [12:48:33] ema: I was reading the VCL, but I'm completely unsure of what I understand there... [12:48:40] gehel: do you have an example of the new metric? [12:49:14] ema: https://graphite.wikimedia.org/S/Bk [12:49:34] new metric: varnish.eqiad.backends.be_wdqs1001_eqiad_wmnet.GET.p99 [12:49:58] Danny_B: yeah... same here... it does not help much [12:50:06] ema: there is ~3h of overlap between the old and the new metric [12:50:38] (03PS3) 10Addshore: wmgEchoMentionStatusNotifications true for test/test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302898 (https://phabricator.wikimedia.org/T141995) [12:50:42] gehel: oh yeah that looks like it [12:52:14] ema: I don't have the full context, but it seems that this rename probably make sense also on graphite. So time to update a few dashboards? [12:53:43] !log Put a better workaround for T132839 in place: Only remove property pairs with context = "item". This keeps ref and qualifier pairs for ext ids intact. [12:53:46] T132839: Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839 [12:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:54:37] gehel: probably, thanks for pointing this out! [12:55:27] ema: thanks! I have an icinga alert on WDQS based on those metrics. I wanted to make sure we are going to keep this new naming before changing it... 
[12:56:14] bblack: the backend naming change seems to also reflect on graphite metrics ^ [12:58:43] gehel: I think we're going to keep the new naming unless something breaks [12:59:24] ema: It most probably make sense, even if it is just to keep things aligned... [13:02:24] 06Operations, 07Puppet, 10Monitoring: Puppet agent icinga checks need better logic - https://phabricator.wikimedia.org/T143099#2556950 (10BBlack) [13:02:59] ema: yeah [13:03:22] gehel: what do you alert on from those graphite metrics? [13:04:12] bblack: alert to wdqs admins on increased response time [13:05:11] that seems like it would get noisy over time? isn't response time subject to how complex users' queries are? [13:05:17] bblack: I know it is not a very stable check. [13:06:15] we have a fairly high threshold for warning / critical (2' / 5'). We did have a major slowdown sometime ago that went undetected for too long, so we added that one [13:16:24] 06Operations, 10ArticlePlaceholder, 10Traffic, 10Wikidata: Performance and caching considerations for article placeholders accesses - https://phabricator.wikimedia.org/T142944#2551996 (10BBlack) 30 minutes isn't really reasonable, and neither is spamming more purge traffic. If there's a constant risk of t... 
[13:18:22] (03PS1) 10Gehel: WDQS - fix icinga graphite check, metric has been renamed [puppet] - 10https://gerrit.wikimedia.org/r/305020 (https://phabricator.wikimedia.org/T138546) [13:19:58] (03PS3) 10Ottomata: Mirror main-eqiad into main-codfw [puppet] - 10https://gerrit.wikimedia.org/r/304928 (https://phabricator.wikimedia.org/T134184) [13:20:00] (03PS2) 10Gehel: WDQS - fix icinga graphite check, metric has been renamed [puppet] - 10https://gerrit.wikimedia.org/r/305020 (https://phabricator.wikimedia.org/T138546) [13:22:33] (03PS3) 10Gehel: WDQS - fix icinga graphite check, metric has been renamed [puppet] - 10https://gerrit.wikimedia.org/r/305020 (https://phabricator.wikimedia.org/T138546) [13:24:37] (03PS3) 10Giuseppe Lavagetto: puppetmaster: add puppetdbquery module [puppet] - 10https://gerrit.wikimedia.org/r/304981 [13:30:30] (03PS3) 10Addshore: Enable mention status notifications on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304607 (https://phabricator.wikimedia.org/T143100) [13:30:51] (03PS3) 10Addshore: Enable mention status notifications everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304608 (https://phabricator.wikimedia.org/T143101) [13:30:55] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster: add puppetdbquery module [puppet] - 10https://gerrit.wikimedia.org/r/304981 (owner: 10Giuseppe Lavagetto) [13:38:25] Danny_B: seems like a bug. 
The error is TeX parse error: \\nolimits is allowed only on operators [13:38:37] we should file it [13:39:27] (03PS4) 10Gehel: WDQS - fix icinga graphite check, metric has been renamed [puppet] - 10https://gerrit.wikimedia.org/r/305020 (https://phabricator.wikimedia.org/T138546) [13:43:43] (03PS2) 10Alexandros Kosiaris: url-downloader: Remove the role from chromium/hydrogen [puppet] - 10https://gerrit.wikimedia.org/r/305013 (https://phabricator.wikimedia.org/T134496) [13:43:57] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "chromium is effectively drained, merging" [puppet] - 10https://gerrit.wikimedia.org/r/305013 (https://phabricator.wikimedia.org/T134496) (owner: 10Alexandros Kosiaris) [13:45:40] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0] [13:51:29] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0] [13:53:21] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [14:15:56] godog: there is a wish to add ~1000 more time series from maps to graphite. Any idea if we have capacity? [14:30:02] (03CR) 10Anomie: [C: 031] Remove $wgDisableAuthManager [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303939 (owner: 10Gergő Tisza) [14:34:56] (03PS11) 10Mattflaschen: Change login cookies (for 'Keep me logged in') to a one year expiry. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230954 (https://phabricator.wikimedia.org/T68699) [14:35:29] (03CR) 10Mattflaschen: Change login cookies (for 'Keep me logged in') to a one year expiry. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/230954 (https://phabricator.wikimedia.org/T68699) (owner: 10Mattflaschen) [15:00:05] anomie, ostriches, thcipriani, hashar, and twentyafterfour: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160816T1500). Please do the needful. [15:00:05] matt_flaschen, Amir1, jan_drewniak, and mlitn: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:19] jouncebot, present. [15:00:39] o/ [15:00:42] Hey mlitn, Amir1 [15:00:54] (03CR) 1020after4: [C: 031] Phab: Remove unused phab-deploy-key files [puppet] - 10https://gerrit.wikimedia.org/r/301847 (owner: 10Chad) [15:01:17] (03CR) 1020after4: "dzahn: ack!" [puppet] - 10https://gerrit.wikimedia.org/r/301847 (owner: 10Chad) [15:01:44] I can SWAT, unless you were getting ready to do so matt_flaschen [15:02:25] thcipriani, if you're available to do it, I'd appreciate it. I wasn't planning to, but I can. [15:02:47] (03PS1) 10Muehlenhoff: package_builder: Add maven-repo-helper to list of installed packages [puppet] - 10https://gerrit.wikimedia.org/r/305026 [15:02:53] matt_flaschen: I will SWAT, np. [15:03:34] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230954 (https://phabricator.wikimedia.org/T68699) (owner: 10Mattflaschen) [15:04:02] (03Merged) 10jenkins-bot: Change login cookies (for 'Keep me logged in') to a one year expiry. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/230954 (https://phabricator.wikimedia.org/T68699) (owner: 10Mattflaschen) [15:05:37] I'm around now [15:05:52] I admit I forgot I have a patch [15:05:55] matt_flaschen: https://gerrit.wikimedia.org/r/#/c/230954/11 is live on mw1099, check please [15:06:02] I'm not so late though :D [15:06:07] matt_flaschen: o/ [15:07:30] Amir1: remind me of the procedure for enabling ores on new wikis? 
[15:08:01] thcipriani: hey, 1- create two tables 2- run two maintenance scripts [15:08:07] let me grab one of the old ones [15:08:50] https://gerrit.wikimedia.org/r/#/c/298715/ [15:09:13] thcipriani: ^ [15:09:22] Amir1: ack, thank you [15:11:22] I’m here [15:11:38] thcipriani, yeah, looks good. One of the cookies (login.wikimedia.org) isn't updated yet, but I think that's just an artifact of how the browser extension works. [15:11:51] The other one is working right. [15:11:59] 06Operations, 10Traffic: TLS stats regression related to Chrome/41 on Windows - https://phabricator.wikimedia.org/T141786#2557297 (10BBlack) This traffic is still going strong, and we still don't have a solid explanation. To recap some further investigation since: The common pattern is these UAs are doing a... [15:12:04] matt_flaschen: ack, pushing live everywhere [15:14:28] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:230954|Change login cookies (for "Keep me logged in") to a one year expiry. (T68699)]] (duration: 01m 08s) [15:14:29] T68699: Increase "remember me" login cookie expiry from 30 days to 1 year on Wikimedia wikis - https://phabricator.wikimedia.org/T68699 [15:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:14:34] ^ matt_flaschen live everywhere [15:15:00] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304975 (https://phabricator.wikimedia.org/T140005) (owner: 10Ladsgroup) [15:15:06] (03CR) 10Gehel: [C: 031] "lgtm, and makes me learn something..." [puppet] - 10https://gerrit.wikimedia.org/r/305026 (owner: 10Muehlenhoff) [15:15:27] (03CR) 10Muehlenhoff: [C: 032] package_builder: Add maven-repo-helper to list of installed packages [puppet] - 10https://gerrit.wikimedia.org/r/305026 (owner: 10Muehlenhoff) [15:15:29] Thanks thcipriani, testing. 
[15:15:46] (03PS2) 10Thcipriani: Enable ORES review tool in plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304975 (https://phabricator.wikimedia.org/T140005) (owner: 10Ladsgroup) [15:16:02] (03CR) 10Thcipriani: Enable ORES review tool in plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304975 (https://phabricator.wikimedia.org/T140005) (owner: 10Ladsgroup) [15:16:12] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304975 (https://phabricator.wikimedia.org/T140005) (owner: 10Ladsgroup) [15:16:41] gah, gerrit. Wouldn't let me rebase, +2'd then said couldn't merge until I rebased :( [15:16:52] (03Merged) 10jenkins-bot: Enable ORES review tool in plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304975 (https://phabricator.wikimedia.org/T140005) (owner: 10Ladsgroup) [15:19:14] thcipriani, it looks good except one of the cookies is still a shorter duration. Investigating. [15:19:26] ack, thank you [15:22:04] Amir1: running maintenance scripts now [15:22:20] awesome! [15:23:09] Amir1: live on mw1099, check there please [15:23:22] (03PS1) 10BBlack: Experimental error handling for buggy Win+Chrome/41 [puppet] - 10https://gerrit.wikimedia.org/r/305029 (https://phabricator.wikimedia.org/T141786) [15:23:26] on it [15:25:12] thcipriani: it's okay! 
[15:25:22] Amir1: ack, going live everywhere [15:27:23] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:304975|Enable ORES review tool in plwiki (T140005)]] (duration: 00m 55s) [15:27:24] PROBLEM - Freshness of OCSP Stapling files on cp4006 is CRITICAL: Timeout while attempting connection [15:27:24] PROBLEM - configured eth on cp4006 is CRITICAL: Timeout while attempting connection [15:27:24] T140005: Deploy ORES review tool in Polish Wikipedia - https://phabricator.wikimedia.org/T140005 [15:27:24] PROBLEM - Varnish traffic logger - varnishxcps on cp4006 is CRITICAL: Timeout while attempting connection [15:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:27:34] Amir1: live everywhere [15:27:36] PROBLEM - dhclient process on cp4006 is CRITICAL: Timeout while attempting connection [15:27:36] PROBLEM - Varnish traffic logger - varnishreqstats on cp4006 is CRITICAL: Timeout while attempting connection [15:27:36] PROBLEM - Freshness of zerofetch successful run file on cp4006 is CRITICAL: Timeout while attempting connection [15:27:54] PROBLEM - HTTPS on cp4006 is CRITICAL: Return code of 255 is out of bounds [15:27:54] PROBLEM - Varnish traffic logger - varnishmedia on cp4006 is CRITICAL: Timeout while attempting connection [15:27:54] PROBLEM - Confd vcl based reload on cp4006 is CRITICAL: Timeout while attempting connection [15:27:54] PROBLEM - MD RAID on cp4006 is CRITICAL: Timeout while attempting connection [15:27:55] PROBLEM - Varnish traffic logger - varnishstatsd on cp4006 is CRITICAL: Timeout while attempting connection [15:27:55] PROBLEM - puppet last run on cp4006 is CRITICAL: Timeout while attempting connection [15:28:05] awesome, is cp4006 related to us? 
[15:28:14] PROBLEM - Varnish HTTP upload-frontend - port 3127 on cp4006 is CRITICAL: Connection timed out [15:28:14] PROBLEM - salt-minion processes on cp4006 is CRITICAL: Timeout while attempting connection [15:28:23] I don't think so. [15:28:24] PROBLEM - SSH on cp4006 is CRITICAL: Connection timed out [15:28:24] PROBLEM - Varnish traffic logger - varnishxcache on cp4006 is CRITICAL: Timeout while attempting connection [15:28:24] PROBLEM - Confd template for /etc/varnish/directors.backend.vcl on cp4006 is CRITICAL: Timeout while attempting connection [15:28:34] PROBLEM - traffic-pool service on cp4006 is CRITICAL: Timeout while attempting connection [15:28:44] PROBLEM - Confd template for /etc/varnish/directors.frontend.vcl on cp4006 is CRITICAL: Timeout while attempting connection [15:28:54] PROBLEM - DPKG on cp4006 is CRITICAL: Timeout while attempting connection [15:28:54] PROBLEM - Varnish HTCP daemon on cp4006 is CRITICAL: Timeout while attempting connection [15:28:54] PROBLEM - Varnishkafka log producer on cp4006 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:28:55] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp4006 is CRITICAL: Connection timed out [15:29:05] PROBLEM - Disk space on cp4006 is CRITICAL: Timeout while attempting connection [15:29:14] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp4006 is CRITICAL: Connection timed out [15:29:25] PROBLEM - confd service on cp4006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:29:31] okay [15:29:41] this looks worrying though [15:29:49] it does [15:30:08] it's just a single cache machine, which has many icinga checks to fail if it crashes [15:30:23] everything is fine [15:30:30] ah, ok, thanks :) [15:30:33] yes, it is not easy to do dynamic checks on icinga [15:30:47] and better spamming than not having them [15:32:04] (03PS15) 10BryanDavis: [WIP] Provision Striker via scap3 [puppet] - 10https://gerrit.wikimedia.org/r/301505 (https://phabricator.wikimedia.org/T141014) [15:33:04] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304976 (https://phabricator.wikimedia.org/T140153) (owner: 10Jdrewniak) [15:33:08] (03PS2) 10Thcipriani: Bumping portals to master New wikipedia.org layout. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304976 (https://phabricator.wikimedia.org/T140153) (owner: 10Jdrewniak) [15:33:15] (03CR) 10Thcipriani: Bumping portals to master New wikipedia.org layout. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304976 (https://phabricator.wikimedia.org/T140153) (owner: 10Jdrewniak) [15:33:22] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304976 (https://phabricator.wikimedia.org/T140153) (owner: 10Jdrewniak) [15:33:55] (03Merged) 10jenkins-bot: Bumping portals to master New wikipedia.org layout. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/304976 (https://phabricator.wikimedia.org/T140153) (owner: 10Jdrewniak) [15:34:06] !log depooling cp4006 [15:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:35:24] RECOVERY - Freshness of OCSP Stapling files on cp4006 is OK: OK [15:35:24] RECOVERY - configured eth on cp4006 is OK: OK - interfaces up [15:35:25] RECOVERY - Varnish traffic logger - varnishxcps on cp4006 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishxcps, UID = 0 (root) [15:35:28] jan_drewniak: live on mw1099 [15:35:35] RECOVERY - Freshness of zerofetch successful run file on cp4006 is OK: OK [15:35:35] RECOVERY - dhclient process on cp4006 is OK: PROCS OK: 0 processes with command name dhclient [15:35:35] RECOVERY - Varnish traffic logger - varnishreqstats on cp4006 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishreqstats, UID = 0 (root) [15:35:35] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0] [15:35:46] RECOVERY - Varnish traffic logger - varnishmedia on cp4006 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishmedia, UID = 0 (root) [15:35:46] RECOVERY - Confd vcl based reload on cp4006 is OK: reload-vcl successfully ran 0h, 0 minutes ago. 
[15:35:55] RECOVERY - MD RAID on cp4006 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [15:35:55] RECOVERY - Varnish traffic logger - varnishstatsd on cp4006 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishstatsd, UID = 0 (root) [15:35:55] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 33 minutes ago with 0 failures [15:35:55] RECOVERY - HTTPS on cp4006 is OK: SSLXNN OK - 36 OK [15:36:06] RECOVERY - Varnish HTTP upload-frontend - port 3127 on cp4006 is OK: HTTP OK: HTTP/1.1 200 OK - 400 bytes in 0.151 second response time [15:36:06] RECOVERY - salt-minion processes on cp4006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:36:15] RECOVERY - SSH on cp4006 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [15:36:15] RECOVERY - Confd template for /etc/varnish/directors.backend.vcl on cp4006 is OK: No errors detected [15:36:16] RECOVERY - Varnish traffic logger - varnishxcache on cp4006 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishxcache, UID = 0 (root) [15:36:25] RECOVERY - traffic-pool service on cp4006 is OK: OK - traffic-pool is active [15:36:35] RECOVERY - Confd template for /etc/varnish/directors.frontend.vcl on cp4006 is OK: No errors detected [15:36:44] RECOVERY - Varnishkafka log producer on cp4006 is OK: PROCS OK: 1 process with command name varnishkafka [15:36:44] RECOVERY - DPKG on cp4006 is OK: All packages OK [15:36:44] RECOVERY - Varnish HTCP daemon on cp4006 is OK: PROCS OK: 1 process with UID = 114 (vhtcpd), args vhtcpd [15:36:46] RECOVERY - Varnish HTTP upload-backend - port 3128 on cp4006 is OK: HTTP OK: HTTP/1.1 200 OK - 187 bytes in 0.154 second response time [15:36:56] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp4006 is OK: HTTP OK: HTTP/1.1 200 OK - 400 bytes in 0.149 second response time [15:36:56] RECOVERY - Disk space on cp4006 is OK: DISK OK [15:37:05] RECOVERY - confd service on cp4006 is OK: OK - confd is active 
[15:37:34] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [15:39:06] (03PS2) 10Alexandros Kosiaris: Introduce aluminium.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/305014 (https://phabricator.wikimedia.org/T134496) [15:39:11] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Introduce aluminium.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/305014 (https://phabricator.wikimedia.org/T134496) (owner: 10Alexandros Kosiaris) [15:39:26] thcipriani: "live on mw1099" don't know how to interpret that, also don't know what 'depooling' means. Does the patch have a chance of going live soon? [15:40:04] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: Puppet has 1 failures [15:41:15] PROBLEM - HTTPS-wmflabs on tools.wmflabs.org is CRITICAL: SSL CRITICAL - Certificate *.wmflabs.org valid until 2016-09-15 15:41:05 +0000 (expires in 29 days) [15:41:19] jan_drewniak: the patch is live on the server mw1099.eqiad.wmnet. It's a new step in the SWAT process so that we can check changes using https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug before they go live everywhere [15:43:50] thcipriani: ok, that sounds like a good idea. I have to install that chrome extension though, one sec [15:44:09] jan_drewniak: ack, sure, thanks :) [15:45:28] thcipriani: yup!
looks good on mw1099 :) [15:45:43] jan_drewniak: cool, thanks for checking, going live everywhere with the sync-portals script [15:47:37] !log thcipriani@tin Synchronized portals/prod/wikipedia.org/assets: SWAT: [[gerrit:304976|Bumping portals to master (T140153)]] (duration: 00m 51s) [15:47:38] T140153: Wikipedia.org Portal: updated page layout release to production - https://phabricator.wikimedia.org/T140153 [15:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:47:55] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:48:25] !log thcipriani@tin Synchronized portals: SWAT: [[gerrit:304976|Bumping portals to master (T140153)]] (duration: 00m 47s) [15:48:25] T140153: Wikipedia.org Portal: updated page layout release to production - https://phabricator.wikimedia.org/T140153 [15:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:48:42] jan_drewniak: should be live everywhere! [15:49:37] (03PS1) 10Alexandros Kosiaris: aluminium: Specify the correct IP [dns] - 10https://gerrit.wikimedia.org/r/305032 [15:50:17] thcipriani: looks like it is! thanks, let the storm of angry emails begin now... :P [15:51:00] (03CR) 10Alexandros Kosiaris: [C: 032] aluminium: Specify the correct IP [dns] - 10https://gerrit.wikimedia.org/r/305032 (owner: 10Alexandros Kosiaris) [15:52:11] thcipriani: we changed things up a bit here: www.wikipedia.org , people might have opinions :) [15:53:08] jan_drewniak: oh boy [15:53:39] well, deployment went well :) [15:53:51] :D [15:54:21] 06Operations, 10Cassandra, 10hardware-requests, 07Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2557458 (10Eevans) >>! In T136340#2554074, @GWicke wrote: >> Owing to their location, data-center support would be minimal > > This looks like the bi... 
[15:54:43] (03PS1) 10BBlack: caches: reduce FE mem size: 50% -> 40% [puppet] - 10https://gerrit.wikimedia.org/r/305034 (https://phabricator.wikimedia.org/T135384) [15:55:52] (03CR) 10BBlack: [C: 032] caches: reduce FE mem size: 50% -> 40% [puppet] - 10https://gerrit.wikimedia.org/r/305034 (https://phabricator.wikimedia.org/T135384) (owner: 10BBlack) [15:58:47] !log rebooting cp4006 [15:58:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:02:01] * thcipriani wait for CI :( [16:02:05] er waits [16:03:37] (03CR) 10BryanDavis: "Should be a production no-op change and a small adjustment of the umask-wikidev.sh contents on Labs hosts using the mediawiki_vagrant role" [puppet] - 10https://gerrit.wikimedia.org/r/304885 (owner: 10BryanDavis) [16:08:22] mlitn: fine to just sync flow change everywhere? Nothing to check on mw1099, correct? [16:08:42] sounds good [16:08:44] nothing to check indeed [16:08:48] ack, going [16:09:09] !log repooling cp4006 [16:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:09:17] !log bblack@palladium conftool action : set/pooled=yes; selector: name=cp4006.ulsfo.wmnet [16:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:09:33] oh, I guess I didn't have to log that manually heh [16:09:55] !log thcipriani@tin Synchronized php-1.28.0-wmf.14/extensions/Flow/maintenance/FlowRestoreLQT.php: SWAT: [[gerrit:304985|Query wiki DB for logging table, not Flow DB (T119509)]] (duration: 00m 57s) [16:09:56] T119509: Cleanup ptwikibooks conversion - https://phabricator.wikimedia.org/T119509 [16:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:10:06] ^ mlitn should be live [16:10:12] perfect, thanks! 
[16:12:26] PROBLEM - puppet last run on cp2012 is CRITICAL: CRITICAL: Puppet has 1 failures [16:13:59] !log rolling depooled restarts of varnish-frontend on ulsfo upload caches [16:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:14:09] (03CR) 10BryanDavis: [C: 032] PHP: Add php5-apcu [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/304406 (owner: 10Yuvipanda) [16:14:22] (03Merged) 10jenkins-bot: PHP: Add php5-apcu [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/304406 (owner: 10Yuvipanda) [16:14:24] (03Merged) 10jenkins-bot: Add tcl base / web images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/304889 (owner: 10Yuvipanda) [16:19:07] (03PS2) 10Aklapper: Clarify string in weekly Phabricator Project email [puppet] - 10https://gerrit.wikimedia.org/r/303500 (https://phabricator.wikimedia.org/T142347) [16:27:18] (03Draft2) 10Bartosz Dziewoński: Update workaround for broken Gerrit browser detection [puppet] - 10https://gerrit.wikimedia.org/r/305035 [16:32:29] (03PS16) 10BryanDavis: [WIP] Provision Striker via scap3 [puppet] - 10https://gerrit.wikimedia.org/r/301505 (https://phabricator.wikimedia.org/T141014) [16:36:39] (03PS6) 10Rush: labstore nfs: nfs client mount manager [puppet] - 10https://gerrit.wikimedia.org/r/304070 (https://phabricator.wikimedia.org/T140483) [16:37:34] RECOVERY - puppet last run on cp2012 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [16:37:44] (03PS7) 10Rush: labstore nfs: nfs client mount manager [puppet] - 10https://gerrit.wikimedia.org/r/304070 (https://phabricator.wikimedia.org/T140483) [16:38:59] (03CR) 10Rush: [C: 032] labstore nfs: nfs client mount manager [puppet] - 10https://gerrit.wikimedia.org/r/304070 (https://phabricator.wikimedia.org/T140483) (owner: 10Rush) [16:40:36] (03PS2) 10Giuseppe Lavagetto: prometheus::ops: allow using puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/304983 
(https://phabricator.wikimedia.org/T142846) [16:40:39] (03PS2) 10Giuseppe Lavagetto: ssh::client: allow using puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/304982 (https://phabricator.wikimedia.org/T142846) [16:40:42] (03PS1) 10Giuseppe Lavagetto: naggen2: add --puppetdb switch [puppet] - 10https://gerrit.wikimedia.org/r/305037 (https://phabricator.wikimedia.org/T142846) [16:42:45] (03Abandoned) 10Giuseppe Lavagetto: sshknowngen: add puppetdb-compatible version [puppet] - 10https://gerrit.wikimedia.org/r/304485 (https://phabricator.wikimedia.org/T142846) (owner: 10Giuseppe Lavagetto) [16:52:43] thcipriani, sorry not responding earlier. Based on my testing (including setting the clock to the future), it does not seem to cause practical issues. [16:53:09] matt_flaschen: ack, np, that's good :) [16:55:28] 06Operations, 10ops-codfw, 06Discovery: rack/setup/deploy wqds200[12] - https://phabricator.wikimedia.org/T142864#2557659 (10RobH) [16:56:24] 06Operations, 10ops-codfw, 06Discovery: rack/setup/deploy wqds200[12] - https://phabricator.wikimedia.org/T142864#2549225 (10RobH) [16:57:01] papaul: ^ those wdqs install params shifted, they are hardware raid1 for the dual 800gb ssds. [16:57:01] but task updated to reflect new stuff [16:57:10] hw raid1, parition recipe. [16:58:14] (03PS1) 10Gehel: package_builder: Add gradle-debian-helper to list of installed packages [puppet] - 10https://gerrit.wikimedia.org/r/305038 [16:59:05] (03PS17) 10BryanDavis: [WIP] Provision Striker via scap3 [puppet] - 10https://gerrit.wikimedia.org/r/301505 (https://phabricator.wikimedia.org/T141014) [17:00:05] yurik, gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160816T1700). 
[17:00:11] no parsoid deploy today [17:00:19] dono yet [17:00:25] no ores today [17:00:29] might need to depl kartotherian [17:04:43] robh: ok [17:10:25] PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: Puppet has 1 failures [17:19:25] (03PS3) 10Dzahn: Update workaround for broken Gerrit browser detection [puppet] - 10https://gerrit.wikimedia.org/r/305035 (owner: 10Bartosz Dziewoński) [17:22:41] (03CR) 10Dzahn: [C: 032] "thanks, yes, it said it would need this update after gerrit version change" [puppet] - 10https://gerrit.wikimedia.org/r/305035 (owner: 10Bartosz Dziewoński) [17:23:39] (03PS2) 10Dzahn: Phab: Remove unused phab-deploy-key files [puppet] - 10https://gerrit.wikimedia.org/r/301847 (owner: 10Chad) [17:24:27] (03CR) 10Dzahn: [C: 032] Phab: Remove unused phab-deploy-key files [puppet] - 10https://gerrit.wikimedia.org/r/301847 (owner: 10Chad) [17:25:14] (03CR) 10Chad: "We need to find a more future-proof way to deal with this, this is annoying as shit." [puppet] - 10https://gerrit.wikimedia.org/r/305035 (owner: 10Bartosz Dziewoński) [17:27:28] (03CR) 10Dzahn: "i don't know, see the comment in his change about gerrit normalizing the config and how it sometimes removes the " and sometimes does not " [puppet] - 10https://gerrit.wikimedia.org/r/304977 (owner: 10Paladox) [17:28:33] (03CR) 10Dzahn: "does Gerrit remove them when running init? then yea" [puppet] - 10https://gerrit.wikimedia.org/r/304977 (owner: 10Paladox) [17:30:05] (03CR) 10Muehlenhoff: [C: 04-1] "It seems you downloaded the wrong version, the maps cluster is running osmosis from Debian stable, but you downloaded the source from jes" [puppet] - 10https://gerrit.wikimedia.org/r/305038 (owner: 10Gehel) [17:31:04] (03CR) 10Dzahn: "how about the other reviewers? do we have consensus on this one?" 
[puppet] - 10https://gerrit.wikimedia.org/r/301149 (https://phabricator.wikimedia.org/T114161) (owner: 10Alex Monk) [17:37:35] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:43:00] (03Abandoned) 10Gehel: package_builder: Add gradle-debian-helper to list of installed packages [puppet] - 10https://gerrit.wikimedia.org/r/305038 (owner: 10Gehel) [17:43:46] 06Operations, 06Services, 07Service-deployment-requests, 15User-mobrovac: New service request - PDF Render - https://phabricator.wikimedia.org/T143129#2557819 (10mobrovac) [17:44:26] 06Operations, 06Services, 07Service-deployment-requests, 15User-mobrovac: New service request - PDF Render - https://phabricator.wikimedia.org/T143129#2557835 (10mobrovac) [17:44:30] 06Operations, 06Services, 06Services-next, 05Security, 15User-mobrovac: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2527912 (10mobrovac) [17:45:12] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1003, stat1002 and fluorine for chelsyx - https://phabricator.wikimedia.org/T142648#2557838 (10mpopov) [17:46:18] !log uploaded hhvm 3.12.7+dfsg+wmf1 for jessie-wikimedia to carbon (also includes a fix for T137642) [17:46:19] T137642: IcuCollation sort keys depend on PHP/HHVM version - https://phabricator.wikimedia.org/T137642 [17:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:47:44] 06Operations, 06Services, 06Services-next, 05Security, 15User-mobrovac: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2557895 (10mobrovac) [17:50:25] (03PS1) 10Giuseppe Lavagetto: role::kafka::main::mirror: allow fetching configs from hiera [puppet] - 10https://gerrit.wikimedia.org/r/305041 [17:50:31] <_joe_> ottomata: ^^ [17:50:47] <_joe_> ottomata: wildly untested and undocumented though 
[17:50:53] error 503 while accessing otrs [17:51:01] worth reporting? [17:51:16] <_joe_> Vito: continuing 503s or just one-off? [17:51:49] mh a burst of [17:51:51] now gone [17:51:51] (03CR) 10Ori.livneh: [C: 031] redis::instance: use specific aof/rdb file names by default [puppet] - 10https://gerrit.wikimedia.org/r/301789 (https://phabricator.wikimedia.org/T134400) (owner: 10Giuseppe Lavagetto) [17:51:59] (03CR) 10jenkins-bot: [V: 04-1] role::kafka::main::mirror: allow fetching configs from hiera [puppet] - 10https://gerrit.wikimedia.org/r/305041 (owner: 10Giuseppe Lavagetto) [17:52:02] (03CR) 10EBernhardson: [C: 031] Enable Language ID for Russian, Japanese, Portuguese Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304328 (https://phabricator.wikimedia.org/T142413) (owner: 10Tjones) [17:54:06] Hm, interseting _joe_! [17:54:24] same for etherpad [17:54:24] so now the hiera is done by cluster names and sites [17:54:35] I guess it's because of the eqiad clean up [17:54:55] and the new function uses that info and calls kafka_config to construct params for mirrors [17:54:55] hmmm [17:54:59] interesting [17:55:09] i would move whitelist out somehow, maybe put that in hiera too [17:55:14] but ok, i get the idea. [17:55:15] thanks [17:55:18] (03PS1) 10Gehel: package_builder: Add gradle to list of installed packages [puppet] - 10https://gerrit.wikimedia.org/r/305043 [17:55:19] will see what i can do with that [17:58:16] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [17:59:34] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 10Research-collaborations: Analytics cluster access request for ISI Foundation team - https://phabricator.wikimedia.org/T141634#2505524 (10RobH) I'm on clinic duty this week, and I want to ensure this is still pending the 4 users signing up on wikit... 
[18:02:05] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/305043 (owner: 10Gehel) [18:02:51] !log starting tile generation for zoom levels 0-10 on maps eqiad [18:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:04:36] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:04:40] (03PS1) 10Rush: labs: nfs client [puppet] - 10https://gerrit.wikimedia.org/r/305044 [18:04:52] (03PS2) 10Rush: labs: nfs client [puppet] - 10https://gerrit.wikimedia.org/r/305044 [18:06:44] (03PS1) 10RobH: shell access for ovasileva [puppet] - 10https://gerrit.wikimedia.org/r/305045 [18:07:27] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003, stat1002 and bast1001 for ovasileva - https://phabricator.wikimedia.org/T142502#2557953 (10RobH) Ok, in reviewing @JKatzWMF's access, he is in the following: bastiononly, researchers, statistics-users, & analytics-privatedata-users. As you ar... [18:07:33] (03CR) 10Rush: [C: 032] labs: nfs client [puppet] - 10https://gerrit.wikimedia.org/r/305044 (owner: 10Rush) [18:07:57] does gerrit log everyone else out daily? [18:08:36] I don't think our gerrit does it daily to me [18:08:43] but certainly more often [18:09:17] (03CR) 10RobH: [C: 032] shell access for ovasileva [puppet] - 10https://gerrit.wikimedia.org/r/305045 (owner: 10RobH) [18:09:27] (03PS2) 10RobH: shell access for ovasileva [puppet] - 10https://gerrit.wikimedia.org/r/305045 [18:10:22] Krenair: at least im not imagining it [18:10:28] it seems very very often. [18:10:50] every day for me it seems like at least :) [18:10:57] I take it in stride but it's annoying [18:13:00] 06Operations, 10Beta-Cluster-Infrastructure: Check status of under_NDA group - https://phabricator.wikimedia.org/T142822#2557968 (10greg) So.... remove `under_NDA`? [18:13:32] robh: have you tried clearing your cookies?
Maybe something is in a confused state after the upgrade [18:14:36] Maybe it's daily and I just haven't noticed [18:14:48] yeah i cleared them out post upgrade the first time i had to log back in [18:15:07] cleared then restarted the multiple 2fa apps to login to, heh [18:16:27] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003, stat1002 and bast1001 for ovasileva - https://phabricator.wikimedia.org/T142502#2558006 (10RobH) 05Open>03Resolved a:05ovasileva>03None Access request has been merged live to the cluster. Affected hosts may take up to 30 minutes to ca... [18:18:25] uh, https://gerrit.wikimedia.org/r/#/c/305045/ post merge build failed [18:18:33] this seems not normal, with workspace errors [18:18:44] (03PS2) 10BBlack: Experimental error handling for buggy Win+Chrome/41 [puppet] - 10https://gerrit.wikimedia.org/r/305029 (https://phabricator.wikimedia.org/T141786) [18:18:44] https://integration.wikimedia.org/ci/job/operations-puppet-doc/25613/console [18:21:13] what's the puppet-doc thing? it's trying to parse documentation from our manifests? [18:21:35] 06Operations, 10Icinga, 06Operations-Software-Development, 13Patch-For-Review: Automate creation of Phab task for failed disks - https://phabricator.wikimedia.org/T142085#2558037 (10Volans) [18:21:37] 06Operations, 06Operations-Software-Development, 10Packaging, 10Phabricator: Package Python phabricator module for both Ubuntu Precise and Debian Jessie - https://phabricator.wikimedia.org/T142097#2558035 (10Volans) 05Open>03Resolved python-phabricator (0.6.1-1) has been backported from Debian Stretch... 
[18:21:53] (03CR) 10BBlack: [C: 032] Experimental error handling for buggy Win+Chrome/41 [puppet] - 10https://gerrit.wikimedia.org/r/305029 (https://phabricator.wikimedia.org/T141786) (owner: 10BBlack) [18:21:56] yeah, i guess so, its not voting or anything [18:22:06] but i hadn't gotten it on the last few merges (a failure) [18:22:08] it creates https://doc.wikimedia.org/puppet/ [18:22:09] so i noticed it [18:22:11] afaik [18:22:45] plus its hating a file i didnt touch [18:23:02] but has just been introduced i imagine since my last merge yesterday [18:23:40] 06Operations, 10Domains, 10Traffic: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2558051 (10Mjohnson_WMF) [18:23:43] It doesn't seem to be site breaking or anything, and its post-merge so doesn't affect our actual general operations use, but i dont like to assume so wanted to bring it up in here. [18:23:48] yes robh, the "Unrecognised escape sequence '\.'" look like they existed all the time too [18:24:07] it looks more like something related to an upgrade of software on the integration slaves [18:24:21] it does seem new [18:25:20] nice, vcl syntax error :P [18:26:14] 06Operations, 10Domains, 10Traffic: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2558082 (10Mjohnson_WMF) [18:26:50] (03PS1) 10BBlack: VCL syntax bugfix for fa3cb7e2c [puppet] - 10https://gerrit.wikimedia.org/r/305048 (https://phabricator.wikimedia.org/T141786) [18:27:32] (03PS2) 10BBlack: VCL syntax bugfix for e7acb5adb [puppet] - 10https://gerrit.wikimedia.org/r/305048 (https://phabricator.wikimedia.org/T141786) [18:27:47] (03CR) 10BBlack: [C: 032 V: 032] VCL syntax bugfix for e7acb5adb [puppet] - 10https://gerrit.wikimedia.org/r/305048 (https://phabricator.wikimedia.org/T141786) (owner: 10BBlack) [18:30:55] PROBLEM - puppet last run on cp1065 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:31] 07Puppet, 10MediaWiki-extensions-ORES, 
06Revision-Scoring-As-A-Service, 15User-Ladsgroup: Move vagrant role to use ores in production - https://phabricator.wikimedia.org/T142618#2558116 (10Ladsgroup) 05Open>03Resolved [18:32:36] 07Puppet, 10ORES, 06Revision-Scoring-As-A-Service, 15User-Ladsgroup: Puppet config changes for ORES refactor - https://phabricator.wikimedia.org/T141575#2558118 (10Ladsgroup) [18:32:39] 07Puppet, 10ORES, 06Revision-Scoring-As-A-Service, 15User-Ladsgroup: Puppet config changes for ORES refactor - https://phabricator.wikimedia.org/T141575#2503579 (10Ladsgroup) [18:32:42] 07Puppet, 10ORES, 06Revision-Scoring-As-A-Service, 15User-Ladsgroup: Change CP to do several models at once. - https://phabricator.wikimedia.org/T142360#2558119 (10Ladsgroup) 05Open>03Resolved [18:34:36] (03CR) 10Dpatrick: "Ah, I see that block now Gergő. Thanks. And yes, that was the ticket I was thinking of." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303939 (owner: 10Gergő Tisza) [18:35:02] 06Operations, 10Ops-Access-Requests, 06Research-and-Data: Request for removing access to analytics machines - https://phabricator.wikimedia.org/T143140#2558134 (10leila) [18:35:30] (03PS1) 10BBlack: Chrome/41 fix: move to backend_response [puppet] - 10https://gerrit.wikimedia.org/r/305049 (https://phabricator.wikimedia.org/T141786) [18:35:47] (03CR) 10BBlack: [C: 032 V: 032] Chrome/41 fix: move to backend_response [puppet] - 10https://gerrit.wikimedia.org/r/305049 (https://phabricator.wikimedia.org/T141786) (owner: 10BBlack) [18:36:10] (03CR) 1020after4: [C: 031] phabricator: add systemd unit file for phd service [puppet] - 10https://gerrit.wikimedia.org/r/303740 (https://phabricator.wikimedia.org/T137928) (owner: 10Dzahn) [18:38:44] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 1 failures [18:38:45] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: Puppet has 1 failures [18:38:46] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 54 seconds ago 
with 0 failures [18:39:35] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: Puppet has 1 failures [18:39:35] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 1 failures [18:39:39] 06Operations, 10Domains, 10Traffic, 10Wikimedia-Site-requests: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2558153 (10Danny_B) [18:40:34] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures [18:40:36] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:40:46] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:41:14] 06Operations, 10Domains, 10Traffic, 10Wikimedia-Site-requests: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2558158 (10Danny_B) [18:41:19] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 10Research-collaborations: Analytics cluster access request for ISI Foundation team - https://phabricator.wikimedia.org/T141634#2558160 (10DarTar) @RobH: still waiting on 3 out of 4 (the team was on vacation) @Panisson is https://wikitech.wikimedia... 
[18:41:35] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:41:35] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:42:34] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:43:11] (03PS1) 10BBlack: Chrome/41 fix: move to vcl_recv [puppet] - 10https://gerrit.wikimedia.org/r/305050 (https://phabricator.wikimedia.org/T141786) [18:43:23] (03CR) 10BBlack: [C: 032 V: 032] Chrome/41 fix: move to vcl_recv [puppet] - 10https://gerrit.wikimedia.org/r/305050 (https://phabricator.wikimedia.org/T141786) (owner: 10BBlack) [18:43:57] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 10Research-collaborations: Analytics cluster access request for ISI Foundation team - https://phabricator.wikimedia.org/T141634#2558166 (10RobH) Sounds good, I'll prepare the patchset tomorrow for those with wikitech names by then. (Hopefully every... [18:52:55] 06Operations, 10Ops-Access-Requests, 06Research-and-Data: Request for removing access to analytics machines - https://phabricator.wikimedia.org/T143140#2558134 (10RobH) @leila: First, thanks for filing this. Its very easy to overlook the offboarding steps, so it is very appreciated! Ashwin's account curren... [19:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160816T1900). 
[19:03:00] jouncebot: chooo choo [19:08:55] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 10Research-collaborations: Analytics cluster access request for ISI Foundation team - https://phabricator.wikimedia.org/T141634#2558225 (10DarTar) @Daniela.paolotti is https://wikitech.wikimedia.org/wiki/User:Daniela.paolotti [19:10:23] (03PS1) 10BBlack: Chrome/41 fix: try 403 [puppet] - 10https://gerrit.wikimedia.org/r/305053 (https://phabricator.wikimedia.org/T141786) [19:10:44] :) [19:10:48] (03CR) 10BBlack: [C: 032 V: 032] Chrome/41 fix: try 403 [puppet] - 10https://gerrit.wikimedia.org/r/305053 (https://phabricator.wikimedia.org/T141786) (owner: 10BBlack) [19:11:34] 06Operations, 06Services, 07Service-deployment-requests, 15User-mobrovac: New service request - PDF Render - https://phabricator.wikimedia.org/T143129#2558244 (10GWicke) [19:13:35] (03CR) 10Ladsgroup: "We had this problem of the repo couldn't be cloned that got fixed yesterday. And now we can safely use the other repo. So it's okay to mer" [puppet] - 10https://gerrit.wikimedia.org/r/296687 (https://phabricator.wikimedia.org/T139008) (owner: 10Ladsgroup) [19:16:42] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 10Research-collaborations: Analytics cluster access request for ISI Foundation team - https://phabricator.wikimedia.org/T141634#2558270 (10Michele.tizzoni) Hi, I created my user on wikitech: https://wikitech.wikimedia.org/wiki/User:Michele.tizzoni [19:20:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [19:20:41] (03PS1) 10BBlack: Chrome/41 fix: maybe retry-after will slow it down? [puppet] - 10https://gerrit.wikimedia.org/r/305055 (https://phabricator.wikimedia.org/T141786) [19:21:11] (03CR) 10BBlack: [C: 032 V: 032] Chrome/41 fix: maybe retry-after will slow it down? 
[puppet] - 10https://gerrit.wikimedia.org/r/305055 (https://phabricator.wikimedia.org/T141786) (owner: 10BBlack) [19:25:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [19:26:36] gwicke: how did the meeting go? I was up all night so missed it. [19:29:04] note there might be a warning about 503/5xx on text. this is expected, and it's not natural traffic and not affecting "real" users... [19:30:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [19:31:04] (03PS2) 10Gehel: package_builder: Add gradle to list of installed packages [puppet] - 10https://gerrit.wikimedia.org/r/305043 [19:32:20] (03CR) 10Gehel: [C: 032] package_builder: Add gradle to list of installed packages [puppet] - 10https://gerrit.wikimedia.org/r/305043 (owner: 10Gehel) [19:34:35] (03PS1) 10BBlack: Chrome/41 fix: 401 seems to stop it cold... [puppet] - 10https://gerrit.wikimedia.org/r/305058 (https://phabricator.wikimedia.org/T141786) [19:35:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [19:35:21] (03PS2) 10BBlack: Chrome/41 fix: 401 seems to stop it cold... [puppet] - 10https://gerrit.wikimedia.org/r/305058 (https://phabricator.wikimedia.org/T141786) [19:35:28] (03CR) 10BBlack: [C: 032 V: 032] Chrome/41 fix: 401 seems to stop it cold... 
[puppet] - 10https://gerrit.wikimedia.org/r/305058 (https://phabricator.wikimedia.org/T141786) (owner: 10BBlack) [19:37:49] (03PS1) 10Gehel: package_builder: Add gradle to list of installed packages [puppet] - 10https://gerrit.wikimedia.org/r/305059 [19:38:58] (03CR) 10Gehel: [C: 032] package_builder: Add gradle to list of installed packages [puppet] - 10https://gerrit.wikimedia.org/r/305059 (owner: 10Gehel) [19:40:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [19:44:18] 06Operations, 10Traffic, 13Patch-For-Review: TLS stats regression related to Chrome/41 on Windows - https://phabricator.wikimedia.org/T141786#2558383 (10BBlack) Update from the spam of VCL experiments above (and several that weren't through the repo as one-off tests on one host): In attempting to interfere... [19:45:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [19:45:26] * Revent needs someone with shell access, presumably. [19:46:28] https://commons.wikimedia.org/w/index.php?title=File:Karol_May_-_Cyganie_i_przemytnicy.djvu <- need the cached thumbnails killed… [19:47:14] File was changed 5x days ago (and then even tried moving it) and 1024px thumbs have not updated. (but new 1025px thumbs are fine) [19:47:43] *they even tried [19:47:58] !log twentyafterfour@tin Started scap: sync testwiki to 1.28.0-wmf.15 refs T140971 [19:47:59] T140971: MW-1.28.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T140971 [19:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:48:04] 06Operations, 10Ops-Access-Requests, 06Research-and-Data: Request for removing access to analytics machines - https://phabricator.wikimedia.org/T143140#2558389 (10leila) @RobH sure. let's remove bastion access as well, and yes 'analytics machines' includes stat1002. I just made a copy of stat1002/home/ashwin... [19:48:36] I hate to ping the guy on duty…. [19:49:12] robh: ^ [19:49:55] hrmm, i have not done that. 
was there a task for it last time it had to happen (that you are aware of?) [19:50:11] Someone just hunted them down. [19:50:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: puppet fail [19:50:26] It was a while back, last time I personally had to ask. [19:50:48] https://upload.wikimedia.org/wikipedia/commons/thumb/e/e5/Karol_May_-_Cyganie_i_przemytnicy.djvu/page85-1025px-Karol_May_-_Cyganie_i_przemytnicy.djvu.jpg vs https://upload.wikimedia.org/wikipedia/commons/thumb/e/e5/Karol_May_-_Cyganie_i_przemytnicy.djvu/page85-1024px-Karol_May_-_Cyganie_i_przemytnicy.djvu.jpg [19:50:54] (if that helps) [19:50:54] I'm asking around in some other channels to see if someone is around who can assist (and then i can see hwo they do it for future) [19:51:48] Usually such things are just lag between the datacenters, and sort themselves after a few hours (and not everyone sees the broken one) [19:51:51] is one of those supposed to be a bad thumbnail? [19:51:57] oh, page is wrong [19:52:03] i see [19:52:12] (03CR) 10Chad: "Bumpity bump bump. Any blocker on this?" [puppet] - 10https://gerrit.wikimedia.org/r/301409 (owner: 10Chad) [19:52:13] The 1025 one is the changed file. [19:52:35] sorry, my brain was on another thumbnail generation issue, wasnt reading contents so much as looking for bad generation [19:52:59] (03CR) 10Chad: "Still super trivial since no callers using it (yet). Can we merge before there's needed collateral cleanup?" [puppet] - 10https://gerrit.wikimedia.org/r/301484 (owner: 10Chad) [19:53:11] Yeah, it’s not the thing you guys broke with jpegs the other day. :P [19:53:33] my instant fear was 'oh shit are there black lines?' ;] [19:53:50] (i was here, lol) [19:54:11] No one is answering me in the other channels. I'm not entirely certain who I should annoy, but if no one answers me in a few minutes I'll escalate to our team email [19:54:28] if ops clinic cannot do anythign else, we can annoy the rest of our team ;] [19:54:34] kk, thanks. 
[19:55:00] This is the appropriate public place to nag, right? I mean, it’s where has worked before, but... [19:55:14] RECOVERY - check_puppetrun on barium is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [19:55:36] (this is, afaik, a ‘once every few months’ thing) [19:58:39] (03Abandoned) 10Chad: Gerrit: Support footer prefix Task: for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/302482 (https://phabricator.wikimedia.org/T91001) (owner: 10Paladox) [19:59:33] (03Abandoned) 10Chad: Gerrit: Support http only, configured by a config [puppet] - 10https://gerrit.wikimedia.org/r/303146 (https://phabricator.wikimedia.org/T141803) (owner: 10Paladox) [20:02:27] (03CR) 10Dzahn: [C: 032] "no-op on iridium. new file on phab2001 http://puppet-compiler.wmflabs.org/3719/" [puppet] - 10https://gerrit.wikimedia.org/r/303740 (https://phabricator.wikimedia.org/T137928) (owner: 10Dzahn) [20:02:33] (03PS4) 10Dzahn: phabricator: add systemd unit file for phd service [puppet] - 10https://gerrit.wikimedia.org/r/303740 (https://phabricator.wikimedia.org/T137928) [20:06:37] 06Operations, 10ops-codfw: rack/setup/deploy wezen (codfw syslog) - https://phabricator.wikimedia.org/T143146#2558409 (10RobH) [20:08:28] (03CR) 10Dzahn: "@20after4 one step closer. systemd now recognizes the service as such." [puppet] - 10https://gerrit.wikimedia.org/r/303740 (https://phabricator.wikimedia.org/T137928) (owner: 10Dzahn) [20:10:18] robh: You could just “find -name” across the whole filesystem… :P [20:10:43] (I can’t imagine how long that would take, lol) [20:12:05] there's a script to purge urls [20:12:26] not sure if the problem is in the fs or in varnish [20:12:27] pugeList.php [20:14:43] Presumably all 182 pages are broken. [20:14:58] (since it’s a djvu) [20:17:16] if its varnish, there are notes on how one shouldnt just blindly try to purge things =P [20:17:32] Heh. 
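Note on the thumbnail discussion above: the stale URLs follow the pattern visible in the two links quoted at 19:50:48, so a candidate list for a purge can be generated rather than hunted down by hand. The following is only a sketch under assumptions: that pages are numbered 1 through 182 (the page count mentioned at 20:14:43), that only the 1024px width is stale, and that whoever runs the purge feeds the list to the appropriate tool after checking the varnish caveats raised at 20:17:16. The helper names `thumb_url` and `thumb_urls` are hypothetical.

```python
# Sketch: build the list of 1024px thumbnail URLs for every page of the DjVu.
# BASE and FILENAME are copied from the URLs quoted in the log; PAGE_COUNT is
# the "182 pages" figure from the log. Function names are illustrative only.

BASE = "https://upload.wikimedia.org/wikipedia/commons/thumb/e/e5"
FILENAME = "Karol_May_-_Cyganie_i_przemytnicy.djvu"
PAGE_COUNT = 182


def thumb_url(page, width, filename=FILENAME, base=BASE):
    """Return the thumbnail URL for one page at one pixel width."""
    return f"{base}/{filename}/page{page}-{width}px-{filename}.jpg"


def thumb_urls(width, page_count=PAGE_COUNT):
    """Return thumbnail URLs for pages 1..page_count at the given width."""
    return [thumb_url(page, width) for page in range(1, page_count + 1)]


if __name__ == "__main__":
    # One URL per line, suitable for review before any purge is attempted.
    for url in thumb_urls(1024):
        print(url)
```

Printing one URL per line keeps the list easy to review before anything is purged; other stale widths, if any, would need their own pass.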
[20:17:34] I'm finishing the task im on and then writing up an email to ops to try to figure it out (no one spoke up in other channels) [20:18:23] Tis not urgent, really, just needs sorted for the sake of the plws guys. [20:19:32] no worries, thanks for bringing it up =] [20:19:38] (03PS1) 10RobH: disabling user ashwinpp [puppet] - 10https://gerrit.wikimedia.org/r/305077 [20:20:06] 06Operations, 10Traffic, 13Patch-For-Review: TLS stats regression related to Chrome/41 on Windows - https://phabricator.wikimedia.org/T141786#2558501 (10BBlack) The interesting bit is that while these were prominent requests, apparently Chrome/41-on-Windows isn't the whole story of the mysterious rise in `EC... [20:20:14] RECOVERY - check_ipn_redir on mintaka is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 460 bytes in 0.012 second response time [20:20:22] (03CR) 10RobH: [C: 032] disabling user ashwinpp [puppet] - 10https://gerrit.wikimedia.org/r/305077 (owner: 10RobH) [20:26:12] 06Operations, 06Commons, 06Multimedia, 10media-storage: Storage backend errors on commons when deleting/restoring pages - https://phabricator.wikimedia.org/T141704#2558555 (10Revent) As an update, the first file I mentioned being unable to restore worked just now. The second still gives the error. [20:26:46] 06Operations, 10Domains, 10Traffic, 10Wikimedia-Site-requests: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2558051 (10Dzahn) Hi, this will be almost completely like creating a public wiki, except for some config options later on, afaict. All the steps for a new w... 
[20:26:59] * Revent still afraid to attempt history splits ^ [20:27:38] (03PS1) 10Thcipriani: scap: bump version to 3.2.3-1 [puppet] - 10https://gerrit.wikimedia.org/r/305078 [20:27:39] 06Operations, 10Ops-Access-Requests, 06Research-and-Data: Request for removing access to analytics machines - https://phabricator.wikimedia.org/T143140#2558565 (10RobH) 05Open>03Resolved a:03RobH I've merged https://gerrit.wikimedia.org/r/#/c/305077/ which disables access. Systems may take up to 30 mi... [20:28:12] (03CR) 10Thcipriani: [C: 04-1] "Version not yet uploaded to carbon" [puppet] - 10https://gerrit.wikimedia.org/r/305078 (owner: 10Thcipriani) [20:28:38] 06Operations, 10Traffic: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#1323201 (10BBlack) While looking into T141786 I stumbled on this again... lots of probably-illegitimate traffic to geoiplookup.wm.o with no referer header and no user-agent, spamming from all over. So it's bugg... [20:34:00] 06Operations, 10Domains, 10Traffic, 10Wikimedia-Site-requests: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2558621 (10Dzahn) We should be sure "projectcom" is the name to stick too, as renaming wikis later is not really a viable option. I'll start with that "step... [20:34:25] 06Operations, 10Security-Reviews, 06Services, 06Services-next, 15User-mobrovac: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2558625 (10dpatrick) [20:37:50] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06Release-Engineering-Team, and 3 others: Determine a core set or a checklist of permissions for deployment purpose - https://phabricator.wikimedia.org/T140270#2558632 (10greg) > @RobH moved this task from Backlog to In Discussion on the Ops-Acce... 
[20:39:08] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06Release-Engineering-Team, and 3 others: Determine a core set or a checklist of permissions for deployment purpose - https://phabricator.wikimedia.org/T140270#2558655 (10RobH) Correct, this seems to be under discussion overall, not within the te... [20:40:29] (03PS1) 10Dzahn: realm: add 'projectcom' to private wiki list [puppet] - 10https://gerrit.wikimedia.org/r/305095 (https://phabricator.wikimedia.org/T143138) [20:40:43] !log twentyafterfour@tin Finished scap: sync testwiki to 1.28.0-wmf.15 refs T140971 (duration: 52m 45s) [20:40:45] T140971: MW-1.28.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T140971 [20:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:42:58] (03PS2) 10Dzahn: realm: add 'projectcom' to private wiki list [puppet] - 10https://gerrit.wikimedia.org/r/305095 (https://phabricator.wikimedia.org/T143138) [20:47:36] (03PS1) 10Dzahn: add projectcom.wikimedia.org for new private wiki [dns] - 10https://gerrit.wikimedia.org/r/305120 (https://phabricator.wikimedia.org/T143138) [20:47:45] (03PS1) 1020after4: group0 wikis to 1.28.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305121 [20:47:56] (03CR) 1020after4: [C: 032] group0 wikis to 1.28.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305121 (owner: 1020after4) [20:48:22] (03Merged) 10jenkins-bot: group0 wikis to 1.28.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/305121 (owner: 1020after4) [20:48:39] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 wikis to 1.28.0-wmf.15 [20:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:58:28] 06Operations, 10ops-codfw: codfw: rack/setup/deploy wezen (syslog server) switch configuration - https://phabricator.wikimedia.org/T143153#2558787 (10Papaul) [20:59:41] (03PS5) 10Krinkle: mwgrep: fails 
gracefully when an invalid regex is provided [puppet] - 10https://gerrit.wikimedia.org/r/302892 (https://phabricator.wikimedia.org/T141996) (owner: 10DCausse) [20:59:50] (03CR) 10Krinkle: [C: 031] mwgrep: fails gracefully when an invalid regex is provided [puppet] - 10https://gerrit.wikimedia.org/r/302892 (https://phabricator.wikimedia.org/T141996) (owner: 10DCausse) [21:00:31] 06Operations, 10ops-codfw: rack/setup/deploy wezen (codfw syslog) - https://phabricator.wikimedia.org/T143146#2558821 (10RobH) [21:00:32] 06Operations, 10ops-codfw: codfw: rack/setup/deploy wezen (syslog server) switch configuration - https://phabricator.wikimedia.org/T143153#2558819 (10RobH) 05Open>03Resolved set port to enabled, set description to wezen, and placed in the private1-d vlan. [21:07:18] (03CR) 10Dzahn: [C: 031] "not entirely no-op in prod because of the rename from modules/role/files to modules/files but that looks good to me . compiler: http://pup" [puppet] - 10https://gerrit.wikimedia.org/r/304885 (owner: 10BryanDavis) [21:07:27] (03PS2) 10Dzahn: Extract /etc/profile.d/umask-wikidev.sh to a shared class [puppet] - 10https://gerrit.wikimedia.org/r/304885 (owner: 10BryanDavis) [21:11:48] 06Operations, 06Commons, 06Multimedia, 10media-storage: Storage backend errors on commons when deleting/restoring pages - https://phabricator.wikimedia.org/T141704#2558846 (10aaron) >>! In T141704#2558555, @Revent wrote: > As an update, the first file I mentioned being unable to restore worked just now. Th... [21:13:23] 06Operations, 10Domains, 10Traffic, 10Wikimedia-Site-requests, 13Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2558051 (10Platonides) Why "projectcom"? I would have expected something like "projectgrants" or "grantscom" / "grantscommittee". Also,... 
[21:13:57] 06Operations, 06Commons, 06Multimedia, 10media-storage: Storage backend errors on commons when deleting/restoring pages - https://phabricator.wikimedia.org/T141704#2558851 (10Revent) Yes, nothing until I actually hit 'restore'. This is apparently quite uncommon, but a bit catastrophic when it does occur. [21:14:49] 06Operations, 10ops-codfw: rack/setup/deploy wezen (codfw syslog) - https://phabricator.wikimedia.org/T143146#2558857 (10Papaul) [21:15:10] 06Operations, 06Commons, 06Multimedia, 10media-storage: Storage backend errors on commons when deleting/restoring pages - https://phabricator.wikimedia.org/T141704#2558859 (10Revent) As a further note, the deleted file, and page revisions, are visible from the history (when, of course, looking at deleted r... [21:15:11] (03CR) 10Dzahn: [C: 032] Extract /etc/profile.d/umask-wikidev.sh to a shared class [puppet] - 10https://gerrit.wikimedia.org/r/304885 (owner: 10BryanDavis) [21:18:07] 06Operations, 10Ops-Access-Requests, 06Research-and-Data: Request for removing access to analytics machines - https://phabricator.wikimedia.org/T143140#2558913 (10leila) Thank you very much! 
:) [21:23:37] (03PS1) 10Papaul: DNS: Add mgmt entries for wezen (new syslog server) Bug:T143146 [dns] - 10https://gerrit.wikimedia.org/r/305130 (https://phabricator.wikimedia.org/T143146) [21:24:27] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/304885 (owner: 10BryanDavis) [21:29:51] (03CR) 10RobH: [C: 032] DNS: Add mgmt entries for wezen (new syslog server) Bug:T143146 [dns] - 10https://gerrit.wikimedia.org/r/305130 (https://phabricator.wikimedia.org/T143146) (owner: 10Papaul) [21:29:58] (03PS2) 10RobH: DNS: Add mgmt entries for wezen (new syslog server) Bug:T143146 [dns] - 10https://gerrit.wikimedia.org/r/305130 (https://phabricator.wikimedia.org/T143146) (owner: 10Papaul) [21:32:09] 06Operations, 10ops-codfw: rack/setup/deploy wezen (codfw syslog) - https://phabricator.wikimedia.org/T143146#2558982 (10Papaul) [21:32:25] (03CR) 10Dzahn: [C: 031] "yep, just waiting a bit to merge with other config changes to reduce number of gerrit restarts per day/week" [puppet] - 10https://gerrit.wikimedia.org/r/304977 (owner: 10Paladox) [21:32:28] 06Operations, 10ops-codfw: rack/setup/deploy wezen (codfw syslog) - https://phabricator.wikimedia.org/T143146#2558409 (10Papaul) [21:34:52] (03CR) 10Dzahn: Configure phabricator database cluster settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/300494 (https://phabricator.wikimedia.org/T112776) (owner: 1020after4) [21:37:34] !log restarted grrrit-wm for config change 304746 [21:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:41:02] (03PS2) 10RobH: Decommission:Remove mgmt DNS entries for es2005-es2010 from Bug:T134755 [dns] - 10https://gerrit.wikimedia.org/r/293792 (https://phabricator.wikimedia.org/T134755) (owner: 10Papaul) [21:41:45] (03CR) 10RobH: [C: 032] Decommission:Remove mgmt DNS entries for es2005-es2010 from Bug:T134755 [dns] - 10https://gerrit.wikimedia.org/r/293792 (https://phabricator.wikimedia.org/T134755) (owner: 10Papaul) [21:42:06] 
(03PS3) 1020after4: Configure phabricator database cluster settings [puppet] - 10https://gerrit.wikimedia.org/r/300494 (https://phabricator.wikimedia.org/T112776) [21:43:13] (03CR) 1020after4: "@dzahn: Thanks for catching that." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/300494 (https://phabricator.wikimedia.org/T112776) (owner: 1020after4) [21:43:15] (03CR) 10Dzahn: "what's the status here paladox, you said yourself it doesn't work (yet). are you planning to amend later?" [puppet] - 10https://gerrit.wikimedia.org/r/301849 (owner: 10Paladox) [21:43:40] (03CR) 1020after4: [C: 031] Configure phabricator database cluster settings [puppet] - 10https://gerrit.wikimedia.org/r/300494 (https://phabricator.wikimedia.org/T112776) (owner: 1020after4) [21:47:27] (03PS4) 10Ottomata: Mirror main-eqiad into main-codfw [puppet] - 10https://gerrit.wikimedia.org/r/304928 (https://phabricator.wikimedia.org/T134184) [21:47:55] (03CR) 10Ottomata: "I've absorbed this idea into https://gerrit.wikimedia.org/r/#/c/304928, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/305041 (owner: 10Giuseppe Lavagetto) [22:02:30] (03CR) 10jenkins-bot: [V: 04-1] Mirror main-eqiad into main-codfw [puppet] - 10https://gerrit.wikimedia.org/r/304928 (https://phabricator.wikimedia.org/T134184) (owner: 10Ottomata) [22:16:26] 06Operations, 10Wikimedia-Logstash, 03Discovery-Search-Sprint: Elasticsearch restarts are failing in the logstash cluster - https://phabricator.wikimedia.org/T142357#2532191 (10debt) Hi @dcausse is there anything else we need to do? [22:21:31] 06Operations, 10Domains, 10Traffic, 10Wikimedia-Site-requests, 13Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2559236 (10Mjohnson_WMF) Program Officers are already using projectgrants and projectcom as handles to distinguish between general progr... [22:22:09] how do we handle regressions that the new train is causing? 
https://phabricator.wikimedia.org/T143165 [22:22:34] (03CR) 10RobH: [C: 032] decom dickson - remove from install_server [puppet] - 10https://gerrit.wikimedia.org/r/302761 (owner: 10RobH) [22:22:37] yurik: mark it as blocking the current deploy task [22:22:40] (03PS2) 10RobH: decom dickson - remove from install_server [puppet] - 10https://gerrit.wikimedia.org/r/302761 [22:22:49] we have a couple already :/ https://phabricator.wikimedia.org/T140971 [22:25:01] (03PS18) 10BryanDavis: Provision Striker via scap3 [puppet] - 10https://gerrit.wikimedia.org/r/301505 (https://phabricator.wikimedia.org/T141014) [22:25:34] yurik: given we're just on testwikis, we can stay as we are and backport to wmf.15 for that issue, right? (iow: do you think we should rollback wmf.15 due to that issue?) [22:26:49] greg-g, i noticed it on mw.org. Its not a "oh my god, stop the press" bug, but it will prevent people from developing new graphs [22:27:19] yurik: k, noted, will prevent rolling forward, are yo uworking on a fix? [22:27:39] greg-g, considering that i'm horrible with the frontend... might take a while [22:27:49] yurik: what are your suggestions, then? [22:28:00] are there anyone who knows codeeditor? 
[22:28:13] basically codeeditor is not firing "text has changed" event [22:28:47] git log works for both of us :) [22:29:00] greg-g, hehe, assuming that's what caused it :) [22:29:10] yurik: since you know more about the issue, I'd like you to be on point for fixing/finding someone who will [22:29:13] git diff 954f8a2a757b4acf6c748ceb11c87b94cd0a3112^..954f8a2a757b4acf6c748ceb11c87b94cd0a3112 [22:29:50] * yurik is looking for Thalia [22:31:49] kaldari is really good with this stuff :) [22:33:10] yurik, * [tchanders] idle 96:36:50 [22:33:14] probably in the office though [22:33:39] Krenair, i wonder if you ever miss a beat on IRC :) [22:36:54] (03PS1) 10BryanDavis: Provision Tool Labs admin console (Striker) on Californium [puppet] - 10https://gerrit.wikimedia.org/r/305141 (https://phabricator.wikimedia.org/T136256) [22:36:56] (03PS1) 10BryanDavis: Add toolsadmin.wikimedia.org to misc varnish [puppet] - 10https://gerrit.wikimedia.org/r/305142 (https://phabricator.wikimedia.org/T136256) [22:37:14] 06Operations, 10Wikimedia-Logstash, 03Discovery-Search-Sprint: Elasticsearch restarts are failing in the logstash cluster - https://phabricator.wikimedia.org/T142357#2532191 (10EBernhardson) I took a look over the github issue and added a comment.I see the mention above that the 2016-07-16 index allows repro... [22:41:04] (03PS1) 10BryanDavis: Add toolsadmin.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/305143 (https://phabricator.wikimedia.org/T136256) [22:42:28] (03PS1) 10Papaul: DNS: Add production DNS for wezen (new syslog server) Bug:T143146 [dns] - 10https://gerrit.wikimedia.org/r/305144 (https://phabricator.wikimedia.org/T143146) [22:44:13] 06Operations, 10ops-codfw: rack/setup/deploy wezen (codfw syslog) - https://phabricator.wikimedia.org/T143146#2559337 (10Papaul) [22:46:15] RECOVERY - PHD should be supervising processes on phab2001 is OK: PROCS OK: 21 processes with UID = 997 (phd) [22:47:10] mutante: why codfw is paging? 
^^^ [22:50:19] volans: maybe my fault? I just ran puppet on phab2001 and it probably just now set up the icinga rules [22:50:28] I'm not sure how to disable paging [22:50:44] (this would be the first time puppet ran to completion on that machine) [22:51:04] twentyafterfour: not "your" fault, but yeah, looks like the start of phd made Icinga happy hence the recovery [22:51:32] phab2001 should be marked as a backup - failures there shouldn't page anyone, at least not just yet [22:51:41] I'm wondering why was configured to page being codfw [22:51:43] is there a way to do that in icinga or should it be in puppet config? [22:51:52] volans: I'm not sure about that part [22:52:36] twentyafterfour: puppet, on Icinga you can suppress the notification of a host and all it's services if needed, if it's still a work in progress might be useful anyway [22:54:19] I'm looking at the config and also check_https_phabricator is configured to page, but just those 2 [22:54:36] see modules/phabricator/manifests/monitoring.pp [22:54:43] for the one that paged right now [22:54:54] ok so I need to fix it to be host dependent somehow [22:55:00] and the other too [22:55:29] you could check $::site [22:55:40] if you don't have already something else that define who is the master [22:55:44] or the active [22:56:08] but $::site usually is used for more MW specific stuff [22:56:18] I guess you want to keep Phab independente [22:56:30] s/independente/independent/ [22:57:40] ACKNOWLEDGEMENT - PHD should be supervising processes on phab2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (phd) 20after4 server not yet live. [22:58:17] twentyafterfour: again? [22:58:29] again? 
[22:58:50] yep, back to critical [22:58:53] I thought acknowledgement would shut it up [22:59:06] it's back to critical because I don't actually want phd running [22:59:21] the previous page was a recovery [22:59:52] and the ack send a notification too ;) [23:00:06] RoanKattouw, ostriches, MaxSem, and Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160816T2300). Please do the needful. [23:00:06] MaxSem and Yurik: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:12] * MaxSem will deploy [23:00:19] thx [23:01:18] volans: sorry, I will make a patch for phabricator::monitoring [23:02:00] twentyafterfour: you can disable notifications for the service or the whole host w/ all services in the meanwhile if you plan to have it flapping (starting and stopping phd) [23:06:04] (03PS1) 1020after4: Make phabricator monitoring dependent on $::site [puppet] - 10https://gerrit.wikimedia.org/r/305149 [23:06:43] yurik, pulled on mw1099 [23:06:56] volans: ^ I disabled notifications in web ui and also submitted a patch for puppet ^ [23:06:57] MaxSem, testing... 
[23:07:03] thanks [23:07:59] (03CR) 10jenkins-bot: [V: 04-1] Make phabricator monitoring dependent on $::site [puppet] - 10https://gerrit.wikimedia.org/r/305149 (owner: 1020after4) [23:08:40] MaxSem, all's good [23:11:05] !log maxsem@tin Synchronized php-1.28.0-wmf.15/extensions/Graph: https://gerrit.wikimedia.org/r/#/c/305145/ (duration: 00m 59s) [23:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:11:14] yurik, ^ [23:12:00] MaxSem, perfect, thx [23:12:01] works [23:13:27] !log maxsem@tin Synchronized php-1.28.0-wmf.15/extensions/Kartographer: https://gerrit.wikimedia.org/r/#/c/305080/ (duration: 00m 51s) [23:13:30] jdlrobson: do you think you and Krinkle want to backport/swat https://gerrit.wikimedia.org/r/#/c/305147/ tonight? [23:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:14:45] (03PS2) 1020after4: Make phabricator monitoring dependent on $::site [puppet] - 10https://gerrit.wikimedia.org/r/305149 [23:14:55] twentyafterfour: great thanks [23:16:04] (03CR) 10Volans: Make phabricator monitoring dependent on $::site (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/305149 (owner: 1020after4) [23:16:30] (03PS2) 10MaxSem: wmgUseWPB → wmgUseWikidataPageBanner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304631 (owner: 10Dereckson) [23:16:45] (03CR) 10MaxSem: [C: 032] wmgUseWPB → wmgUseWikidataPageBanner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304631 (owner: 10Dereckson) [23:17:11] (03Merged) 10jenkins-bot: wmgUseWPB → wmgUseWikidataPageBanner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/304631 (owner: 10Dereckson) [23:17:38] 06Operations, 10Ops-Access-Requests, 10Striker: deploy-service access for bd808 - https://phabricator.wikimedia.org/T143174#2559450 (10bd808) [23:19:13] 06Operations, 10Ops-Access-Requests, 10Striker: deploy-service access for bd808 - https://phabricator.wikimedia.org/T143174#2559464 (10kaldari) I approve as 
Bryan's manager. [23:20:53] (03PS1) 10BryanDavis: Add bd808 (Bryan Davis) to deploy-service group [puppet] - 10https://gerrit.wikimedia.org/r/305152 (https://phabricator.wikimedia.org/T143174) [23:21:38] greg-g: yes that would be good. How do I do that? [23:22:01] jdlrobson: get that patch mered and backport to wmf.15, I presume :) (and add to swat) [23:22:58] (03CR) 1020after4: Make phabricator monitoring dependent on $::site (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/305149 (owner: 1020after4) [23:23:02] greg-g: today's SWAT? [23:23:25] if you can't make it we can do tomorrow morning [23:23:42] jdlrobson, , I can deploy if you act quickly [23:23:57] MaxSem: 305153 [23:24:13] https://gerrit.wikimedia.org/r/305153 [23:24:25] ill add it to deployment wiki page for prosperity [23:24:35] ty [23:25:13] !log maxsem@tin Synchronized wmf-config: https://gerrit.wikimedia.org/r/#/c/304631/2 (duration: 00m 53s) [23:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:25:22] greg-g: MaxSem recorded on wiki [23:26:41] grr grr, dear APC [23:28:47] Hi. 
[23:29:25] yo Dereckson - your change expectedly caused some logspam, waiting for it to settle down [23:29:31] MaxSem: pending more atomic directory deployment, the best way is to duplicate the renamed setting array in IS, deploy IS, remove the duplicated part, deploy wmf-config [23:29:47] (the best way to avoid this kind of spam) [23:29:58] or an isset() in CS ;) [23:30:18] icinga-wm: speak up [23:30:46] works too, it seems reed.y used CS isset method for last bunch of extension registration [23:31:33] last time we got that, 90-120 minutes were needed to don't see anymore the error in fatalmonitor [23:32:52] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: touch (duration: 00m 52s) [23:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:34:52] logstash is clear now though fatalmonitor is still going bonkers [23:35:57] I need to sync https://gerrit.wikimedia.org/r/#/c/305148/ if it's all clear? [23:36:39] are the fatalmonitor numbers going down MaxSem? It grabs the last N lines from hhvm.log and can take a while to clear when things get quiet [23:36:59] greg-g: jdlrobson: Thx for the deploy, yeah we should SWAT it [23:37:12] Along with https://gerrit.wikimedia.org/r/#/c/305127/ [23:37:21] To avoid changing behaviour next week. [23:37:23] Last time, it slightly raise during 30 minutes, then stabilized during 60, then slowly decreased during 30 minutes. 
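Note on the fatalmonitor remark at 23:36:39: a counter that only looks at the last N lines of a log keeps reporting old errors until enough new lines push them out of the window, which may be why it lags behind logstash once things get quiet. The sketch below simulates that effect; it is not fatalmonitor's actual code, and the `ERROR` marker, the function names, and the numbers are all assumptions chosen for illustration.

```python
# Sketch: why a "last N lines" error count decays slowly after a burst.
# This simulates the behaviour described in the log; it does not reproduce
# fatalmonitor itself. "ERROR" is a placeholder marker, not a real log format.
from collections import deque


def count_recent(lines, n, needle):
    """Count lines containing `needle` among the last `n` lines."""
    window = deque(lines, maxlen=n)  # keeps only the newest n lines
    return sum(1 for line in window if needle in line)


def simulate(burst, quiet_lines_per_step, steps, n):
    """Log a burst of errors, then quiet traffic; return the count per step."""
    log = ["ERROR placeholder"] * burst
    counts = []
    for _ in range(steps):
        log.extend(["ok"] * quiet_lines_per_step)
        counts.append(count_recent(log, n, "ERROR"))
    return counts


if __name__ == "__main__":
    # 100 errors, then 10 clean lines per step, window of 100 lines.
    print(simulate(burst=100, quiet_lines_per_step=10, steps=12, n=100))
    # prints [90, 80, 70, 60, 50, 40, 30, 20, 10, 0, 0, 0]
```

The fewer lines written per step after the burst, the longer the old errors stay inside the window, so a quiet log clears more slowly than a busy one.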
[23:38:07] bd808, kinda stabilized [23:38:30] here it seems stable at 35749 (35745 35743 35742 writing this) [23:39:03] interestingly, the number of errors doesn't look similar to Kibana which has fewer [23:39:14] !log twentyafterfour@tin Synchronized php-1.28.0-wmf.15/extensions/BetaFeatures: deploy https://gerrit.wikimedia.org/r/#/c/305148/ (duration: 00m 49s) [23:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:39:22] Krinkle: unsure what you mean by changing behavior next week, since that is merged in master ti'll be in next weeks release. Doe sit need to be in this weeks? (pardon typos, working in vehicle) [23:39:43] (not driving, in a parking lot outside the annoying closed-too-early coffee shop) [23:40:01] coffee shops should not close until at least midnight [23:40:08] preferrably they should never close [23:40:12] greg-g: My RL-queue reformatting patch changed something, it broke something for Twn (Niklas) which Gilles fixed by making a patch that also changes behaviour in some ways. I then made another patch that restores old behaviour but kept the fix. [23:40:15] agreed, unfortunately I haven't found that in non-college towns :( [23:40:21] I'd like to avoid rolling out the new behaviour further to new wikis [23:40:25] only to change it back next week [23:40:29] heh, confusing much? :) [23:40:38] It already cuased one expected thing, such as the Beta Features popups suddenly working again [23:40:54] Krinkle: since that's merged in master, feel free to backport now [23:40:59] Okay [23:41:00] Thanks [23:41:03] thank you [23:41:20] (backport and deploy, since I was ambiguous) [23:41:31] MaxSem: Okay, let me know when you're done. 
I'll roll out this one - https://gerrit.wikimedia.org/r/#/c/305154/ [23:43:53] !log maxsem@tin Synchronized php-1.28.0-wmf.14/extensions/Kartographer: https://gerrit.wikimedia.org/r/#/c/305080/ (duration: 00m 51s) [23:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:44:18] Operations, Domains, Traffic, Wikimedia-Site-requests, Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2559538 (Dereckson) [ Moving in discussion on the site requests workboard pending definitive name choice. ] [23:45:02] (CR) Volans: [C: -1] "Thanks to tackle this!" [software] - https://gerrit.wikimedia.org/r/295607 (https://phabricator.wikimedia.org/T138450) (owner: Alex Monk) [23:45:14] (PS2) MaxSem: Enable WikidataPageBanner on he.wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/304632 (https://phabricator.wikimedia.org/T140717) (owner: Dereckson) [23:45:21] (CR) MaxSem: [C: 2] Enable WikidataPageBanner on he.wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/304632 (https://phabricator.wikimedia.org/T140717) (owner: Dereckson) [23:45:48] (Merged) jenkins-bot: Enable WikidataPageBanner on he.wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/304632 (https://phabricator.wikimedia.org/T140717) (owner: Dereckson) [23:49:36] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/304632/2 (duration: 00m 54s) [23:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:49:42] Dereckson, ^ [23:49:52] (I tested on mw1099) [23:50:23] Thanks for the deploy. [23:50:25] Krinkle, I'm done. I merged the earlier change, but not deployed - can do so [23:51:25] Do we have somewhere a note about how to unstick global account renames? 
[23:53:05] MaxSem: OK, syncing now [23:53:58] MaxSem: Krinkle lemme know when you want me to test [23:55:46] !log krinkle@tin Synchronized php-1.28.0-wmf.15/includes/skins/SkinTemplate.php: 05c82731c9831c465 (duration: 00m 49s) [23:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:56:39] !log krinkle@tin Synchronized php-1.28.0-wmf.15/includes/resourceloader/ResourceLoaderClientHtml.php: 653232f90605 (duration: 00m 48s) [23:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:57:00] ACKNOWLEDGEMENT - HTTPS-wmflabs on tools.wmflabs.org is CRITICAL: SSL CRITICAL - Certificate *.wmflabs.org valid until 2016-09-15 15:41:05 +0000 (expires in 29 days) daniel_zahn https://phabricator.wikimedia.org/T140647 [23:57:37] !log krinkle@tin Synchronized php-1.28.0-wmf.15/includes/OutputPage.php: 653232f90605 (duration: 00m 52s) [23:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master