[00:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161202T0000). Please do the needful.
[00:00:05] RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:10] \o
[00:00:14] !log bsitzmann@tin Finished deploy [mobileapps/deploy@b545699]: Update mobileapps to 04a6e84 (duration: 01m 17s)
[00:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:04:51] 06Operations, 10DBA, 10MediaWiki-General-or-Unknown, 13Patch-For-Review: img_metadata queries for PDF files saturates s4 slaves - https://phabricator.wikimedia.org/T147296#2840357 (10aaron) 05Open>03Resolved a:03aaron Makes sense.
[00:06:55] 06Operations, 10Mail, 10MediaWiki-Email, 10Wikimedia-General-or-Unknown, and 3 others: Email server's DMARC config prevents users from sending emails via Special:EmailUser - https://phabricator.wikimedia.org/T66795#2840365 (10Huji) I think the footer should be expanded to also say "by replying to this emai...
[00:07:37] 06Operations, 10ops-eqiad: scb1003, scb1004 exhibit temperature problems - https://phabricator.wikimedia.org/T150882#2840367 (10fgiunchedi) a:03Cmjohnson
[00:11:52] 06Operations, 06Discovery, 10Wikimedia-Apache-configuration, 07Mobile, 13Patch-For-Review: m.wikipedia.org incorrectly redirects to en.m.wikipedia.org - https://phabricator.wikimedia.org/T69015#2840368 (10Deskana) >>! In T69015#2840271, @debt wrote: > Is there a chance that this will actually be implemen...
[00:17:21] 06Operations, 07Tracking: Hardware Automation Workflow - Overall Tracking - https://phabricator.wikimedia.org/T116063#2840381 (10fgiunchedi)
[00:17:23] 06Operations: alternatives to racktables ? - https://phabricator.wikimedia.org/T84001#2840377 (10fgiunchedi) 05Open>03stalled Stalling until {T88424} has an outcome
[00:20:08] !log ebernhardson@tin Synchronized php-1.29.0-wmf.4/extensions/PageImages/includes/ApiQueryPageImages.php: T152155: Thumbnails are not showing in search on multiple platforms (duration: 00m 45s)
[00:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:20:20] T152155: Thumbnails are not showing in search on multiple platforms - https://phabricator.wikimedia.org/T152155
[00:25:56] (03CR) 10Aaron Schulz: [C: 032] Bump $wgJobBackoffThrottling for cache purges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324388 (owner: 10Aaron Schulz)
[00:26:01] (03PS2) 10Aaron Schulz: Bump $wgJobBackoffThrottling for cache purges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324388
[00:26:12] (03CR) 10Aaron Schulz: [C: 032] Bump $wgJobBackoffThrottling for cache purges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324388 (owner: 10Aaron Schulz)
[00:26:57] (03Merged) 10jenkins-bot: Bump $wgJobBackoffThrottling for cache purges [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324388 (owner: 10Aaron Schulz)
[00:30:19] !log aaron@tin Synchronized wmf-config/CommonSettings.php: Bump $wgJobBackoffThrottling for cache purges (duration: 00m 45s)
[00:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:02] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.333 second response time
[00:37:02] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.538 second response time
[00:41:02] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 1.707 second response time
[00:41:41] !log demon@tin Synchronized wmf-config: Removing some old ExtensionMessages files (duration: 00m 47s)
[00:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:42:00] (03PS1) 1020after4: phabricator: enable vcs and web user to run `git` and `ssh` via sudo [puppet] - 10https://gerrit.wikimedia.org/r/324841
[00:42:28] mutante: https://gerrit.wikimedia.org/r/#/c/324841/
[00:44:02] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 2.044 second response time
[00:46:22] (03CR) 10Paladox: [C: 031] phabricator: enable vcs and web user to run `git` and `ssh` via sudo [puppet] - 10https://gerrit.wikimedia.org/r/324841 (owner: 1020after4)
[00:47:00] !log ebernhardson@tin Synchronized php-1.29.0-wmf.4/extensions/PageImages/maintenance/initImageData.php: T152155: Maintenance script updates for re-initializing page images (duration: 00m 44s)
[00:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:13] T152155: Thumbnails are not showing in search on multiple platforms - https://phabricator.wikimedia.org/T152155
[01:00:47] (03Abandoned) 10Tim Landscheidt: Tools: Unpuppetize host_aliases [puppet] - 10https://gerrit.wikimedia.org/r/241582 (https://phabricator.wikimedia.org/T109485) (owner: 10Tim Landscheidt)
[01:04:32] @seen yurik
[01:04:32] bd808: Last time I saw yurik they were quitting the network with reason: Quit: will be back later. use g hangout if you need me N/A at 12/1/2016 10:44:41 PM (2h19m51s ago)
[01:23:12] PROBLEM - puppet last run on ms-fe1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:28:20] 06Operations, 06Discovery, 10Wikimedia-Apache-configuration, 07Mobile, 13Patch-For-Review: m.wikipedia.org incorrectly redirects to en.m.wikipedia.org - https://phabricator.wikimedia.org/T69015#730151 (10MaxSem) ``` ServerName m.wikipedia.<%= @domain_suffix %> ServerAlias zero....
[01:33:58] ostriches: can I get you to do a gerrit admin thing for me?
[01:35:12] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 1812.661916 Seconds
[01:35:12] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 1812.671514 Seconds
[01:35:41] kaldari: I could be persuaded...what's up?
[01:36:12] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 43.832433 Seconds
[01:36:12] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 43.845835 Seconds
[01:36:37] Could you add Samwilson and MusikAnimal to the PageAssessments ext group: https://gerrit.wikimedia.org/r/#/admin/groups/1124,members
[01:37:51] kaldari: yeah gimmie a min
[01:39:59] kaldari: {{done}}
[01:40:21] grazie!
[01:40:44] or gratzie
[01:41:00] yw :)
[01:41:28] kaldari: Bonus, I just leveled you up to Project And Group Creators so you can manage most groups now :)
[01:41:41] even better. Thanks!
[01:41:51] Kaldari gains +200 gerrit xp
[01:42:13] :🍄:
[01:51:12] RECOVERY - puppet last run on ms-fe1003 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[01:53:42] PROBLEM - puppet last run on mw1297 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:03:02] PROBLEM - https://phabricator.wikimedia.org on iridium is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string focus on bug not found on https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/ - 2471 bytes in 0.070 second response time
[02:03:22] PROBLEM - https://phabricator.wikimedia.org on phab2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string focus on bug not found on https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/ - 2453 bytes in 0.053 second response time
[02:03:38] wth
[02:04:19] phabricator works for me ... not sure what the issue is
[02:05:51] (03PS1) 1020after4: phabricator: cluster.addresses to whitelist iridium and phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324851 (https://phabricator.wikimedia.org/T137928)
[02:06:45] odd, is the text there? mentioned in the check
[02:06:53] sorry on mobile i can't check
[02:07:33] no issues for me either
[02:07:33] twentyafterfour: back, looking
[02:07:40] i bet the text just changed
[02:07:41] wfm
[02:09:01] ACKNOWLEDGEMENT - https://phabricator.wikimedia.org on phab2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string focus on bug not found on https://phabricator.wikimedia.org:443https://phabricator.wikimedia.org/ - 2452 bytes in 0.051 second response time 20after4 T137928
[02:11:24] so there are two things... phab2001 has the same check - it should use a different url and should be disabled (I acknowledged that one)
[02:11:30] but iridium is red too, so I don't get it
[02:11:37] the check is looking for text that hasn't changed
[02:12:06] ./check_http -S -H 'phabricator.wikimedia.org' -I misc-web-lb.wikimedia.org -u 'https://phabricator.wikimedia.org/' -s 'focus on bug'
[02:12:10] funny timing - I thought I broke something because I was right in the middle of configuring the clustered repos
[02:12:46] it's internal server error
[02:12:50] also without that string
[02:12:58] when removing the entire -s
[02:13:23] but the service is alive and well
[02:13:29] no 500 error for me
[02:14:12] PROBLEM - puppet last run on mw1201 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:15:21] the checkcommand did not change either
[02:15:49] i double checked if it was using any variable like the phabricator domain we moved to hiera
[02:15:52] nope
[02:16:23] [18:14:45] can anyone with IPv6 connection open Phabricator?
[02:16:23] [18:15:00] getting this: IP address "2601:*snip*" is not properly formatted. Expected an IP address like "23.45.67.89".
[02:16:27] from -dev
[02:16:41] ah ha!
[02:16:44] that's the problem
[02:16:45] oh
[02:16:53] I'm getting it too
[02:16:53] was there a change to that though?
[02:16:55] > IP address "2601:646:8301:6ad9:459d:7855:2a29:26bd" is not properly formatted. Expected an IP address like "23.45.67.89".
[02:17:14] mutante: I was working on the clustering stuff,
[02:17:48] https://gerrit.wikimedia.org/r/#/c/324851/
[02:17:49] same here
[02:18:08] > Exception IP address "2601:*" is not properly formatted. Expected an IP address like "23.45.67.89".
[02:18:44] twentyafterfour: ok, but that isn't merged yet and not v6 ?
[02:19:03] mutante: well I made that change on phab2001, not sure how it would affect iridium though
[02:19:13] and I don't get the 2601:* error?
[02:19:20] I get the error as well
[02:19:22] I never changed anything related to ipv6
[02:20:02] what was the change on phab2001 ?
[02:20:10] does the exception mention anything more than "not properly formatted" ?
[02:20:16] twentyafterfour: was there a phab update from upstream? Could be something silly in a utility function that changed upstream and not accounting for ipv6
[02:20:34] mutante: the same as the change in puppet
[02:20:45] twentyafterfour: no, no other info from the error
[02:21:12] Krinkle: no changes from upstream, I haven't even touched the code on iridium at all
[02:21:13] twentyafterfour: let's revert that?
[02:21:13] twentyafterfour: https://i.imgur.com/CuGjhQD.png
[02:21:46] fatal-config-template: Unhandled Exception
[02:21:47] nothing else on the page (Blank)
[02:22:25] I'm running out of battery here but it seems like reverting the change twentyafterfour? and that probably says it can't understand ipv6 and assumes ipv4 for some use case?
[02:22:30] the server doesn't have that IP address on an interface
[02:22:37] it must be just a phab config thing
[02:22:40] mutante: chasemp: no change to revert ..
[02:22:42] RECOVERY - puppet last run on mw1297 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[02:22:49] it was a local config change on phab2001
[02:22:55] I think it toggled something in the database
[02:23:00] oh boy
[02:23:00] I'm tracking it down now
[02:23:02] RECOVERY - https://phabricator.wikimedia.org on iridium is OK: HTTP OK: HTTP/1.1 200 OK - 27075 bytes in 0.161 second response time
[02:23:05] ok
[02:23:12] recovery?!
[02:23:22] RECOVERY - https://phabricator.wikimedia.org on phab2001 is OK: HTTP OK: HTTP/1.1 200 OK - 27075 bytes in 0.158 second response time
[02:23:29] wtf
[02:23:41] for both at once
[02:23:47] seems like def related to the cluster work
[02:23:56] ok I think I know what's happening
[02:24:14] I migrated one unused repository to the cluster
[02:24:29] which triggered some cluster-aware code to run which was previously dormant
[02:24:36] and upstream doesn't test with IPv6 apparently
[02:24:40] ok cool because I'm out of battery, good luck :)
[02:24:44] I found a stack trace
[02:24:49] thanks chase, no worries I've got this
[02:24:51] (yes I think that must be the deal re: ipv6)
[02:29:58] ok so it recovered when puppet reverted my config change on phab2001. Apparently this change gets propagated to the db somehow
[02:30:11] anyone still seeing the error?
[02:30:39] i don't, wfm
[02:30:46] and that's what i meant by revert, yea
[02:32:02] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 3.106 second response time
[02:33:02] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.881 second response time
[02:33:45] mutante: sorry I still was trying to figure out how a config change (not code change, not touching database, not touching iridium) could affect production.
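The root cause worked out above — a dormant, cluster-only code path whose address validator only understands dotted-quad IPv4, so every IPv6 visitor got the "Expected an IP address like \"23.45.67.89\"" exception — can be sketched in Python. The `ipv4_only_parse` function below is a hypothetical stand-in for the Phabricator-side check, not its real code; the stdlib `ipaddress` module shows the family-agnostic alternative:

```python
import ipaddress
import re


def ipv4_only_parse(addr: str) -> str:
    """Naive validator that, like the exception quoted in the log,
    only accepts dotted-quad IPv4 addresses (a hypothetical stand-in)."""
    if not re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", addr):
        raise ValueError(
            f'IP address "{addr}" is not properly formatted. '
            'Expected an IP address like "23.45.67.89".'
        )
    return addr


def parse_any(addr: str) -> str:
    """IPv6-aware alternative: ip_address() accepts both families."""
    return str(ipaddress.ip_address(addr))


# An IPv4 client passes either validator; the IPv6 client from the log
# only passes the family-agnostic one.
ipv4_only_parse("23.45.67.89")
parse_any("2601:646:8301:6ad9:459d:7855:2a29:26bd")
try:
    ipv4_only_parse("2601:646:8301:6ad9:459d:7855:2a29:26bd")
except ValueError as e:
    print(e)  # the kind of message IPv6 visitors saw
```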
[02:33:57] I actually still can't tell what caused it to affect iridium
[02:36:30] twentyafterfour: gotcha, is it trying to add that new service IP and associate the phabricator.wm name with that or something
[02:36:51] i did not expect we'd get into the actual clustering stuff yet
[02:37:00] thought we are just making it a warm standby first
[02:38:00] sure that everything's recovered, but I wonder what happened earlier when IPv6 visits were thrown unknown exceptions
[02:38:49] (for Phabricator)
[02:41:17] "Hosts in this list are allowed to bend (or even break) some of the security and policy rules when they make requests to other hosts in the cluster,"
[02:41:26] so it let phab2001 talk to iridium and change the config there i suppose
[02:42:12] RECOVERY - puppet last run on mw1201 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[02:52:28] mutante: I was working on clustering repositories so that it can be a warm standby
[02:52:33] I'm not trying to make it active
[02:52:55] twentyafterfour: aha, just for the repo sync then
[02:52:58] this phabricator clustering stuff is mostly untested outside of phacility I think
[02:53:02] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.526 second response time
[02:53:02] mutante: yes
[02:54:02] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.362 second response time
[02:54:12] PROBLEM - puppet last run on ms-fe1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:54:24] * twentyafterfour is following stack traces to see what code might have triggered it
[02:54:38] twentyafterfour: got it, *nod*
[03:00:29] https://github.com/phacility/phabricator/blob/master/src/aphront/AphrontRequest.php#L567
[03:00:36] what an odd way to check https
[03:00:48] * twentyafterfour had to run multiple tests to be sure what the behavior of that method would be
[03:01:09] e.g. if $_SERVER['HTTPS'] == true, what does that method return? :D
[03:04:56] twentyafterfour: true. because HTTPS was not empty (not set or falsey) and did not equal off
[03:05:19] isHTTPS: ( https && https !== 'off' )
[03:05:21] essentially
[03:21:12] RECOVERY - puppet last run on ms-fe1003 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[03:30:02] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 681.07 seconds
[03:33:02] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 147.00 seconds
[03:37:02] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:39:04] !log mw1293 - upgrade imagemagick to 8:6.8.9.9-5+deb8u6+wmf1
[03:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:04:52] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[04:16:52] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=672.90 Read Requests/Sec=387.60 Write Requests/Sec=11.60 KBytes Read/Sec=39054.00 KBytes_Written/Sec=64.40
[04:26:52] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=4.80 Read Requests/Sec=0.80 Write Requests/Sec=0.40 KBytes Read/Sec=26.40 KBytes_Written/Sec=3.60
[04:31:12] PROBLEM - puppet last run on db1041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:43:23] Krinkle: right, it just didn't read that way to me at first glance ;)
[04:43:41] and til `php -a`
[04:44:09] * twentyafterfour doesn't know how I missed that for so long
[04:51:47] twentyafterfour: I agree though it's not intuitively written
[04:52:20] too many catch-all methods, double negatives, counter-intuitive behaviour (strcmp), and bad method names
[04:52:27] yep
[04:52:37] and it's a class method when it should be static ;)
[04:53:01] I mean it should be static or it should not use a super-global, one or the other
[04:53:39] but this isn't the code-review-phabricator channel so I'll leave it at that
[04:59:12] RECOVERY - puppet last run on db1041 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[05:41:05] (03PS3) 10GWicke: Add fontconfig file for the pdf render service [puppet] - 10https://gerrit.wikimedia.org/r/324747
[05:42:01] (03CR) 10GWicke: "@Filippo, moved it to ~/.config/fontconfig/fonts.conf instead. A bit verbose with the need to create the parent directories, but so be it." [puppet] - 10https://gerrit.wikimedia.org/r/324747 (owner: 10GWicke)
[05:49:51] !log imagescalers - upgraded imagemagick 8:6.8.9.9-5+deb8u5+wmf1 -> 8:6.8.9.9-5+deb8u6+wmf1 (https://www.debian.org/security/2016/dsa-3726)
[05:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:12] !log thumbor - also got upgraded to imagemagick deb8u6+wmf1
[05:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:13] 06Operations, 10netops: asw2-d-eqiad.mgmt.eqiad - JNX_ALARMS CRITICAL - 2 red alarms, - https://phabricator.wikimedia.org/T152182#2840869 (10Dzahn)
[06:24:17] (03CR) 10Dzahn: WIP: phabricator refactor init.pp (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/324808 (owner: 1020after4)
[06:27:14] (03CR) 10Dzahn: [C: 04-1] "most of my comments are just nitpicks but there is also a typo in "class phabriccator::apache"" [puppet] - 10https://gerrit.wikimedia.org/r/324808 (owner: 1020after4)
[06:31:25] 06Operations, 06Commons, 10MediaWiki-File-management, 06Multimedia, 07Upstream: Issues with displaying thumbnails for CMYK JPG images due to buggy version of ImageMagick (black horizontal stripes, black color missing) - https://phabricator.wikimedia.org/T141739#2510108 (10Dzahn) imagescalers and thumbors...
[06:41:37] 06Operations, 06Commons, 10MediaWiki-File-management, 06Multimedia, 07Upstream: Issues with displaying thumbnails for CMYK JPG images due to buggy version of ImageMagick (black horizontal stripes, black color missing) - https://phabricator.wikimedia.org/T141739#2840888 (10Dzahn) {P4555}
[06:43:12] PROBLEM - puppet last run on labvirt1011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[htop]
[07:01:42] RECOVERY - Check systemd state on phab2001 is OK: OK - running: The system is fully operational
[07:04:42] PROBLEM - Check systemd state on phab2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:12:12] RECOVERY - puppet last run on labvirt1011 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[07:14:52] PROBLEM - puppet last run on phab2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:26:07] 06Operations, 10ops-codfw, 10DBA: db2041: Disk RAID predictive failure - https://phabricator.wikimedia.org/T151203#2840908 (10Marostegui) The RAID rebuilt correctly ``` root@db2041:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 001438031205DF0) Gen8 ServBP 12+2 a...
[07:26:15] 06Operations, 10ops-codfw, 10DBA: db2041: Disk RAID predictive failure - https://phabricator.wikimedia.org/T151203#2840909 (10Marostegui) 05Open>03Resolved
[07:31:52] RECOVERY - puppet last run on phab2001 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[07:44:22] (03PS1) 10Marostegui: db-eqiad.php: Depool db1071 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324861 (https://phabricator.wikimedia.org/T148967)
[07:46:10] (03PS2) 10Marostegui: db-eqiad.php: Depool db1071 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324861 (https://phabricator.wikimedia.org/T148967)
[07:48:11] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1071 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324861 (https://phabricator.wikimedia.org/T148967) (owner: 10Marostegui)
[07:48:48] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1071 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324861 (https://phabricator.wikimedia.org/T148967) (owner: 10Marostegui)
[07:51:56] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1071 - T148967 (duration: 02m 22s)
[07:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:07] T148967: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967
[07:59:50] !log Deploy alter table db1071 - dewiki.revision - T148967
[08:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:02] T148967: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967
[08:02:02] PROBLEM - Apache HTTP on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:02:32] PROBLEM - HHVM rendering on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:03:12] PROBLEM - HHVM rendering on mw1199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:03:12] PROBLEM - Apache HTTP on mw1199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:03:22] RECOVERY - HHVM rendering on mw1192 is OK: HTTP OK: HTTP/1.1 200 OK - 72853 bytes in 0.883 second response time
[08:03:22] PROBLEM - HHVM rendering on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:03:22] PROBLEM - HHVM rendering on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:03:52] PROBLEM - Apache HTTP on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:03:52] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:04:02] RECOVERY - HHVM rendering on mw1199 is OK: HTTP OK: HTTP/1.1 200 OK - 72854 bytes in 3.608 second response time
[08:04:02] RECOVERY - Apache HTTP on mw1199 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 2.431 second response time
[08:04:22] RECOVERY - HHVM rendering on mw1289 is OK: HTTP OK: HTTP/1.1 200 OK - 72855 bytes in 7.999 second response time
[08:04:22] PROBLEM - Apache HTTP on mw1195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:04:23] PROBLEM - HHVM rendering on mw1195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:04:42] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.354 second response time
[08:04:42] RECOVERY - Apache HTTP on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 1.323 second response time
[08:05:22] RECOVERY - Apache HTTP on mw1195 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 5.548 second response time
[08:05:22] RECOVERY - HHVM rendering on mw1195 is OK: HTTP OK: HTTP/1.1 200 OK - 72854 bytes in 3.804 second response time
[08:06:22] RECOVERY - HHVM rendering on mw1282 is OK: HTTP OK: HTTP/1.1 200 OK - 72855 bytes in 2.459 second response time
[08:06:32] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:06:52] PROBLEM - HHVM rendering on mw1202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:06:52] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:07:02] PROBLEM - HHVM rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:07:22] PROBLEM - HHVM rendering on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:07:22] PROBLEM - HHVM rendering on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:07:32] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
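Stepping back to the AphrontRequest::isHTTPS() puzzle worked through earlier (03:00-03:05 and 04:43-04:53): the semantics settled on there were "HTTPS is on whenever the server variable is non-empty in PHP's empty() sense and is not the literal string 'off'". A Python sketch of that behaviour as described in the conversation, not Phabricator's actual PHP source:

```python
def is_https(https_value) -> bool:
    """Mimics the check discussed in the log: the request counts as
    HTTPS when the HTTPS server variable is non-empty ("truthy") and
    is not the literal string 'off'.  PHP's empty() treats None, '',
    '0', 0 and False as empty; approximated here with a tuple test."""
    php_empty = https_value in (None, "", "0", 0, False)
    return (not php_empty) and https_value != "off"


# The same cases twentyafterfour tested interactively with `php -a`:
assert is_https("on") is True    # typical web-server value
assert is_https(True) is True    # $_SERVER['HTTPS'] == true -> true
assert is_https("off") is False  # explicitly disabled
assert is_https("") is False     # unset / empty
```

The double negative the reviewers complained about (not-empty and not-'off') is exactly what makes the original method hard to read at first glance.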
[08:07:52] RECOVERY - HHVM rendering on mw1202 is OK: HTTP OK: HTTP/1.1 200 OK - 72854 bytes in 7.869 second response time
[08:07:52] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[08:08:02] RECOVERY - HHVM rendering on mw1227 is OK: HTTP OK: HTTP/1.1 200 OK - 72855 bytes in 6.490 second response time
[08:08:02] PROBLEM - HHVM rendering on mw1193 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:08:22] PROBLEM - HHVM rendering on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:08:32] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy
[08:08:32] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:08:53] RECOVERY - Apache HTTP on mw1282 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 2.916 second response time
[08:09:02] RECOVERY - HHVM rendering on mw1193 is OK: HTTP OK: HTTP/1.1 200 OK - 72855 bytes in 8.048 second response time
[08:09:12] PROBLEM - Apache HTTP on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:09:12] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:09:12] RECOVERY - HHVM rendering on mw1289 is OK: HTTP OK: HTTP/1.1 200 OK - 72853 bytes in 0.288 second response time
[08:09:22] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:09:22] PROBLEM - HHVM rendering on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:09:32] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:09:32] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[08:10:12] RECOVERY - Apache HTTP on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 5.411 second response time
[08:10:12] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[08:10:12] RECOVERY - HHVM rendering on mw1286 is OK: HTTP OK: HTTP/1.1 200 OK - 72854 bytes in 0.141 second response time
[08:10:12] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 2.413 second response time
[08:10:22] RECOVERY - HHVM rendering on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 72854 bytes in 4.727 second response time
[08:10:22] RECOVERY - HHVM rendering on mw1282 is OK: HTTP OK: HTTP/1.1 200 OK - 72854 bytes in 9.036 second response time
[08:10:22] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[08:10:22] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 72855 bytes in 8.044 second response time
[08:10:42] PROBLEM - Apache HTTP on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:10:52] PROBLEM - Apache HTTP on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:11:32] RECOVERY - Apache HTTP on mw1286 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 3.845 second response time
[08:11:52] RECOVERY - Apache HTTP on mw1228 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 3.673 second response time
[08:12:22] PROBLEM - HHVM rendering on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:13:02] PROBLEM - HHVM rendering on mw1207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:13:22] PROBLEM - Apache HTTP on mw1195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:13:22] PROBLEM - HHVM rendering on mw1195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:13:53] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:14:02] RECOVERY - HHVM rendering on mw1207 is OK: HTTP OK: HTTP/1.1 200 OK - 72855 bytes in 6.768 second response time
[08:14:12] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:14:12] RECOVERY - HHVM rendering on mw1289 is OK: HTTP OK: HTTP/1.1 200 OK - 72854 bytes in 0.132 second response time
[08:14:42] PROBLEM - Apache HTTP on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:14:52] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[08:14:52] PROBLEM - Apache HTTP on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:15:02] PROBLEM - HHVM rendering on mw1193 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:15:12] RECOVERY - Apache HTTP on mw1195 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.163 second response time
[08:15:12] RECOVERY - HHVM rendering on mw1195 is OK: HTTP OK: HTTP/1.1 200 OK - 72852 bytes in 0.084 second response time
[08:15:32] PROBLEM - Apache HTTP on mw1193 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:15:32] RECOVERY - Apache HTTP on mw1286 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 3.533 second response time
[08:15:52] RECOVERY - Apache HTTP on mw1228 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 1.848 second response time
[08:16:02] RECOVERY - HHVM rendering on mw1193 is OK: HTTP OK: HTTP/1.1 200 OK - 72854 bytes in 6.848 second response time
[08:16:02] PROBLEM - Apache HTTP on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:16:02] PROBLEM - HHVM rendering on mw1231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:16:12] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[08:16:22] PROBLEM - HHVM rendering on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:16:23] PROBLEM - HHVM rendering on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:16:52] RECOVERY - HHVM rendering on mw1231 is OK: HTTP OK: HTTP/1.1 200 OK - 72853 bytes in 0.556 second response time
[08:16:52] PROBLEM - HHVM rendering on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:17:22] RECOVERY - HHVM rendering on mw1282 is OK: HTTP OK: HTTP/1.1 200 OK - 72854 bytes in 5.419 second response time
[08:17:32] PROBLEM - HHVM rendering on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:17:32] RECOVERY - Apache HTTP on mw1193 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.576 second response time
[08:17:32] PROBLEM - Apache HTTP on mw1207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:18:12] PROBLEM - HHVM rendering on mw1199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:18:12] PROBLEM - Apache HTTP on mw1199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:18:12] PROBLEM - Apache HTTP on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:18:22] PROBLEM - HHVM rendering on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:18:52] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:18:52] RECOVERY - Apache HTTP on mw1282 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.054 second response time
[08:18:52] PROBLEM - Apache HTTP on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:19:02] RECOVERY - Apache HTTP on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 4.052 second response time
[08:19:22] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:19:22] RECOVERY - HHVM rendering on mw1192 is OK: HTTP OK: HTTP/1.1 200 OK - 72855 bytes in 4.109 second response time
[08:19:32] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:20:02] RECOVERY - HHVM rendering on mw1199 is OK: HTTP OK: HTTP/1.1 200 OK - 72854 bytes in 0.165 second response time
[08:20:02] PROBLEM - HHVM rendering on mw1193 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:20:02] RECOVERY - Apache HTTP on mw1199 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.395 second response time
[08:20:12] PROBLEM - Apache HTTP on mw1191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:20:22] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 5.864 second response time
[08:20:22] RECOVERY - HHVM rendering on mw1286 is OK: HTTP OK: HTTP/1.1 200 OK - 72855 bytes in 7.012 second response time
[08:20:22] RECOVERY - Apache HTTP on mw1207 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.030 second response time
[08:20:23] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 72854 bytes in 5.139 second response time
[08:20:42] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 4.474 second response time
[08:20:52] PROBLEM - Apache HTTP on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:21:22] PROBLEM - HHVM rendering on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:21:52] RECOVERY - Apache HTTP on mw1228 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 5.730 second response time
[08:22:02] PROBLEM - HHVM rendering on mw1207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:22:02] RECOVERY - Apache HTTP on mw1191 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 3.228 second response time
[08:22:22] PROBLEM - Apache HTTP on mw1195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:22:32] ????
[08:22:32] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:22:52] PROBLEM - Apache HTTP on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:22:53] RECOVERY - HHVM rendering on mw1193 is OK: HTTP OK: HTTP/1.1 200 OK - 72854 bytes in 4.579 second response time
[08:22:53] RECOVERY - HHVM rendering on mw1229 is OK: HTTP OK: HTTP/1.1 200 OK - 72854 bytes in 9.567 second response time
[08:23:22] RECOVERY - Apache HTTP on mw1195 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 7.807 second response time
[08:23:32] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy
[08:23:32] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:23:42] RECOVERY - Apache HTTP on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 3.234 second response time
[08:24:02] PROBLEM - Apache HTTP on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:24:12] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:24:22] PROBLEM - HHVM rendering on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:24:22] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[08:24:32] PROBLEM - Apache HTTP on mw1207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:24:42] PROBLEM - Apache HTTP on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:24:42] RECOVERY - Apache HTTP on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 6.403 second response time
[08:24:52] PROBLEM - Apache HTTP on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:24:53] RECOVERY - Apache HTTP on mw1282 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.727 second response time
[08:24:53] RECOVERY - HHVM rendering on mw1207 is OK: HTTP OK: HTTP/1.1 200 OK - 72854 bytes in 0.230 second response time
[08:25:12] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[08:25:12] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:25:22] RECOVERY - HHVM rendering on mw1190 is OK: HTTP OK: HTTP/1.1 200 OK - 72855 bytes in 3.280 second response time
[08:25:22] PROBLEM - HHVM rendering on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:25:42] RECOVERY - Apache HTTP on mw1227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.308 second response time
[08:26:02] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[08:26:12] PROBLEM - HHVM rendering on mw1199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:26:12] PROBLEM - Apache HTTP on mw1199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:26:12] RECOVERY - HHVM rendering on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 72854 bytes in 1.498 second response time
[08:26:23] PROBLEM - Apache HTTP on mw1195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:26:23] PROBLEM - HHVM rendering on mw1195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:26:32] RECOVERY - Apache HTTP on mw1286 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.279 second response time
[08:27:02] PROBLEM - HHVM rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:27:12] PROBLEM - Apache HTTP on mw1191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:27:12] PROBLEM - HHVM rendering on mw1191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:27:23] RECOVERY - HHVM rendering on mw1195 is OK: HTTP OK: HTTP/1.1 200 OK - 72854 bytes in 8.710 second response time
[08:27:23] PROBLEM - HHVM rendering on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:27:32] PROBLEM - HHVM rendering on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:28:02] RECOVERY - Apache HTTP on mw1191 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 6.306 second response time
[08:28:03] RECOVERY - HHVM rendering on mw1191 is OK: HTTP OK: HTTP/1.1 200 OK - 72853 bytes in 0.301 second response time
[08:28:12] RECOVERY - HHVM rendering on mw1199 is OK: HTTP OK: HTTP/1.1 200 OK - 72854 bytes in 9.360 second response time
[08:28:12] RECOVERY - Apache HTTP on mw1199 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 7.441 second response time
[08:28:22] RECOVERY - HHVM rendering on mw1223 is OK: HTTP OK: HTTP/1.1 200 OK - 72855 bytes in 6.595 second response time
[08:28:22] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:28:23] PROBLEM - HHVM rendering on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:28:32] RECOVERY - Apache HTTP on mw1207 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 6.857 second response time
[08:28:52] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:28:52] PROBLEM - HHVM rendering on mw1202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:28:52] RECOVERY - HHVM rendering on mw1227 is OK: HTTP OK: HTTP/1.1 200 OK - 72854 bytes in 2.486 second response time
[08:29:02] PROBLEM - HHVM rendering on mw1207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:29:12] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:29:22] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 5.581 second response time
[08:29:22] RECOVERY - HHVM rendering on mw1286 is OK: HTTP OK: HTTP/1.1 200 OK - 72854 bytes in 5.776 second response time
[08:29:22] PROBLEM - HHVM rendering on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:29:22] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead returned the unexpected status 504 (expecting: 200)
[08:29:32] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:29:42] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.038 second response time
[08:29:42] PROBLEM - Apache HTTP on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:29:52] RECOVERY - HHVM rendering on mw1202 is OK: HTTP OK: HTTP/1.1 200 OK - 72854 bytes in 8.215 second response time
[08:29:58] ongoing outage for the api cluster, ops is working on it!
[08:30:02] PROBLEM - HHVM rendering on mw1193 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:30:22] RECOVERY - HHVM rendering on mw1192 is OK: HTTP OK: HTTP/1.1 200 OK - 72853 bytes in 0.700 second response time
[08:30:22] RECOVERY - HHVM rendering on mw1289 is OK: HTTP OK: HTTP/1.1 200 OK - 72855 bytes in 7.789 second response time
[08:30:22] PROBLEM - HHVM rendering on mw1195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:30:32] RECOVERY - Apache HTTP on mw1286 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.241 second response time
[08:30:52] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy
[08:30:52] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:31:12] RECOVERY - HHVM rendering on mw1282 is OK: HTTP OK: HTTP/1.1 200 OK - 72854 bytes in 0.228 second response time
[08:31:22] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 72852 bytes in 0.082 second response time
[08:31:52] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[08:31:53] RECOVERY - HHVM rendering on mw1207 is OK: HTTP OK: HTTP/1.1 200 OK - 72855 bytes in 2.140 second response time
[08:31:53] PROBLEM - HHVM rendering on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:32:02] PROBLEM - Apache HTTP on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:32:02] PROBLEM - Apache HTTP on mw1191 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50371 bytes in 0.002 second response time
[08:32:02] PROBLEM - HHVM rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:32:02] PROBLEM - HHVM rendering on mw1191 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50371 bytes in 0.003 second response time
[08:32:12] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[08:32:12] PROBLEM - HHVM rendering on mw1199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:32:12] PROBLEM - Apache HTTP on mw1199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:32:22] RECOVERY - Apache HTTP on mw1195 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 9.486 second response time
[08:32:22] PROBLEM - HHVM rendering on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:32:22] RECOVERY - HHVM rendering on mw1195 is OK: HTTP OK: HTTP/1.1 200 OK - 72855 bytes in 9.750 second response time
[08:32:32] PROBLEM - Apache HTTP on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:32:37] 06Operations, 06Discovery, 10Wikimedia-Apache-configuration, 07Mobile, 13Patch-For-Review: m.wikipedia.org incorrectly redirects to en.m.wikipedia.org - https://phabricator.wikimedia.org/T69015#2840986 (10Krenair) We could just split zero to a separate VHost?
[08:32:52] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:32:52] PROBLEM - Apache HTTP on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:32:52] RECOVERY - HHVM rendering on mw1229 is OK: HTTP OK: HTTP/1.1 200 OK - 72886 bytes in 5.842 second response time
[08:33:02] RECOVERY - Apache HTTP on mw1191 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.034 second response time
[08:33:03] RECOVERY - HHVM rendering on mw1191 is OK: HTTP OK: HTTP/1.1 200 OK - 72886 bytes in 0.183 second response time
[08:33:12] RECOVERY - HHVM rendering on mw1199 is OK: HTTP OK: HTTP/1.1 200 OK - 72886 bytes in 9.826 second response time
[08:33:12] RECOVERY - HHVM rendering on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 72885 bytes in 0.494 second response time
[08:33:12] RECOVERY - HHVM rendering on mw1190 is OK: HTTP OK: HTTP/1.1 200 OK - 72887 bytes in 1.186 second response time
[08:33:22] RECOVERY - Apache HTTP on mw1190 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.725 second response time
[08:33:22] PROBLEM - HHVM rendering on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:33:22] PROBLEM - HHVM rendering on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:34:02] RECOVERY - Apache HTTP on mw1199 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.031 second response time
[08:34:32] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:35:02] PROBLEM - HHVM rendering on mw1231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:35:02] PROBLEM - HHVM rendering on mw1207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:35:12] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:35:12] RECOVERY - HHVM rendering on mw1286 is OK: HTTP OK: HTTP/1.1 200 OK - 72887 bytes in 2.207 second response time
[08:35:22] RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 72886 bytes in 8.381 second response time
[08:35:22] PROBLEM - HHVM rendering on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:35:32] PROBLEM - HHVM rendering on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:35:32] PROBLEM - Apache HTTP on mw1207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:35:42] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.104 second response time
[08:35:42] RECOVERY - Apache HTTP on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 4.754 second response time
[08:35:52] RECOVERY - HHVM rendering on mw1193 is OK: HTTP OK: HTTP/1.1 200 OK - 72885 bytes in 0.123 second response time
[08:35:52] RECOVERY - Apache HTTP on mw1282 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 2.936 second response time
[08:35:52] RECOVERY - HHVM rendering on mw1227 is OK: HTTP OK: HTTP/1.1 200 OK - 72885 bytes in 0.733 second response time
[08:35:53] RECOVERY - HHVM rendering on mw1231 is OK: HTTP OK: HTTP/1.1 200 OK - 72886 bytes in 5.252 second response time
[08:35:53] RECOVERY - HHVM rendering on mw1207 is OK: HTTP OK: HTTP/1.1 200 OK - 72887 bytes in 4.383 second response time
[08:36:22] PROBLEM - HHVM rendering on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:36:32] PROBLEM - Apache HTTP on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:36:52] PROBLEM - Apache HTTP on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:37:12] RECOVERY - HHVM rendering on mw1282 is OK: HTTP OK: HTTP/1.1 200 OK - 72885 bytes in 0.764 second response time
[08:37:22] RECOVERY - HHVM rendering on mw1190 is OK: HTTP OK: HTTP/1.1 200 OK - 72886 bytes in 3.712 second response time
[08:37:22] RECOVERY - HHVM rendering on mw1192 is OK: HTTP OK: HTTP/1.1 200 OK - 72886 bytes in 0.703 second response time
[08:37:22] RECOVERY - Apache HTTP on mw1190 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 2.012 second response time
[08:37:32] RECOVERY - Apache HTTP on mw1207 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 9.255 second response time
[08:37:42] RECOVERY - Apache HTTP on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 2.471 second response time
[08:37:58] !log restarting hhvm (/usr/local/bin/restart-hhvm) on G@cluster:api_appserver and G@site:eqiad (batch 10%)
[08:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:22] PROBLEM - HHVM rendering on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:38:23] PROBLEM - HHVM rendering on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:38:32] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[08:38:42] PROBLEM - Apache HTTP on mw1289 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50371 bytes in 0.002 second response time
[08:38:42] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:38:52] PROBLEM - Apache HTTP on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:38:52] PROBLEM - Apache HTTP on mw1230 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50371 bytes in 0.002 second response time
[08:38:52] PROBLEM - HHVM rendering on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:39:02] PROBLEM - Apache HTTP on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:39:02] PROBLEM - HHVM rendering on mw1193 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:39:02] PROBLEM - HHVM rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:39:12] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[08:39:12] PROBLEM - HHVM rendering on mw1199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:39:12] PROBLEM - Apache HTTP on mw1199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:39:12] PROBLEM - HHVM rendering on mw1289 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50371 bytes in 0.002 second response time
[08:39:12] RECOVERY - HHVM rendering on mw1223 is OK: HTTP OK: HTTP/1.1 200 OK - 72886 bytes in 0.181 second response time
[08:39:32] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy
[08:39:42] RECOVERY - Apache HTTP on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.048 second response time
[08:39:43] PROBLEM - Apache HTTP on mw1228 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50371 bytes in 0.002 second response time
[08:39:52] PROBLEM - HHVM rendering on mw1231 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50371 bytes in 0.002 second response time
[08:39:52] RECOVERY - Apache HTTP on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.025 second response time
[08:39:52] RECOVERY - HHVM rendering on mw1193 is OK: HTTP OK: HTTP/1.1 200 OK - 72884 bytes in 0.081 second response time
[08:40:12] PROBLEM - HHVM rendering on mw1228 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50371 bytes in 0.002 second response time
[08:40:12] RECOVERY - HHVM rendering on mw1289 is OK: HTTP OK: HTTP/1.1 200 OK - 72886 bytes in 0.243 second response time
[08:40:22] PROBLEM - HHVM rendering on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:40:52] PROBLEM - Apache HTTP on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:40:52] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:40:52] PROBLEM - HHVM rendering on mw1207 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50371 bytes in 0.003 second response time
[08:41:02] RECOVERY - HHVM rendering on mw1227 is OK: HTTP OK: HTTP/1.1 200 OK - 72886 bytes in 7.017 second response time
[08:41:02] RECOVERY - Apache HTTP on mw1199 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.670 second response time
[08:41:12] RECOVERY - HHVM rendering on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 72886 bytes in 0.197 second response time
[08:41:22] PROBLEM - HHVM rendering on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:41:32] PROBLEM - Apache HTTP on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:41:42] RECOVERY - Apache HTTP on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 7.010 second response time
[08:41:43] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 6.077 second response time
[08:41:43] RECOVERY - Apache HTTP on mw1228 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.110 second response time
[08:41:43] RECOVERY - Apache HTTP on mw1227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 7.315 second response time
[08:41:52] RECOVERY - Apache HTTP on mw1282 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.053 second response time
[08:41:52] RECOVERY - HHVM rendering on mw1231 is OK: HTTP OK: HTTP/1.1 200 OK - 72886 bytes in 0.977 second response time
[08:41:52] RECOVERY - HHVM rendering on mw1207 is OK: HTTP OK: HTTP/1.1 200 OK - 72886 bytes in 0.163 second response time
[08:41:52] RECOVERY - HHVM rendering on mw1229 is OK: HTTP OK: HTTP/1.1 200 OK - 72886 bytes in 7.401 second response time
[08:42:02] RECOVERY - HHVM rendering on mw1199 is OK: HTTP OK: HTTP/1.1 200 OK - 72886 bytes in 0.126 second response time
[08:42:22] RECOVERY - HHVM rendering on mw1282 is OK: HTTP OK: HTTP/1.1 200 OK - 72887 bytes in 5.119 second response time
[08:42:22] RECOVERY - HHVM rendering on mw1286 is OK: HTTP OK: HTTP/1.1 200 OK - 72887 bytes in 5.179 second response time
[08:42:22] PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:42:32] PROBLEM - HHVM rendering on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:42:42] PROBLEM - Apache HTTP on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:43:01] !log oblivian@tin Synchronized php-1.29.0-wmf.4/api.php: API bandaid (duration: 00m 48s)
[08:43:02] PROBLEM - HHVM rendering on mw1193 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:22] PROBLEM - HHVM rendering on mw1192 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50371 bytes in 0.002 second response time
[08:43:22] RECOVERY - Apache HTTP on mw1190 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.043 second response time
[08:43:22] RECOVERY - HHVM rendering on mw1190 is OK: HTTP OK: HTTP/1.1 200 OK - 72886 bytes in 6.153 second response time
[08:43:32] RECOVERY - Apache HTTP on mw1286 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.085 second response time
[08:43:52] RECOVERY - HHVM rendering on mw1193 is OK: HTTP OK: HTTP/1.1 200 OK - 72886 bytes in 3.365 second response time
[08:45:12] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.186 second response time
[08:45:12] PROBLEM - HHVM rendering on mw1286 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50371 bytes in 0.002 second response time
[08:45:22] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 72884 bytes in 0.092 second response time
[08:45:22] RECOVERY - HHVM rendering on mw1192 is OK: HTTP OK: HTTP/1.1 200 OK - 72886 bytes in 0.224 second response time
[08:46:37] outage should be over now
[08:46:43] hhvm restarts are still ongoing
[08:47:12] RECOVERY - HHVM rendering on mw1286 is OK: HTTP OK: HTTP/1.1 200 OK - 72886 bytes in 0.383 second response time
[08:49:32] 06Operations, 06Parsing-Team, 06Release-Engineering-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2840996 (10Joe) So, this happened again this morning, and we have good and bad news: - Good news is the system, with a larger number of jemalloc arenas, too...
[08:49:59] (03CR) 10Jcrespo: [C: 04-1] "This needs deeper discussion, I think the load balancer behaviour is simplistic." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316732 (owner: 10Aaron Schulz)
[08:56:54] 06Operations, 10DBA: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188#2841012 (10jcrespo)
[08:57:49] (03PS1) 10Jcrespo: Depool db1076 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324862 (https://phabricator.wikimedia.org/T152188)
[09:00:49] (03CR) 10Jcrespo: [C: 032] Depool db1076 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324862 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo)
[09:03:36] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1076 (duration: 00m 48s)
[09:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:22] PROBLEM - ElasticSearch health check for shards on relforge1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 151 threshold =0.1% breach: status: red, number_of_nodes: 2, unassigned_shards: 145, number_of_pending_tasks: 11, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 91, task_max_waiting_in_queue_millis: 1987, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number:
[09:06:22] PROBLEM - ElasticSearch health check for shards on relforge1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 147 threshold =0.1% breach: status: red, number_of_nodes: 2, unassigned_shards: 141, number_of_pending_tasks: 13, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 95, task_max_waiting_in_queue_millis: 2777, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number:
[09:06:31] ^ it's me
[09:08:22] RECOVERY - ElasticSearch health check for shards on relforge1002 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 211, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 242, initial
[09:08:22] RECOVERY - ElasticSearch health check for shards on relforge1001 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 211, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 242, initial
[09:09:51] (03PS1) 10Marostegui: db-codfw.php: Depool db2045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324864 (https://phabricator.wikimedia.org/T150644)
[09:12:29] (03CR) 10Aaron Schulz: "Since warmCacheRatio defaults to 0, this will actually no-op as is (a prior weights)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316732 (owner: 10Aaron Schulz)
[09:13:27] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324864 (https://phabricator.wikimedia.org/T150644) (owner: 10Marostegui)
[09:14:02] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324864 (https://phabricator.wikimedia.org/T150644) (owner: 10Marostegui)
[09:15:48] (03PS1) 10Giuseppe Lavagetto: mediawiki::hhvm: bump up the number of malloc arenas [puppet] - 10https://gerrit.wikimedia.org/r/324866 (https://phabricator.wikimedia.org/T151702)
[09:16:23] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2045 - T150644 (duration: 00m 47s)
[09:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:36] T150644: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644
[09:17:47] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::hhvm: bump up the number of malloc arenas [puppet] - 10https://gerrit.wikimedia.org/r/324866 (https://phabricator.wikimedia.org/T151702) (owner: 10Giuseppe Lavagetto)
[09:18:03] !log Deploy alter table wikidatawiki.revision in db2045 -T150644
[09:18:08] !log mysql restart and upgrade for db1076 T152188
[09:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:25] T152188: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188
[09:24:58] (03PS1) 10Jcrespo: Repool db1076 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324867 (https://phabricator.wikimedia.org/T152188)
[09:25:45] (03PS2) 10Jcrespo: Repool db1076 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324867 (https://phabricator.wikimedia.org/T152188)
[09:28:27] paravoid, hi, around? We ran into a big mapnik issue yesterday - could you take a look? https://phabricator.wikimedia.org/T152131
[09:28:56] (03CR) 10Jcrespo: "Independently of what is the long term plan, the query cache should never be queried about this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316732 (owner: 10Aaron Schulz)
[09:28:59] paravoid, apparently no one knew (or has access to) the deb build server to check the logs
[09:29:57] <_joe_> yurik: define "no one"
[09:30:26] _joe_, gehel is sick, the rest of us don't have access :)
[09:40:15] (03CR) 10Jcrespo: "To be more constructive, uptime (the mysql parameter) would be a more reliable method, throttling connections in the first 1-2 hours." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316732 (owner: 10Aaron Schulz)
[09:51:11] (03CR) 10Jcrespo: [C: 032] Repool db1076 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324867 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo)
[09:52:49] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1076 with low load (duration: 00m 49s)
[09:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:14] (03PS1) 10Jcrespo: Depool db1074 for maintenance and upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324869 (https://phabricator.wikimedia.org/T152188)
[10:04:18] 06Operations, 06Discovery, 10Wikimedia-Apache-configuration, 07Mobile, 13Patch-For-Review: m.wikipedia.org incorrectly redirects to en.m.wikipedia.org - https://phabricator.wikimedia.org/T69015#2841133 (10Dereckson) a:05Dereckson>03None
[10:05:16] 06Operations, 06Discovery, 10Wikimedia-Apache-configuration, 07Mobile, 13Patch-For-Review: m.wikipedia.org incorrectly redirects to en.m.wikipedia.org - https://phabricator.wikimedia.org/T69015#730151 (10Dereckson) >>! In T69015#2840986, @Krenair wrote: > We could just split zero to a separate VHost? Ye...
[10:07:41] (03CR) 10Jcrespo: [C: 032] Depool db1074 for maintenance and upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324869 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo)
[10:09:02] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:09:39] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1074 (duration: 00m 45s)
[10:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:37] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2045" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324873
[10:25:12] (03PS2) 10Marostegui: Revert "db-codfw.php: Depool db2045" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324873
[10:26:21] (03PS1) 10Jcrespo: Repool db1076 with full load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324874 (https://phabricator.wikimedia.org/T152188)
[10:27:46] 06Operations, 06Parsing-Team, 06Release-Engineering-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841144 (10Joe) Looking at `api.log`, I found that requests as follows: - to euwiki - `action=parsoid-batch` - `batch-action=preprocess` have had an absurd...
[10:28:44] !log mysql restart and upgrade for db1074 T152188
[10:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:55] T152188: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188
[10:30:11] (03CR) 10Jcrespo: [C: 032] Repool db1076 with full load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324874 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo)
[10:30:48] !log Deploy alter table wikidatawiki.revision in db2052 -T150644
[10:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:59] T150644: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644
[10:31:22] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 111 probes of 416 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[10:31:26] (03Merged) 10jenkins-bot: Repool db1076 with full load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324874 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo)
[10:34:42] 06Operations, 10Traffic: Block hotlinking - https://phabricator.wikimedia.org/T152091#2841156 (10Gilles) Great points @valhallasw. It'd be nice to know how much it really costs, though, even if it's already priced into the current infrastructure. When I heard that it represented a large share of our traffic i...
[10:34:53] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1076 with full load (duration: 00m 44s)
[10:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:06] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2045" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324873 (owner: 10Marostegui)
[10:35:46] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2045" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324873 (owner: 10Marostegui)
[10:36:53] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2045 - T150644 (duration: 00m 44s)
[10:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:05] T150644: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644
[10:37:52] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[10:40:35] (03PS1) 10Elukey: Refactor the monitor namespace to include Statsd [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/324877 (https://phabricator.wikimedia.org/T152093)
[10:41:40] (03PS2) 10Elukey: Refactor the monitor namespace to include Statsd [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/324877 (https://phabricator.wikimedia.org/T152093)
[10:46:14] (03PS1) 10Jcrespo: Repool db1074 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324879 (https://phabricator.wikimedia.org/T152188)
[10:46:35] <_joe_> jouncebot: next
[10:46:35] In 75 hour(s) and
13 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161205T1400) [10:46:56] _joe_, I am doing some poolings/depoolings [10:47:03] does it bother you? [10:47:09] <_joe_> jynus: nope [10:47:22] <_joe_> I just added a patch to wmf4 manually [10:47:27] <_joe_> hashar: ^^ [10:47:56] s/bother/affect/ but I think you got it [10:48:11] o/ [10:48:18] <_joe_> yeah, as long as you stay in wmf-config, that's ok [10:48:26] yep [10:48:37] <_joe_> hashar: so, should I add it to the list of patches on tin? [10:49:11] 07Puppet, 10Beta-Cluster-Infrastructure: deployment-apertium01 puppet failing due to missing packages on trusty - https://phabricator.wikimedia.org/T147210#2841175 (10hashar) Apparently it is gone for real. deployment-apertium01 hasn't reappeared and does not show up in the Horizon interface. [10:49:29] patch of what ? [10:49:39] if you need a private patch, we put them in /srv/patches [10:49:49] so they will be reapplied on the next branch cut on Tuesday [10:50:45] (03CR) 10Jcrespo: [C: 032] Repool db1074 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324879 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo) [10:51:36] 06Operations, 06Parsing-Team, 06Release-Engineering-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841177 (10Joe) {F4939497} shows the rate of such requests [10:52:23] (03CR) 10Ema: [C: 031] Refactor the monitor namespace to include Statsd [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/324877 (https://phabricator.wikimedia.org/T152093) (owner: 10Elukey) [10:53:05] (03CR) 10Elukey: [C: 032] Refactor the monitor namespace to include Statsd [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/324877 (https://phabricator.wikimedia.org/T152093) (owner: 10Elukey) [10:56:22] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 1 probes of 416 (alerts on 19) - 
https://atlas.ripe.net/measurements/1791210/#!map [10:58:23] (03Merged) 10jenkins-bot: Repool db1074 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324879 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo) [10:59:01] (03PS1) 10Elukey: Add a separate parameter for the statsd port [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/324880 (https://phabricator.wikimedia.org/T152093) [11:00:18] (03Abandoned) 10Elukey: Add a separate parameter for the statsd port [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/324880 (https://phabricator.wikimedia.org/T152093) (owner: 10Elukey) [11:03:38] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Match cache headers between thumbor and mediawiki - https://phabricator.wikimedia.org/T150642#2841207 (10Gilles) [11:04:02] 06Operations, 06Performance-Team, 10Thumbor: Thumbor should handle "temp" thumbnail requests - https://phabricator.wikimedia.org/T151441#2841209 (10Gilles) [11:07:36] 06Operations, 06Performance-Team, 10Thumbor: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2841217 (10Gilles) [11:07:39] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Nginx time limit should be a bit higher than Thumbor subprocess time limit - https://phabricator.wikimedia.org/T151459#2841216 (10Gilles) 05Open>03Resolved [11:08:20] (03PS1) 10Marostegui: db-eqiad.php: Repool db1071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324881 (https://phabricator.wikimedia.org/T148967) [11:09:09] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324881 (https://phabricator.wikimedia.org/T148967) (owner: 10Marostegui) [11:09:44] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324881 (https://phabricator.wikimedia.org/T148967) (owner: 10Marostegui) [11:10:46] !log marostegui@tin Synchronized 
wmf-config/db-eqiad.php: Repool db1071 - T148967 (duration: 00m 45s) [11:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:57] T148967: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967 [11:11:38] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1074 with low load (duration: 00m 45s) [11:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:14] jynus: we might have deployed at the same time: ˜/logmsgbot 12:10> !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1071 - T148967 (duration: 00m 45s) ˜/logmsgbot 12:11> !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1074 with low load (duration: 00m 45s) [11:14:22] (03PS1) 10Elukey: Switch Varnishkafka monitoring from Ganglia to statsd [puppet] - 10https://gerrit.wikimedia.org/r/324883 (https://phabricator.wikimedia.org/T152093) [11:14:32] yeah, there is a lock I hit [11:14:36] so that is no problem [11:15:13] https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php -> my change didn't go thru though [11:15:23] did you rebase? [11:15:33] (03CR) 10jenkins-bot: [V: 04-1] Switch Varnishkafka monitoring from Ganglia to statsd [puppet] - 10https://gerrit.wikimedia.org/r/324883 (https://phabricator.wikimedia.org/T152093) (owner: 10Elukey) [11:16:06] no, looks like I didn't :( [11:16:09] I will do it now [11:16:14] and deploy a new one [11:16:15] :-) [11:16:56] no, I rebased [11:17:05] but you deployed the wrong file? 
[11:17:38] actually [11:17:43] no, I deployed db-eqiad.php [11:17:48] your change is not merged [11:18:14] https://gerrit.wikimedia.org/r/#/c/324881/ it said it was [11:18:25] merged on tin [11:18:38] it is merged on gerrit [11:18:48] but it is not on the log of merged items on tin [11:18:52] interesting [11:19:00] maybe I didn't merge it, that could be [11:19:02] let me check my history [11:19:22] I was in the wrong terminal when I did the rebase indeed (on my local one) [11:19:42] ok, i will do it [11:19:42] there was no db1071 on the log [11:20:05] yes you are right, I did a rebase on my local terminal and not in tin, and then switched to tin [11:20:24] now it looks good [11:20:35] (03PS2) 10Elukey: Switch Varnishkafka monitoring from Ganglia to statsd [puppet] - 10https://gerrit.wikimedia.org/r/324883 (https://phabricator.wikimedia.org/T152093) [11:21:25] (03CR) 10jenkins-bot: [V: 04-1] Switch Varnishkafka monitoring from Ganglia to statsd [puppet] - 10https://gerrit.wikimedia.org/r/324883 (https://phabricator.wikimedia.org/T152093) (owner: 10Elukey) [11:21:39] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1071 - T148967 (duration: 00m 44s) [11:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:50] T148967: Fix PK on S5 dewiki.revision - https://phabricator.wikimedia.org/T148967 [11:22:49] (03PS3) 10Elukey: Switch Varnishkafka monitoring from Ganglia to statsd [puppet] - 10https://gerrit.wikimedia.org/r/324883 (https://phabricator.wikimedia.org/T152093) [11:23:07] * elukey waits for jenkins [11:23:14] 06Operations, 06Parsing-Team, 06Release-Engineering-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841250 (10Joe) There was an edit of Modulu:Wikidata this morning about 20 minutes before the peak of requests from parsoid happened. Not sure if that's rela... 
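The mix-up above — a change merged in Gerrit but missing from the deploy host's staging tree because the rebase was run in a local checkout — can be modeled as a simple membership check. This is an illustrative Python sketch, not the real scap tooling; the helper name and log entries are hypothetical.

```python
# Hypothetical model of the check performed above: a change that is merged
# in Gerrit must also appear in the deploy host's (tin's) own merge log
# before a sync actually ships it. Data here is illustrative.

def change_is_staged(change_subject, staged_log):
    """Return True if the change subject appears in the staging host's log."""
    return any(change_subject in entry for entry in staged_log)

# Simulated merge log on tin: "Repool db1071" is missing because the
# rebase was accidentally run in a local checkout instead of on tin.
tin_log = [
    "Repool db1074 with low load after maintenance [mediawiki-config]",
    "Depool db1060 for maintenance [mediawiki-config]",
]

print(change_is_staged("Repool db1071", tin_log))  # False
```

Once the check fails, re-running the rebase on the deploy host and syncing again — as marostegui does at 11:21 — is the fix.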
[11:25:02] (03PS4) 10Elukey: Switch Varnishkafka monitoring from Ganglia to statsd [puppet] - 10https://gerrit.wikimedia.org/r/324883 (https://phabricator.wikimedia.org/T152093) [11:27:37] (03PS5) 10Elukey: Switch Varnishkafka monitoring from Ganglia to statsd [puppet] - 10https://gerrit.wikimedia.org/r/324883 (https://phabricator.wikimedia.org/T152093) [11:27:49] luckily the pcc helps me during fridays [11:36:13] (03PS1) 10Jcrespo: Depool db1060 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324885 (https://phabricator.wikimedia.org/T152188) [11:38:50] (03CR) 10Jcrespo: [C: 032] Depool db1060 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324885 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo) [11:39:42] !log Stop MySQL db1095 for maintenance - T150802 [11:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:53] T150802: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802 [11:40:47] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1060 (duration: 00m 45s) [11:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:05] 06Operations, 10Wikidata, 10Wikimedia-Extension-setup, 15User-Addshore: Deploy InterwikiSorting extension to production - https://phabricator.wikimedia.org/T150183#2841313 (10Addshore) [11:41:38] 06Operations, 06TCB-Team, 10Two-Column-Edit-Conflict-Merge, 15User-Addshore, 03WMDE-QWERTY-Team-Board: Deploy TwoColConflict extension to production - https://phabricator.wikimedia.org/T150184#2841314 (10Addshore) [11:41:59] 06Operations, 06TCB-Team, 10Two-Column-Edit-Conflict-Merge, 15User-Addshore, 03WMDE-QWERTY-Team-Board: Deploy TwoColConflict extension to production - https://phabricator.wikimedia.org/T150184#2776652 (10Addshore) [11:42:42] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1
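The `check_failover servers up 1 down 1` output above can be reduced to a pass/fail decision by parsing the up/down counts. A minimal sketch, assuming the plugin alerts whenever any backend is down (the real Icinga plugin's thresholds may differ):

```python
import re

def parse_failover(line):
    """Parse a 'check_failover servers up N down M' line into a status dict."""
    m = re.search(r"servers up (\d+) down (\d+)", line)
    if not m:
        raise ValueError("unrecognized check output")
    up, down = int(m.group(1)), int(m.group(2))
    # Assumed policy: any down backend is treated as CRITICAL.
    return {"up": up, "down": down, "critical": down > 0}

status = parse_failover("CRITICAL check_failover servers up 1 down 1")
print(status["critical"])  # True
```

This matches the log: the alert fires while labsdb1010 is under maintenance, which is why jcrespo later ACKs it rather than acting on it.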
[11:43:22] (03PS1) 10Jcrespo: Pool db1074 with full load after warmup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324886 (https://phabricator.wikimedia.org/T152188) [11:43:36] (03PS1) 10Elukey: Fix ganglia/statsd class namespace [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/324887 (https://phabricator.wikimedia.org/T152093) [11:43:51] the icinga issue is probably an expired downtime [11:43:59] (non-production affecting) [11:44:11] I was actually going to work on that today [11:46:56] (03CR) 10Elukey: [C: 032] Fix ganglia/statsd class namespace [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/324887 (https://phabricator.wikimedia.org/T152093) (owner: 10Elukey) [11:48:33] (03PS1) 10Jcrespo: Really depool db1060, unlike 2 patches ago [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324888 (https://phabricator.wikimedia.org/T152188) [11:48:45] ^and this is why checking processlist is a must [11:49:07] (03CR) 10Jcrespo: [C: 032] Pool db1074 with full load after warmup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324886 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo) [11:49:23] (03CR) 10Jcrespo: [C: 032] Really depool db1060, unlike 2 patches ago [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324888 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo) [11:49:54] (03Merged) 10jenkins-bot: Pool db1074 with full load after warmup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324886 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo) [11:49:59] (03Merged) 10jenkins-bot: Really depool db1060, unlike 2 patches ago [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324888 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo) [11:51:55] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Really depool db1060 && pool db1074 with full load after warmup (duration: 00m 44s) [11:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:36]
06Operations, 10DBA, 13Patch-For-Review: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188#2841327 (10jcrespo) a:03jcrespo [11:56:38] !log mysql restart for db1060 T152188 [11:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:50] T152188: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188 [11:56:58] (03PS6) 10Elukey: Switch Varnishkafka monitoring from Ganglia to statsd [puppet] - 10https://gerrit.wikimedia.org/r/324883 (https://phabricator.wikimedia.org/T152093) [12:00:43] (03PS1) 10Jcrespo: Remove views.pp from labsdb role, duplicate of labs::db::views [puppet] - 10https://gerrit.wikimedia.org/r/324889 [12:00:50] (03PS1) 10Elukey: Fix class dependency for ganglia/statsd monitoring [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/324890 (https://phabricator.wikimedia.org/T152093) [12:01:20] (03CR) 10Elukey: [C: 032] Fix class dependency for ganglia/statsd monitoring [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/324890 (https://phabricator.wikimedia.org/T152093) (owner: 10Elukey) [12:02:22] (03CR) 10Jcrespo: "~/puppet$ git grep 'role::labsdb'" [puppet] - 10https://gerrit.wikimedia.org/r/324889 (owner: 10Jcrespo) [12:02:24] (03PS7) 10Elukey: Switch Varnishkafka monitoring from Ganglia to statsd [puppet] - 10https://gerrit.wikimedia.org/r/324883 (https://phabricator.wikimedia.org/T152093) [12:05:02] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/4765/cp4004.ulsfo.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/324883 (https://phabricator.wikimedia.org/T152093) (owner: 10Elukey) [12:09:24] 06Operations, 06Security-Team: Production cluster can't access labs cluster - https://phabricator.wikimedia.org/T95714#2841366 (10mark) >>! In T95714#2207876, @Andrew wrote: > This ticket has a terrible, unclear title, and even after reading the ticket I'm not 100% sure what it's about. Agreed. :) > I'm pret... 
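The "Really depool db1060" follow-up above shows that a depool is only as good as the config that actually ships. Conceptually, db-eqiad.php maps replicas to load weights, and a host absent from the map receives no new queries. A minimal Python model of that idea, with illustrative weights rather than the real config:

```python
# Illustrative model of the depool idea behind db-eqiad.php: removing a
# replica from the section's load map stops new queries from reaching it.
# Host names and weights are taken from the log; the structure is a sketch.
section_loads = {
    "db1060": 100,
    "db1074": 200,
}

def depool(loads, host):
    """Drop a host from the load map so it gets no new traffic."""
    loads.pop(host, None)
    return loads

depool(section_loads, "db1060")
print("db1060" in section_loads)  # False
```

As jynus notes in the log, the config change alone is not proof: checking the server's processlist is what confirms traffic has actually drained before restarting MySQL.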
[12:09:29] (03PS1) 10Elukey: Fix statsd logster job name [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/324891 (https://phabricator.wikimedia.org/T152093) [12:10:05] (03CR) 10Marostegui: [C: 031] "Looks good: https://puppet-compiler.wmflabs.org/4768/" [puppet] - 10https://gerrit.wikimedia.org/r/324889 (owner: 10Jcrespo) [12:10:34] (03CR) 10Elukey: [C: 032] Fix statsd logster job name [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/324891 (https://phabricator.wikimedia.org/T152093) (owner: 10Elukey) [12:10:49] 06Operations, 06Commons, 10Monitoring, 10media-storage: Monitor [[Special:ListFiles]] for non 200 HTTP statuses in thumbnails - https://phabricator.wikimedia.org/T106937#2841371 (10mark) I'd say, let's set it up and see how much it costs. We can also vary the check frequency. [12:11:34] (03PS8) 10Elukey: Switch Varnishkafka monitoring from Ganglia to statsd [puppet] - 10https://gerrit.wikimedia.org/r/324883 (https://phabricator.wikimedia.org/T152093) [12:14:44] (03PS9) 10Elukey: Switch Varnishkafka monitoring from Ganglia to statsd [puppet] - 10https://gerrit.wikimedia.org/r/324883 (https://phabricator.wikimedia.org/T152093) [12:17:38] (03CR) 10Elukey: "PCC looks finally good!" [puppet] - 10https://gerrit.wikimedia.org/r/324883 (https://phabricator.wikimedia.org/T152093) (owner: 10Elukey) [12:19:25] 06Operations, 10DBA: install/deploy dbproxy1003 through dbproxy1011 - https://phabricator.wikimedia.org/T86958#2841391 (10jcrespo) 05Open>03Resolved a:03jcrespo This was done a long time ago, although more work is probably needed in the future. [12:19:35] 06Operations, 06Parsing-Team, 06Release-Engineering-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841394 (10Joe) This burst in traffic is, looking at parsoid logs, due to `reqId: 3ff62f51-cd11-4b44-98e4-6a6aa608b600` from ChangePropagation. I am unsure...
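Joe traces the burst to a single `reqId` in the parsoid logs. Pinpointing such a source can be sketched as counting log lines per request ID; the log format below is illustrative, not the real api.log layout.

```python
from collections import Counter

# Hedged sketch of the analysis described above: count API requests per
# request ID to spot a single upstream caller fanning out. The entries
# are invented; only the reqId value comes from the log.
log_lines = [
    "reqId=3ff62f51 action=parsoid-batch wiki=euwiki",
    "reqId=3ff62f51 action=parsoid-batch wiki=euwiki",
    "reqId=aaaa0000 action=query wiki=enwiki",
]

counts = Counter(line.split()[0] for line in log_lines)
busiest, n = counts.most_common(1)[0]
print(busiest, n)  # reqId=3ff62f51 2
```

A dominant request ID, as here, points at one upstream source (in the incident above, ChangePropagation) rather than organic traffic.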
[12:20:57] (03PS1) 10Jcrespo: Deploy HAProxy to the new labsdb proxies [puppet] - 10https://gerrit.wikimedia.org/r/324893 (https://phabricator.wikimedia.org/T141097) [12:25:13] (03PS1) 10Jcrespo: Repool db1060 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324894 (https://phabricator.wikimedia.org/T152188) [12:25:40] (03CR) 10Jcrespo: [C: 04-2] "Wait for replication catchup + buffer pool load" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324894 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo) [12:27:09] (03CR) 10Jcrespo: [C: 032] Deploy HAProxy to the new labsdb proxies [puppet] - 10https://gerrit.wikimedia.org/r/324893 (https://phabricator.wikimedia.org/T141097) (owner: 10Jcrespo) [12:30:00] ACKNOWLEDGEMENT - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Jcrespo labsdb1010 under maintenance [12:34:16] 06Operations, 06Parsing-Team, 06Release-Engineering-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2825098 (10mobrovac) >>! In T151702#2841250, @Joe wrote: > There was an edit of Modulu:Wikidata this morning about 20 minutes before the peak of requests fro... 
[12:35:39] ACKNOWLEDGEMENT - haproxy failover on dbproxy1011 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Jcrespo labsdb1010 under maintenance [12:45:39] (03CR) 10Mobrovac: [C: 04-1] Add fontconfig file for the pdf render service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/324747 (owner: 10GWicke) [13:07:23] (03CR) 10Jcrespo: [C: 032] Repool db1060 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324894 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo) [13:07:54] (03Merged) 10jenkins-bot: Repool db1060 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324894 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo) [13:13:18] PROBLEM - puppet last run on db1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:13:44] 06Operations, 06Parsing-Team, 06Release-Engineering-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841477 (10Reedy) I've indefinitely protected (to sysop) https://eu.wikipedia.org/wiki/Modulu:Wikidata for now, and left a message at https://eu.wikipedia.or... 
[13:16:37] (03PS1) 10Jcrespo: Enable new TLS certs on labsdb hosts [puppet] - 10https://gerrit.wikimedia.org/r/324899 (https://phabricator.wikimedia.org/T152194) [13:23:14] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1060 (duration: 00m 45s) [13:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:31] (03CR) 10Marostegui: [C: 031] Enable new TLS certs on labsdb hosts [puppet] - 10https://gerrit.wikimedia.org/r/324899 (https://phabricator.wikimedia.org/T152194) (owner: 10Jcrespo) [13:24:58] (03CR) 10Jcrespo: [C: 032] Enable new TLS certs on labsdb hosts [puppet] - 10https://gerrit.wikimedia.org/r/324899 (https://phabricator.wikimedia.org/T152194) (owner: 10Jcrespo) [13:33:57] (03PS1) 10Jcrespo: Update new labsdb configuration template [puppet] - 10https://gerrit.wikimedia.org/r/324900 (https://phabricator.wikimedia.org/T152194) [13:35:40] (03CR) 10Jcrespo: [C: 032] Update new labsdb configuration template [puppet] - 10https://gerrit.wikimedia.org/r/324900 (https://phabricator.wikimedia.org/T152194) (owner: 10Jcrespo) [13:42:18] RECOVERY - puppet last run on db1001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [13:45:12] !log Deploy alter table wikidatawiki.revision in db2059 -T150644 [13:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:25] T150644: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644 [13:48:24] (03CR) 10Mobrovac: "LGTM, but I'm not sure why you needed to revert the addition of the wiki in the first place." 
[puppet] - 10https://gerrit.wikimedia.org/r/324766 (https://phabricator.wikimedia.org/T151570) (owner: 10Alex Monk) [13:48:28] (03CR) 10Mobrovac: [C: 031] Revert "Revert "RESTBase configuration for fi.wikivoyage.org"" [puppet] - 10https://gerrit.wikimedia.org/r/324766 (https://phabricator.wikimedia.org/T151570) (owner: 10Alex Monk) [13:48:58] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:54:36] (03PS1) 10Jcrespo: Add dns alias for analytics and web requests labsdb service [dns] - 10https://gerrit.wikimedia.org/r/324905 (https://phabricator.wikimedia.org/T141097) [13:55:32] 06Operations, 10MediaWiki-extensions-InterwikiSorting, 10Wikidata, 10Wikimedia-Extension-setup, 15User-Addshore: Deploy InterwikiSorting extension to production - https://phabricator.wikimedia.org/T150183#2841499 (10aude) [13:56:59] (03PS2) 10Jcrespo: Add dns alias for analytics and web requests labsdb service [dns] - 10https://gerrit.wikimedia.org/r/324905 (https://phabricator.wikimedia.org/T141097) [13:57:27] (03CR) 10Jcrespo: "Let's discuss this today." 
[dns] - 10https://gerrit.wikimedia.org/r/324905 (https://phabricator.wikimedia.org/T141097) (owner: 10Jcrespo) [14:06:58] 06Operations, 10DBA, 13Patch-For-Review: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188#2841566 (10jcrespo) p:05Normal>03Low [14:11:50] 06Operations, 06TCB-Team, 10Two-Column-Edit-Conflict-Merge, 15User-Addshore, 03WMDE-QWERTY-Team-Board: Deploy TwoColConflict extension to production - https://phabricator.wikimedia.org/T150184#2841578 (10Tobi_WMDE_SW) [14:11:56] (03PS1) 10Jcrespo: mariadb: Update dbstores to use the latest TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/324908 (https://phabricator.wikimedia.org/T152188) [14:17:58] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [14:19:04] 06Operations, 10LDAP-Access-Requests, 06TCB-Team, 10Wikidata, 03WMDE-QWERTY-Team-Board: Add Andrew and Aleksey to ldap/wmde group - https://phabricator.wikimedia.org/T152088#2841581 (10Tobi_WMDE_SW) [14:27:18] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:27:25] (03PS1) 10Mobrovac: RESTBase: Add the Citoid host/port combo to the config [puppet] - 10https://gerrit.wikimedia.org/r/324911 (https://phabricator.wikimedia.org/T108646) [14:29:26] (03CR) 10Marostegui: [C: 031] mariadb: Update dbstores to use the latest TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/324908 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo) [14:33:26] !log Deploy alter table wikidatawiki.revision in db2066 - T150644 [14:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:39] T150644: Wikidatawiki revision table needs unification - https://phabricator.wikimedia.org/T150644 [14:34:48] 06Operations, 07Puppet, 07Epic, 07Need-volunteer, 13Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#2841598 (10jcrespo) [14:35:20] 06Operations, 10ops-eqiad, 10netops: asw2-d-eqiad.mgmt.eqiad - JNX_ALARMS CRITICAL - 2 red alarms, - https://phabricator.wikimedia.org/T152182#2841599 (10faidon) Thanks @Dzahn. The alerts seem to be: ``` 2016-12-01 16:04:45 UTC Major FPC 1 PEM 0 is not powered 2016-12-01 16:04:41 UTC Major Management Eth... [14:36:54] (03CR) 10Jcrespo: [C: 032] mariadb: Update dbstores to use the latest TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/324908 (https://phabricator.wikimedia.org/T152188) (owner: 10Jcrespo) [14:37:23] (03PS1) 10Jcrespo: [WIP] Starting to cleanup mariadb templating structure [puppet] - 10https://gerrit.wikimedia.org/r/324915 (https://phabricator.wikimedia.org/T93645) [14:37:33] (03CR) 10Rush: [C: 031] "sure makes sense. We have the ability to invalid caching at a few layers for Labs in case of failover to shepherd this along too." 
[dns] - 10https://gerrit.wikimedia.org/r/324905 (https://phabricator.wikimedia.org/T141097) (owner: 10Jcrespo) [14:41:09] ^he connected [14:41:49] \o/ [14:43:13] (03PS1) 10Cmjohnson: Adding production dns for restbase1016-18 T150964 [dns] - 10https://gerrit.wikimedia.org/r/324917 [14:43:24] 06Operations, 10Cassandra, 10RESTBase, 06Services (doing): RESTBase k-r-v as Cassandra anti-pattern (or: revision retention policies considered harmful) - https://phabricator.wikimedia.org/T144431#2841618 (10faidon) > Beyond the explicitly stated objectives for RESTBase has been a mandate to store (more or... [14:43:37] (03PS1) 10Aude: Update interwiki map for fiwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324918 (https://phabricator.wikimedia.org/T152201) [14:44:37] !log oblivian@tin Synchronized php-1.29.0-wmf.4/api.php: Reverting the API block after template has been protected (duration: 00m 45s) [14:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:11] Krenair: around? [14:45:14] or addshore ? [14:45:50] hey! [14:45:57] https://gerrit.wikimedia.org/r/#/c/324918/ needs to be deployed [14:46:09] i would do it now, but suppose it can wait for swat or krenair to do it [14:46:25] (03CR) 10Giuseppe Lavagetto: "Change seems ok but:" [puppet] - 10https://gerrit.wikimedia.org/r/324911 (https://phabricator.wikimedia.org/T108646) (owner: 10Mobrovac) [14:46:27] I'm just about to step out of the door! [14:46:29] ok [14:46:39] * aude sees when swat is [14:47:08] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
[14:47:30] or it's friday [14:47:32] so nevermind [14:47:43] :D [14:47:54] but someone could probably deploy this [14:48:15] it's there if someone wants, but i'm not back until later [14:48:49] 06Operations, 10Citoid, 10RESTBase, 10RESTBase-API, and 4 others: Set-up Citoid behind RESTBase - https://phabricator.wikimedia.org/T108646#2841635 (10mobrovac) >>! In T108646#2761433, @Pchelolo wrote: > Moving to blocked until the `basefields` question is resolved. We have decided on the PR that we shall... [14:49:48] (03CR) 10Mobrovac: "It's not blocked in reality, we discussed the blocking issues on the PR in question. I also added the Operations tag to the ticket. What d" [puppet] - 10https://gerrit.wikimedia.org/r/324911 (https://phabricator.wikimedia.org/T108646) (owner: 10Mobrovac) [14:52:50] 06Operations, 10Citoid, 10RESTBase, 10RESTBase-API, and 5 others: Set-up Citoid behind RESTBase - https://phabricator.wikimedia.org/T108646#2841648 (10faidon) Citoid is already fronted by our #Traffic infrastructure (Varnish), which is obviously a layer capable of caching, with cache hits there being obvio... 
[14:53:51] (03CR) 10Giuseppe Lavagetto: [C: 032] "while this specific patchset is not an issue in itself (making RB able to serve requests to citoid is uncontroversial), adding restbase in" [puppet] - 10https://gerrit.wikimedia.org/r/324911 (https://phabricator.wikimedia.org/T108646) (owner: 10Mobrovac) [14:54:51] (03PS2) 10Giuseppe Lavagetto: RESTBase: Add the Citoid host/port combo to the config [puppet] - 10https://gerrit.wikimedia.org/r/324911 (https://phabricator.wikimedia.org/T108646) (owner: 10Mobrovac) [14:55:18] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [14:58:03] 06Operations, 06Parsing-Team, 06Release-Engineering-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841655 (10Joe) I removed the bandaid right now, hoping we didn't miss the origin of the issue. I would still like the concurrency limit of Change Propagati... [15:02:28] <_joe_> !log rolling restart of API appservers to catch up with the new jemalloc arenas config T151702 [15:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:40] T151702: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702 [15:09:41] (03PS10) 10Andrew Bogott: Keystone: open up firewall to allow labs access to keystone API [puppet] - 10https://gerrit.wikimedia.org/r/320787 (https://phabricator.wikimedia.org/T150092) [15:10:22] (03CR) 10Cmjohnson: [C: 032] Adding production dns for restbase1016-18 T150964 [dns] - 10https://gerrit.wikimedia.org/r/324917 (owner: 10Cmjohnson) [15:11:17] (03CR) 10Andrew Bogott: [C: 032] Keystone: open up firewall to allow labs access to keystone API [puppet] - 10https://gerrit.wikimedia.org/r/320787 (https://phabricator.wikimedia.org/T150092) (owner: 10Andrew Bogott) [15:11:47] (03PS3) 10Giuseppe Lavagetto: RESTBase: Add the Citoid host/port combo to the config [puppet] - 10https://gerrit.wikimedia.org/r/324911 
(https://phabricator.wikimedia.org/T108646) (owner: 10Mobrovac) [15:11:49] (03PS1) 10Gilles: Upgrade to 0.1.30 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/324919 (https://phabricator.wikimedia.org/T150758) [15:11:51] (03CR) 10Giuseppe Lavagetto: [V: 032] RESTBase: Add the Citoid host/port combo to the config [puppet] - 10https://gerrit.wikimedia.org/r/324911 (https://phabricator.wikimedia.org/T108646) (owner: 10Mobrovac) [15:12:18] <_joe_> ok to merge your changes jynus andrewbogott [15:12:19] <_joe_> ? [15:12:32] _joe_: yes please [15:12:39] yes [15:12:48] <_joe_> done [15:13:08] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [15:16:44] (03Abandoned) 10Elukey: Add extra compiler warnings to the Makefile [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/314662 (https://phabricator.wikimedia.org/T147436) (owner: 10Elukey) [15:16:50] (03Abandoned) 10Elukey: Move definitions to header files for a better code readability [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/322256 (https://phabricator.wikimedia.org/T147440) (owner: 10Elukey) [15:16:57] (03Abandoned) 10Elukey: Refactor the parsing functions out of the main C file [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/322257 (https://phabricator.wikimedia.org/T147440) (owner: 10Elukey) [15:17:08] PROBLEM - Check whether ferm is active by checking the default input chain on labcontrol1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [15:21:44] 06Operations, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: Ganglia varnishkafka python module crashing repeatedly - https://phabricator.wikimedia.org/T152093#2841735 (10elukey) a:03elukey [15:21:58] PROBLEM - puppet last run on restbase1009 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:24:18] PROBLEM - Check whether ferm is active by checking the default input chain on labcontrol1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [15:26:27] <_joe_> subbu: the latest findings on T151702 might be of interest to you [15:26:27] T151702: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702 [15:27:05] <_joe_> subbu: the TL;DR is that parsoid bombs the mediawiki API with a crapton requests, but it's basically proxying the requests it gets from restbase/changeprop [15:27:18] (03PS4) 10Andrew Bogott: Labs: Add observerenv.sh, helper script for read-only creds [puppet] - 10https://gerrit.wikimedia.org/r/320830 (https://phabricator.wikimedia.org/T150092) [15:27:20] (03PS1) 10Andrew Bogott: Keystone: fix formatting of ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/324921 [15:27:29] 06Operations, 06Parsing-Team, 06Release-Engineering-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841758 (10GWicke) Traditionally, a big issue causing work amplification has been a lack of reliable request timeout support in the MediaWiki API, which is t... [15:27:30] <_joe_> so there is no strange behaviour on parsoid's part that I could observe [15:27:57] _joe_, i responded on the ticket earlier this morning. [15:28:05] thanks for the heads up though. [15:28:08] <_joe_> sorry, didn't see it [15:28:12] <_joe_> thanks! [15:28:47] (03CR) 10Andrew Bogott: [C: 032] Keystone: fix formatting of ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/324921 (owner: 10Andrew Bogott) [15:30:03] 06Operations, 06Parsing-Team, 06Release-Engineering-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841772 (10Joe) >>! In T151702#2841758, @GWicke wrote: > Traditionally, a big issue causing work amplification has been a lack of reliable request timeout su... 
[15:30:08] RECOVERY - Check whether ferm is active by checking the default input chain on labcontrol1001 is OK: OK ferm input default policy is set [15:30:20] Is there /not/ a user that we can just block to avoid that T151702 gets worse? [15:31:31] 06Operations, 06Parsing-Team, 06Release-Engineering-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841774 (10GWicke) > Except in this specific case changeprop/restbase fire out 23K requests for a specific transclusion in the span of less than one minute... [15:33:44] Elitre, Reedy has already protected that template. [15:34:40] sure, I just wondered if there's a person behind who has no idea what they are doing - they could keep doing damage in other ways :) [15:34:56] _joe_: on wednesday you !log-ged some HHVM upgrades. what version are we running now? [15:35:11] (03CR) 10Andrew Bogott: [C: 032] Labs: Add observerenv.sh, helper script for read-only creds [puppet] - 10https://gerrit.wikimedia.org/r/320830 (https://phabricator.wikimedia.org/T150092) (owner: 10Andrew Bogott) [15:35:16] <_joe_> MatmaRex: 3.12.7~wmf4 [15:35:20] _joe_: not 3.12.11 yet? 
i'm asking for https://phabricator.wikimedia.org/T148606 [15:35:21] ah, okay [15:35:49] <_joe_> MatmaRex: it's mostly 3.12.11 though [15:35:49] (03PS4) 10GWicke: Add fontconfig file for the pdf render service [puppet] - 10https://gerrit.wikimedia.org/r/324747 [15:36:23] _joe_: it looks like not entirely 3.12.11, because those files still don't want to thumbnail :) [15:36:43] <_joe_> MatmaRex: yeah, sorry, I literally had zero time for looking at that [15:36:57] (03CR) 10jenkins-bot: [V: 04-1] Add fontconfig file for the pdf render service [puppet] - 10https://gerrit.wikimedia.org/r/324747 (owner: 10GWicke) [15:37:01] yeah, i know [15:37:13] it's not high priority, i was just wondering [15:38:06] 06Operations, 06Parsing-Team, 06Release-Engineering-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841784 (10ssastry) >>! In T151702#2841774, @GWicke wrote: >> Except in this specific case changeprop/restbase fire out 23K requests for a specific transc... [15:38:09] !log mobrovac@tin Starting deploy [changeprop/deploy@8f53dc6]: (no message) [15:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:25] (03PS5) 10GWicke: Add fontconfig file for the pdf render service [puppet] - 10https://gerrit.wikimedia.org/r/324747 [15:39:02] !log mobrovac@tin Finished deploy [changeprop/deploy@8f53dc6]: (no message) (duration: 00m 54s) [15:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:20] 06Operations, 06Parsing-Team, 06Release-Engineering-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841785 (10Joe) Well, from the MediaWiki perspective, those requests come from parsoid. From the parsoid perspective, they come from ChangePropagation via r
[15:40:02] (03CR) 10Mobrovac: [C: 031] Add fontconfig file for the pdf render service [puppet] - 10https://gerrit.wikimedia.org/r/324747 (owner: 10GWicke) [15:45:20] (03PS1) 10Mobrovac: RESTBase: Fix the Citoid URI for BetaCluster [puppet] - 10https://gerrit.wikimedia.org/r/324926 (https://phabricator.wikimedia.org/T108646) [15:47:58] RECOVERY - puppet last run on restbase1009 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [15:50:26] (03CR) 10Elukey: [C: 032] RESTBase: Fix the Citoid URI for BetaCluster [puppet] - 10https://gerrit.wikimedia.org/r/324926 (https://phabricator.wikimedia.org/T108646) (owner: 10Mobrovac) [15:51:18] RECOVERY - Check whether ferm is active by checking the default input chain on labcontrol1002 is OK: OK ferm input default policy is set [15:53:28] PROBLEM - restbase endpoints health on xenon is CRITICAL: Generic error: NoneType object has no attribute __getitem__ [15:53:36] that's me ^ [15:55:08] PROBLEM - restbase endpoints health on praseodymium is CRITICAL: Generic error: NoneType object has no attribute __getitem__ [15:55:40] 06Operations, 06Parsing-Team, 06Release-Engineering-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841818 (10GWicke) > So, he is saying the originator of the high concurrency rate is CP which is why I added my comment earlier about spreading out CP's requ... [16:00:18] PROBLEM - puppet last run on analytics1045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:01:36] 06Operations, 06Parsing-Team, 06Release-Engineering-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841822 (10ssastry) >>! In T151702#2841818, @GWicke wrote: >> So, he is saying the originator of the high concurrency rate is CP which is why I added my comm... 
[16:04:16] (03PS1) 10Cmjohnson: Adding dhcpd entries for restbase1016-18 T150964 [puppet] - 10https://gerrit.wikimedia.org/r/324927 [16:04:19] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [16:05:32] (03CR) 10Jcrespo: [C: 032] Add dns alias for analytics and web requests labsdb service [dns] - 10https://gerrit.wikimedia.org/r/324905 (https://phabricator.wikimedia.org/T141097) (owner: 10Jcrespo) [16:05:40] (03PS3) 10Jcrespo: Add dns alias for analytics and web requests labsdb service [dns] - 10https://gerrit.wikimedia.org/r/324905 (https://phabricator.wikimedia.org/T141097) [16:05:50] (03CR) 10Cmjohnson: [C: 032] Adding dhcpd entries for restbase1016-18 T150964 [puppet] - 10https://gerrit.wikimedia.org/r/324927 (owner: 10Cmjohnson) [16:06:48] PROBLEM - puppet last run on ganeti1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:08:52] (03CR) 10Paladox: [C: 031] "This should stop ipv6 from being used :)" [puppet] - 10https://gerrit.wikimedia.org/r/324851 (https://phabricator.wikimedia.org/T137928) (owner: 1020after4) [16:08:57] 06Operations, 06Parsing-Team, 06Release-Engineering-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841841 (10GWicke) [16:10:01] 06Operations, 06Parsing-Team, 06Release-Engineering-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841846 (10Paladox) Users are reporting problems with watchlist in #wikipedia-en [16:11:01] (03PS1) 10Andrew Bogott: Keystone hook: Change project id to == project name [puppet] - 10https://gerrit.wikimedia.org/r/324928 (https://phabricator.wikimedia.org/T150091) [16:11:22] 06Operations, 06Parsing-Team, 06Release-Engineering-Team, 07HHVM, and 3 others: API cluster failure / OOM - 
https://phabricator.wikimedia.org/T151702#2841850 (10Paladox) Users have also reported it on-wiki see https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Several_technical_problems please. [16:14:05] 06Operations, 06Parsing-Team, 06Release-Engineering-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841851 (10GWicke) To illustrate using [the RESTBase dashboard for the outage time frame](https://grafana.wikimedia.org/dashboard/db/restbase?from=1480204640... [16:16:28] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [16:19:28] PROBLEM - restbase endpoints health on xenon is CRITICAL: Generic error: NoneType object has no attribute __getitem__ [16:26:00] 06Operations, 10Citoid, 10RESTBase, 10RESTBase-API, and 5 others: Set-up Citoid behind RESTBase - https://phabricator.wikimedia.org/T108646#2841876 (10GWicke) @faidon: Request rates are very low, and have a wide spread. By the time a given URL or DOI is requested again, the response will very likely have f... [16:26:13] 06Operations, 06Parsing-Team, 06Release-Engineering-Team, 07HHVM, and 3 others: API cluster failure / OOM - https://phabricator.wikimedia.org/T151702#2841877 (10matmarex) @paladox That is definitely a separate issue, not related. [16:26:28] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [16:27:48] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:29:18] RECOVERY - puppet last run on analytics1045 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [16:29:56] 06Operations, 10Cassandra, 10RESTBase, 06Services (doing): RESTBase k-r-v as Cassandra anti-pattern (or: revision retention policies considered harmful) - https://phabricator.wikimedia.org/T144431#2841895 (10elukey) Would this be a good topic to discuss during the upcoming Dev Summit? https://phabricator... 
[16:30:43] 07Puppet: realm.pp: "Data retrieved from Toolsbeta is String not Hash" if not defined in Hiera - https://phabricator.wikimedia.org/T152142#2839838 (10scfc) For me, this error disappeared after a few Puppet runs, i. e. time-based. I'll try `puppet agent -d -t` the next time to see if that gives any helpful infor... [16:31:28] PROBLEM - restbase endpoints health on xenon is CRITICAL: Generic error: NoneType object has no attribute __getitem__ [16:32:08] PROBLEM - MegaRAID on db1033 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [16:32:10] ACKNOWLEDGEMENT - MegaRAID on db1033 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T152214 [16:32:13] 06Operations, 10ops-eqiad: Degraded RAID on db1033 - https://phabricator.wikimedia.org/T152214#2841899 (10ops-monitoring-bot) [16:32:18] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [16:32:37] 06Operations, 10ops-eqiad: Degraded RAID on db1033 - https://phabricator.wikimedia.org/T152214#2841903 (10Marostegui) p:05Triage>03Normal [16:34:32] 07Puppet: realm.pp: "Data retrieved from Toolsbeta is String not Hash" if not defined in Hiera - https://phabricator.wikimedia.org/T152142#2841915 (10scfc) Ha, hybris bit me. The error just recurred, and `-d` is not helpful: ``` scfc@toolsbeta-exec-1401:~$ sudo puppet agent -d -d -d -t 2>&1 | tee /tmp/puppet.... [16:35:48] RECOVERY - puppet last run on ganeti1003 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [16:35:49] 06Operations, 10ops-eqiad: Degraded RAID on db1033 - https://phabricator.wikimedia.org/T152214#2841899 (10Marostegui) This is indeed degraded ``` root@db1033:~# megacli -LDInfo -L0 -a0 Adapter 0 -- Virtual Drive Information: Virtual Drive: 0 (Target Id: 0) Name : RAID Level : Primary...
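The db1033 MegaRAID alert above ("1 failed LD(s) (Degraded)") comes from a check that parses `megacli -LDInfo` output, as in Marostegui's paste. A hypothetical sketch of that parsing step, using made-up sample text (this is not real db1033 output, and the production Icinga RAID handler may work differently):

```python
# Illustrative sketch only: count degraded logical drives in
# megacli-style "-LDInfo" output. SAMPLE is made up, not captured
# from db1033; the real RAID check may differ.
SAMPLE = """\
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
State               : Degraded
Number Of Drives    : 2
"""

def count_degraded(ldinfo_text):
    """Count 'State' lines that report a Degraded virtual drive."""
    return sum(
        1
        for line in ldinfo_text.splitlines()
        if line.startswith("State") and "Degraded" in line
    )

failed = count_degraded(SAMPLE)
if failed:
    print("CRITICAL: %d failed LD(s) (Degraded)" % failed)
else:
    print("OK")
```

With the sample above this reports one degraded logical drive, matching the shape of the Icinga message in the log.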
[16:37:25] (03PS1) 10Jcrespo: prometheus: Remove labsdb1008- it was renamed to db1095 [puppet] - 10https://gerrit.wikimedia.org/r/324930 [16:40:29] 07Puppet, 06Labs, 10Labs-Infrastructure: realm.pp: "Data retrieved from Toolsbeta is String not Hash" if not defined in Hiera - https://phabricator.wikimedia.org/T152142#2841923 (10scfc) The error message appears in: ``` [tim@passepartout ~/src/operations/puppet]$ git grep 'Reading data from' modules/admin/... [16:46:25] 06Operations, 06Labs, 10Labs-Infrastructure, 10netops, 10wikitech.wikimedia.org: Move novaobserver (and novaadmin) users out of ldap - https://phabricator.wikimedia.org/T152215#2841937 (10Andrew) [16:46:28] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [16:47:08] RECOVERY - restbase endpoints health on praseodymium is OK: All endpoints are healthy [16:47:09] 06Operations, 10Cassandra, 10RESTBase, 06Services (doing): RESTBase k-r-v as Cassandra anti-pattern (or: revision retention policies considered harmful) - https://phabricator.wikimedia.org/T144431#2841953 (10GWicke) >>! In T144431#2841618, @faidon wrote: >> Beyond the explicitly stated objectives for RESTB... [16:49:10] !log restbase deploy start of 1651e35 [16:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:16] 06Operations, 06Labs, 10Labs-Infrastructure, 10netops, and 2 others: Provide instance level ro access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2841969 (10chasemp) [16:55:48] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [16:56:36] 06Operations, 06Labs, 10Labs-Infrastructure, 10netops, and 2 others: Provide read-only access to OpenStack APIs from WMF IP space - https://phabricator.wikimedia.org/T150092#2841976 (10chasemp) [16:57:28] (03CR) 10Reedy: [C: 031] "So this should be good to go now. 
Only (minor) query, is whether there's a better day of the month to run this than the 1st" [puppet] - 10https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) (owner: 10Reedy) [16:58:56] 06Operations, 10ops-eqiad, 10netops: asw2-d-eqiad.mgmt.eqiad - JNX_ALARMS CRITICAL - 2 red alarms, - https://phabricator.wikimedia.org/T152182#2841978 (10Cmjohnson) I fixed the power issue but not sure what this error is referencing. Everything looks up based on physical inspection. 2016-12-01 16:04:41 UTC... [17:01:44] 06Operations, 06Operations-Software-Development, 13Patch-For-Review, 06Services (watching): More verbose messages from service-checker-swagger - https://phabricator.wikimedia.org/T150560#2789590 (10mobrovac) This happened to me today: ``` mobrovac@xenon:~$ check-restbase Generic error: 'NoneType' object... [17:04:03] 06Operations, 10Citoid, 10RESTBase, 10RESTBase-API, and 5 others: Set-up Citoid behind RESTBase - https://phabricator.wikimedia.org/T108646#2841986 (10mobrovac) [17:04:47] 06Operations, 10Citoid, 10RESTBase, 10RESTBase-API, and 5 others: Set-up Citoid behind RESTBase - https://phabricator.wikimedia.org/T108646#1526094 (10mobrovac) @Mvolz would you mind looking into making the Citoid extension using the RESTBase endpoint instead of placing a call to `citoid.wikimedia.org` ? [17:05:28] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [17:08:44] (03CR) 10Aaron Schulz: "The problem with uptime is that a server may have been up a long time before it was pooled. 
I guess it would still catch simple restarts t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316732 (owner: 10Aaron Schulz) [17:15:50] 06Operations, 10ops-eqiad: investigate lead hardware issue - https://phabricator.wikimedia.org/T147905#2842035 (10Cmjohnson) 05Open>03Resolved Added lead to spares list (google tracking sheet) (@RobH ) [17:16:15] (03CR) 10Dzahn: "you have convinced me with your comments on the ticket (and real data! nice! thanks) that we should not do this one but keep downvoting li" [puppet] - 10https://gerrit.wikimedia.org/r/322907 (https://phabricator.wikimedia.org/T144667) (owner: 10Dzahn) [17:16:48] 06Operations, 10ops-eqiad: mw1239: memory scrubbing error - https://phabricator.wikimedia.org/T148421#2842039 (10Cmjohnson) @joe Is it okay to resolve this task? [17:17:09] (03Abandoned) 10Dzahn: puppet-lint: ignore 'lines over 140 chars' warnings [puppet] - 10https://gerrit.wikimedia.org/r/322907 (https://phabricator.wikimedia.org/T144667) (owner: 10Dzahn) [17:18:36] !log restbase deploy end of 1651e35 [17:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:01] 06Operations, 10ops-eqiad: check stat1004 (or another identical R430) for PCIe expansion space - https://phabricator.wikimedia.org/T151080#2842044 (10Cmjohnson) a:05Cmjohnson>03RobH The PSU output is 550W. I think it's safe to say that modifying the current server is not a valid option. 
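The recurring `restbase endpoints health ... Generic error: NoneType object has no attribute __getitem__` alerts, and the `check-restbase` paste on T150560 above, are the classic Python 2 symptom of subscripting `None`. A minimal sketch of the failure mode and an explicit guard; the data and function below are purely illustrative, not service-checker's actual code:

```python
# Illustrative sketch only: not service-checker's real implementation.
# Indexing None raises TypeError; in Python 2 the message is exactly
# "'NoneType' object has no attribute '__getitem__'".

def list_endpoints(spec):
    """Return the sorted paths of a swagger-like spec dict."""
    if spec is None:
        # Fail with a clear message instead of the opaque TypeError
        # that surfaces as "Generic error" in the Icinga check.
        raise ValueError("spec is None: fetching/parsing the spec failed")
    return sorted(spec["paths"])

try:
    list_endpoints(None)
except ValueError as exc:
    print("Clear error: %s" % exc)

print(list_endpoints({"paths": {"/health": {}, "/page": {}}}))
```

Guarding the `None` case early is what T150560 ("More verbose messages from service-checker-swagger") is asking for: the same failure, but with an error message that says what actually went wrong.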
[17:31:33] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (blocked): setup/install restbase-test100[123] - https://phabricator.wikimedia.org/T151075#2842112 (10Cmjohnson) [17:34:16] (03PS1) 10Cmjohnson: Removing dns entries for decommissioned servers mw1017 and mw1099 T151303 [dns] - 10https://gerrit.wikimedia.org/r/324940 [17:34:29] (03CR) 10jenkins-bot: [V: 04-1] Removing dns entries for decommissioned servers mw1017 and mw1099 T151303 [dns] - 10https://gerrit.wikimedia.org/r/324940 (owner: 10Cmjohnson) [17:40:37] (03PS3) 10BBlack: rcstream: single-backend with manual failover [puppet] - 10https://gerrit.wikimedia.org/r/317132 (https://phabricator.wikimedia.org/T147845) [17:40:39] (03PS1) 10BBlack: misc: get rid of hash support and maintenance [puppet] - 10https://gerrit.wikimedia.org/r/324941 [17:40:41] (03PS1) 10BBlack: VCL refactor: split cache/app backend support [puppet] - 10https://gerrit.wikimedia.org/r/324942 [17:42:22] (03CR) 10jenkins-bot: [V: 04-1] VCL refactor: split cache/app backend support [puppet] - 10https://gerrit.wikimedia.org/r/324942 (owner: 10BBlack) [17:43:39] 06Operations, 10ops-eqiad, 06DC-Ops, 13Patch-For-Review, 15User-Joe: Hardware decommission mw1017, mw1099 - https://phabricator.wikimedia.org/T151303#2842122 (10Cmjohnson) Removing final dns remnants and came across this....I'd rather get @joe to make sure it's okay to remove before making any changes. Ple...
[17:46:27] 06Operations, 06Analytics-Kanban: setup/install thorium/wmf4726 as stat1001 replacement - https://phabricator.wikimedia.org/T151816#2842125 (10Cmjohnson) [17:46:29] 06Operations, 10ops-eqiad: update label/racktables visible label for thorium/wmf4726 - https://phabricator.wikimedia.org/T151818#2842123 (10Cmjohnson) 05Open>03Resolved done [17:46:49] (03PS2) 10BBlack: VCL refactor: split cache/app backend support [puppet] - 10https://gerrit.wikimedia.org/r/324942 [17:46:58] 06Operations, 10hardware-requests: Decommission analytics1026 and analytics1015 - https://phabricator.wikimedia.org/T147313#2842126 (10Cmjohnson) p:05Normal>03Low [17:47:13] (03PS1) 10Mobrovac: Citoid: Add the wskey parameter [puppet] - 10https://gerrit.wikimedia.org/r/324943 (https://phabricator.wikimedia.org/T1084) [17:53:48] 06Operations, 06Operations-Software-Development, 13Patch-For-Review, 06Services (watching): More verbose messages from service-checker-swagger - https://phabricator.wikimedia.org/T150560#2842132 (10fgiunchedi) >>! In T150560#2840250, @bearND wrote: > The issue is not directly in Swagger. It's just that a S... [17:56:11] (03PS3) 10BBlack: VCL refactor: split cache/app backend support [puppet] - 10https://gerrit.wikimedia.org/r/324942 [18:03:46] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, and 2 others: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2842146 (10ellery) @RobH Thank you for the thorough investigation :). Now we know that the stat machines cannot accommodate a top-of-the-line GPU. Tha... [18:07:15] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: Remove labsdb1008- it was renamed to db1095 [puppet] - 10https://gerrit.wikimedia.org/r/324930 (owner: 10Jcrespo) [18:11:18] PROBLEM - puppet last run on es1017 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:13:23] (03PS6) 10Filippo Giunchedi: Add fontconfig file for the pdf render service [puppet] - 10https://gerrit.wikimedia.org/r/324747 (owner: 10GWicke) [18:16:49] (03CR) 10Filippo Giunchedi: [C: 032] Add fontconfig file for the pdf render service [puppet] - 10https://gerrit.wikimedia.org/r/324747 (owner: 10GWicke) [18:18:54] !ping [18:19:03] am I here? [18:19:27] (03PS4) 10BBlack: rcstream: single-backend with manual failover [puppet] - 10https://gerrit.wikimedia.org/r/317132 (https://phabricator.wikimedia.org/T147845) [18:19:29] (03PS2) 10BBlack: misc: get rid of hash support and maintenance [puppet] - 10https://gerrit.wikimedia.org/r/324941 [18:19:32] (03PS4) 10BBlack: VCL refactor: split cache/app backend support [puppet] - 10https://gerrit.wikimedia.org/r/324942 [18:19:33] (03PS1) 10BBlack: simplify security_audit backend for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/324947 [18:19:47] I am I think [18:19:54] yes [18:20:10] jouncebot: greet yuvi [18:20:20] (03PS1) 10Filippo Giunchedi: pdfrender: more restrictive permissions on .config [puppet] - 10https://gerrit.wikimedia.org/r/324948 [18:20:23] :D [18:22:25] (03CR) 10Filippo Giunchedi: [C: 032] pdfrender: more restrictive permissions on .config [puppet] - 10https://gerrit.wikimedia.org/r/324948 (owner: 10Filippo Giunchedi) [18:23:43] (03CR) 10GWicke: pdfrender: more restrictive permissions on .config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/324948 (owner: 10Filippo Giunchedi) [18:27:55] (03PS5) 10BBlack: VCL refactor: split cache/app backend support [puppet] - 10https://gerrit.wikimedia.org/r/324942 [18:31:17] (03CR) 10Filippo Giunchedi: pdfrender: more restrictive permissions on .config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/324948 (owner: 10Filippo Giunchedi) [18:32:39] (03PS1) 10Filippo Giunchedi: pdfrender: 0500 for ~/.config [puppet] - 10https://gerrit.wikimedia.org/r/324953 [18:35:15] (03PS1) 
10Chad: Gerrit: No need for gerrit's private key to be writable [puppet] - 10https://gerrit.wikimedia.org/r/324954 [18:35:28] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] "CI is busy with mediawiki tests this time of the day" [puppet] - 10https://gerrit.wikimedia.org/r/324953 (owner: 10Filippo Giunchedi) [18:36:40] 06Operations, 10ops-eqiad: check stat1004 (or another identical R430) for PCIe expansion space - https://phabricator.wikimedia.org/T151080#2842259 (10RobH) a:05RobH>03Cmjohnson Can you advise how much space there is for a card in the system? We may end up adding in another card like the GeForce GT 730, its... [18:39:18] RECOVERY - puppet last run on es1017 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [18:40:07] (03CR) 10Filippo Giunchedi: [C: 032] Upgrade to 0.1.30 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/324919 (https://phabricator.wikimedia.org/T150758) (owner: 10Gilles) [18:41:33] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, and 2 others: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2842281 (10RobH) So the Titan line is right out at 10.5" long. There are others that are half that size in the next series down: http://www.geforce.... [18:42:52] (03CR) 10Alex Monk: "I reverted it because it turned out to have been merged without anyone prepared to actually restart things to put the change into effect.
" [puppet] - 10https://gerrit.wikimedia.org/r/324766 (https://phabricator.wikimedia.org/T151570) (owner: 10Alex Monk) [18:46:10] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2842298 (10chasemp) [18:46:13] (03CR) 10Alex Monk: Redirect m.wikipedia.org to portal (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/285936 (https://phabricator.wikimedia.org/T69015) (owner: 10Dereckson) [18:46:48] PROBLEM - puppet last run on mw1298 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:48:18] !log deploy thumbor 0.1.30 to thumbor100[12] [18:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:06] 06Operations, 06Security-Team: Allow the production cluster to access *.wmflabs.org IPs - https://phabricator.wikimedia.org/T95714#2842299 (10Dereckson) [18:50:10] 06Operations, 06Security-Team: Allow the production cluster to access *.wmflabs.org IPs - https://phabricator.wikimedia.org/T95714#1198268 (10Dereckson) I guess we could mark this resolved, as it describes an old situation, now working per T95714#1470497 test. [18:50:19] (03CR) 10Filippo Giunchedi: [C: 031] gerrit (2.13.3-wmf.1) jessie-wikimedia; urgency=low [debs/gerrit] - 10https://gerrit.wikimedia.org/r/323545 (https://phabricator.wikimedia.org/T146350) (owner: 10Chad) [18:53:38] PROBLEM - DPKG on thumbor1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:54:44] (03PS2) 10Dereckson: Redirect m.wikipedia.org to portal [puppet] - 10https://gerrit.wikimedia.org/r/285936 (https://phabricator.wikimedia.org/T69015) [18:55:05] (03CR) 10Alex Monk: "Weird. When I ran this, it produced the same file with a different timestamp." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324918 (https://phabricator.wikimedia.org/T152201) (owner: 10Aude) [18:55:16] (03CR) 10Dereckson: "Yes, it should." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/285936 (https://phabricator.wikimedia.org/T69015) (owner: 10Dereckson) [18:55:38] RECOVERY - DPKG on thumbor1002 is OK: All packages OK [18:57:50] (03PS1) 10Tim Landscheidt: Tools: Enable PHP module mcrypt on Trusty execution nodes [puppet] - 10https://gerrit.wikimedia.org/r/324957 (https://phabricator.wikimedia.org/T97857) [19:03:48] !log rollback python-thumbor-wikimedia to 0.1.29 [19:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:18] !log depooling all services on scb1001 for service restart [19:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:32] !log scb1001 - restarting all (-oid) services [19:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:52] (03PS1) 10Rush: labstore: centralize snapshot-manager into bsync module [puppet] - 10https://gerrit.wikimedia.org/r/324958 [19:09:15] (03PS2) 10Rush: labstore: centralize snapshot-manager into bsync module [puppet] - 10https://gerrit.wikimedia.org/r/324958 [19:09:16] !log scb1001 - re-pooling all services [19:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:11] (03CR) 10Alex Monk: Redirect m.wikipedia.org to portal (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/285936 (https://phabricator.wikimedia.org/T69015) (owner: 10Dereckson) [19:13:54] (03CR) 10Tim Landscheidt: "Tested on Toolsbeta." 
[puppet] - 10https://gerrit.wikimedia.org/r/324957 (https://phabricator.wikimedia.org/T97857) (owner: 10Tim Landscheidt) [19:14:20] (03CR) 10Rush: [C: 032 V: 032] labstore: centralize snapshot-manager into bsync module [puppet] - 10https://gerrit.wikimedia.org/r/324958 (owner: 10Rush) [19:14:22] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor should reject some malformed thumbnail URLs - https://phabricator.wikimedia.org/T150749#2842411 (10fgiunchedi) While deploying 0.1.30 I found out that the `/healthcheck` endpoint returns 404 ``` filippo@deployment-imagescaler01:~$ cu... [19:14:48] gilles: ^ re T150749 looks like it might make some endpoints fail, e.g. healthcheck [19:14:48] T150749: Thumbor should reject some malformed thumbnail URLs - https://phabricator.wikimedia.org/T150749 [19:15:30] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Upgrade gifsicle to 1.88-2~bpo8+1 on Thumbor boxes - https://phabricator.wikimedia.org/T151565#2842414 (10fgiunchedi) 05Open>03Resolved This is complete [19:15:31] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor errors on some GIF files - https://phabricator.wikimedia.org/T151455#2842416 (10fgiunchedi) [19:15:34] 06Operations, 10Mail, 10MediaWiki-Email, 10Wikimedia-General-or-Unknown, and 3 others: Email server's DMARC config prevents users from sending emails via Special:EmailUser - https://phabricator.wikimedia.org/T66795#2842413 (10Legoktm) I am totally fine with updating the footer message to make it clearer th... 
[19:15:48] RECOVERY - puppet last run on mw1298 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [19:22:30] !log scb1002 - depooling, restarting services, repooling [19:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:12] !log dzahn@puppetmaster1001 conftool action : set/pooled=no; selector: name=scb1002.eqiad.wmnet,service=apertium [19:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:19] !log dzahn@puppetmaster1001 conftool action : set/pooled=yes; selector: name=scb1002.eqiad.wmnet,service=apertium [19:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:51] 06Operations, 10Wikimedia-Logstash: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451#2842437 (10fgiunchedi) >>! In T149451#2753358, @bd808 wrote: > What kind of log event volume would we be adding here? Something like doubling the current hhvm.log volume? I took another look at th... [19:35:16] !log scb1003 - depooling, restarting services, repooling [19:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:25] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor should reject some malformed thumbnail URLs - https://phabricator.wikimedia.org/T150749#2842450 (10Gilles) try localhost:8802/thumbor/healthcheck [19:36:48] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[bdsync] [19:37:05] ^looking [19:38:31] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor should reject some malformed thumbnail URLs - https://phabricator.wikimedia.org/T150749#2842451 (10Gilles) Hmm yeah, doesn't seem to work either on deployment-imagescaler01 [19:39:54] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor should reject some malformed thumbnail URLs - https://phabricator.wikimedia.org/T150749#2842454 (10Gilles) Oh, but the new core handler has to be added to the configuration before the prefix can possibly work. I'll prepare a patch for... [19:45:26] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor should reject some malformed thumbnail URLs - https://phabricator.wikimedia.org/T150749#2842462 (10Gilles) Not enough for healthcheck, just like core it will need to get a custom wrapper to define the regexp that was hardcoded in the... [19:45:59] (03PS2) 10Andrew Bogott: Keystone hook: Change project id to == project name [puppet] - 10https://gerrit.wikimedia.org/r/324928 (https://phabricator.wikimedia.org/T150091) [19:46:01] (03PS1) 10Andrew Bogott: Keystone: set default project membership role to 'user' [puppet] - 10https://gerrit.wikimedia.org/r/324962 [19:46:03] (03PS1) 10Andrew Bogott: Keystone: add 'observer' domain [puppet] - 10https://gerrit.wikimedia.org/r/324963 (https://phabricator.wikimedia.org/T150092) [19:47:43] (03CR) 10jenkins-bot: [V: 04-1] Keystone: add 'observer' domain [puppet] - 10https://gerrit.wikimedia.org/r/324963 (https://phabricator.wikimedia.org/T150092) (owner: 10Andrew Bogott) [19:47:48] (03CR) 10Andrew Bogott: [C: 032] Keystone: set default project membership role to 'user' [puppet] - 10https://gerrit.wikimedia.org/r/324962 (owner: 10Andrew Bogott) [19:51:56] (03PS6) 10BBlack: VCL refactor: split cache/app backend support [puppet] - 10https://gerrit.wikimedia.org/r/324942 [19:54:16] 06Operations, 
06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor should reject some malformed thumbnail URLs - https://phabricator.wikimedia.org/T150749#2842471 (10Gilles) [19:55:00] (03PS1) 10Gilles: Upgrade to 0.1.31 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/324965 (https://phabricator.wikimedia.org/T150749) [19:59:16] (03PS2) 10Andrew Bogott: Keystone: add 'observer' domain [puppet] - 10https://gerrit.wikimedia.org/r/324963 (https://phabricator.wikimedia.org/T150092) [19:59:18] (03PS3) 10Andrew Bogott: Keystone hook: Change project id to == project name [puppet] - 10https://gerrit.wikimedia.org/r/324928 (https://phabricator.wikimedia.org/T150091) [20:00:22] (03CR) 10Filippo Giunchedi: [C: 032] Upgrade to 0.1.31 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/324965 (https://phabricator.wikimedia.org/T150749) (owner: 10Gilles) [20:00:24] (03PS1) 10Gilles: Reintroduce Thumbor core handlers through wrappers [puppet] - 10https://gerrit.wikimedia.org/r/324967 (https://phabricator.wikimedia.org/T150749) [20:03:06] (03PS1) 10Rush: Revert "labstore: centralize snapshot-manager into bsync module" [puppet] - 10https://gerrit.wikimedia.org/r/324971 [20:03:11] (03PS2) 10Rush: Revert "labstore: centralize snapshot-manager into bsync module" [puppet] - 10https://gerrit.wikimedia.org/r/324971 [20:03:27] !log scb1004 - depooling, restarting services, repooling [20:03:35] (03CR) 10Rush: [C: 032 V: 032] Revert "labstore: centralize snapshot-manager into bsync module" [puppet] - 10https://gerrit.wikimedia.org/r/324971 (owner: 10Rush) [20:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:48] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [20:11:56] (03PS1) 10Chad: Gerrit: Swap to using openjdk8 [puppet] - 10https://gerrit.wikimedia.org/r/324972 [20:12:29] (03PS2) 10Filippo Giunchedi: Reintroduce Thumbor core handlers through 
wrappers [puppet] - 10https://gerrit.wikimedia.org/r/324967 (https://phabricator.wikimedia.org/T150749) (owner: 10Gilles) [20:12:55] (03CR) 10Paladox: [C: 031] Gerrit: Swap to using openjdk8 [puppet] - 10https://gerrit.wikimedia.org/r/324972 (owner: 10Chad) [20:15:38] PROBLEM - puppet last run on db1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:15:42] (03CR) 10Filippo Giunchedi: [C: 032] Reintroduce Thumbor core handlers through wrappers [puppet] - 10https://gerrit.wikimedia.org/r/324967 (https://phabricator.wikimedia.org/T150749) (owner: 10Gilles) [20:17:01] (03CR) 10Thcipriani: [C: 031] Move mwdeploy home to /var/lib where it belongs, it's a system user [puppet] - 10https://gerrit.wikimedia.org/r/323867 (https://phabricator.wikimedia.org/T86971) (owner: 10Chad) [20:17:11] (03CR) 10Chad: "There's no useful content in /home/mwdeploy, and I can find no references to it outside of this file, should be safe." [puppet] - 10https://gerrit.wikimedia.org/r/323867 (https://phabricator.wikimedia.org/T86971) (owner: 10Chad) [20:18:02] !log scb2001 - depooling, restarting services, repooling [20:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:38] 06Operations, 10Analytics, 10Analytics-Cluster, 06Research-and-Data, and 2 others: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2842537 (10RobH) The Dell PowerEdge R730 can add up to two add on GPUs, via their own ordering during the time of system build. We don't have any exp... 
[20:21:18] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 798253 msg: ocg_render_job_queue 0 msg [20:21:18] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 798258 msg: ocg_render_job_queue 0 msg [20:21:28] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 798264 msg: ocg_render_job_queue 0 msg [20:24:51] (03Restored) 10Dzahn: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [20:28:19] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor should reject some malformed thumbnail URLs - https://phabricator.wikimedia.org/T150749#2842566 (10fgiunchedi) @Gilles thanks! the healthcheck now works, I'm seeing some 500s in the error log perhaps still related to changes in handle... [20:30:14] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor should reject some malformed thumbnail URLs - https://phabricator.wikimedia.org/T150749#2842569 (10Gilles) That looks like the codepath that happens when the request storage is still turned on. Should have been turned off by: https://... [20:31:29] !log scb2002 - depooling, restarting services, repooling [20:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:31] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor should reject some malformed thumbnail URLs - https://phabricator.wikimedia.org/T150749#2842573 (10Gilles) Yeah I see that it's still defined somewhere in the config files found on thumbor1001. Maybe it's coming from the Debian packag... 
[20:33:09] 06Operations, 10Ops-Access-Requests: Root for Mukunda for Gerrit machine(s) - https://phabricator.wikimedia.org/T152236#2842574 (10demon) [20:33:16] gilles: I think that's because the linked review is for vagrant [20:33:27] hah, yeah [20:33:29] (03CR) 10Dzahn: Phabricator: rsync /srv/repos from iridium to phab2001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [20:33:34] I'll make the same change for puppet [20:34:59] (03CR) 1020after4: "why use rsync daemon? couldn't we just rsync over ssh?" [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [20:35:14] (03PS1) 10Gilles: Stop using request storage in Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/324974 (https://phabricator.wikimedia.org/T150757) [20:35:21] gilles: ok thanks! [20:35:24] (03PS2) 10Gilles: Stop using request storage in Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/324974 (https://phabricator.wikimedia.org/T150757) [20:35:26] (03CR) 10Dzahn: "either we'd have to break up the regex in site.pp and just include it on 2001.. 
or it needs code in the migration.pp to make sure it only " [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [20:36:10] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor should reject some malformed thumbnail URLs - https://phabricator.wikimedia.org/T150749#2842592 (10Gilles) https://gerrit.wikimedia.org/r/#/c/324974/ [20:36:54] (03CR) 10Filippo Giunchedi: [C: 032] Stop using request storage in Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/324974 (https://phabricator.wikimedia.org/T150757) (owner: 10Gilles) [20:36:56] (03CR) 10Dzahn: "no we cant rsync over ssh between them unless we manually mess with ferm and key forwarding and stuff, this is the cleaner puppetized vers" [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [20:37:54] (03CR) 10Dzahn: "either we can just remove it again after we are done,, or we can consider it a feature that it just always running on the non-active serve" [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [20:39:37] !log upgrade thumbor to 0.1.31 on thumbor100[12] [20:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:02] 06Operations, 06Performance-Team, 10Thumbor: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2842601 (10Gilles) [20:40:04] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Match cache headers between thumbor and mediawiki - https://phabricator.wikimedia.org/T150642#2842600 (10Gilles) 05Open>03Resolved [20:40:16] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Improve Content-Disposition support in Thumbor - https://phabricator.wikimedia.org/T151072#2842616 (10Gilles) 05Open>03Resolved [20:40:18] 06Operations, 06Performance-Team, 10Thumbor: add thumbor to production infrastructure - 
https://phabricator.wikimedia.org/T139606#2636499 (10Gilles) [20:41:14] !log scb2003 - depooling, restarting services, repooling [20:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:02] !log roll-restart pdfrender on scb1* [20:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:18] (03CR) 1020after4: "I'd rather this be a one-off than an ongoing rsync job. Once phabricator is configured for clustering it'll get completely confused if som" [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [20:43:25] (03CR) 10Chad: [C: 04-1] "Just so I don't accidentally it." [puppet] - 10https://gerrit.wikimedia.org/r/324972 (owner: 10Chad) [20:43:38] RECOVERY - puppet last run on db1031 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [20:43:58] https://gerrit.wikimedia.org/r/#/c/324954/ - super trivial permission fix for gerrit if someone's got a sec. [20:44:09] No-op as far as gerrit's concerned, won't need service restart or anything [20:44:19] (03CR) 1020after4: "and we already have ferm rules for ssh between the servers. ssh keys are also set up between the hosts." [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [20:44:49] (03CR) 10Dzahn: "i did not mean anything automatic, no cronjob or anything. i just meant the fact that rsyncd is running and accepted connections from the " [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [20:46:49] 06Operations, 10Ops-Access-Requests: Root for Mukunda for Gerrit machine(s) - https://phabricator.wikimedia.org/T152236#2842638 (10greg) Yes please.
[20:47:09] (03CR) 10Dzahn: "this just gives the ability to rsync the repos over with a simple command, but running that command is totally still a manual thing on pur" [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [20:47:54] 06Operations, 10Ops-Access-Requests: Root for Mukunda for Gerrit machine(s) - https://phabricator.wikimedia.org/T152236#2842574 (10Paladox) +1 [20:49:31] (03CR) 10Dzahn: "like i said we can also just remove the include again after we rsynced it a single time. but if it's for disaster recovery we want to repe" [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [20:50:03] 06Operations: should we make privatewiki list available to puppet without maintaining two lists? - https://phabricator.wikimedia.org/T152100#2842656 (10demon) Another option: move the dblists to a separate git repo (submodule of mw-config), then this repo would be available to anything needing them. [20:53:42] (03CR) 10Dzahn: "it can be considered a one-off and simply be reverted, but i'd still prefer doing it the clean puppet way instead of "disable puppet, disa" [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [20:54:30] (03PS1) 10GWicke: Whitelist /home/pdfrender/.config in firejail profile [puppet] - 10https://gerrit.wikimedia.org/r/324976 [20:55:57] twentyafterfour ^^ [20:57:33] ACKNOWLEDGEMENT - Check systemd state on phab2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
daniel_zahn it's normal that phd is stopped on the non-active server [20:58:16] (03CR) 1020after4: [C: 031] "dzahn: I definitely didn't mean to suggest " "disable puppet, disable firewalling, manually run rsync command, check permissions, cleanup"" [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [21:01:24] marostegui: hi! yt? [21:02:40] I'm trying to figure out an issue with CentralNotice (which displays Fundraising banners), trying to see if maybe it's database-related.... [21:02:43] whatever Exec[/usr/local/bin/labs-ip-alias-dump.py] [21:02:57] https://phabricator.wikimedia.org/T152122 [21:03:01] does, but it has some kind of issue it seems on labtestservices [21:03:25] well, "test" so just saying [21:05:38] !log scb2004 - depooling, restarting services, repooling [21:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:42] (03CR) 10Yuvipanda: Tools: Enable PHP module mcrypt on Trusty execution nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/324957 (https://phabricator.wikimedia.org/T97857) (owner: 10Tim Landscheidt) [21:20:09] (03CR) 10Andrew Bogott: "This turns out not to work because keystone is (unsurprisingly) broken."
[puppet] - 10https://gerrit.wikimedia.org/r/324963 (https://phabricator.wikimedia.org/T150092) (owner: 10Andrew Bogott) [21:22:24] (03CR) 10Paladox: [C: 031] Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) [21:25:01] (03PS1) 10Dzahn: (WIP) services: create global service restart script [puppet] - 10https://gerrit.wikimedia.org/r/325039 [21:25:03] (03PS7) 10Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) [21:25:09] (03PS8) 10Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) [21:25:13] mutante ^^ something like that [21:25:18] so we don't need to change the node [21:25:24] or split the regex [21:25:34] 06Operations, 06Performance-Team, 10Thumbor: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2842762 (10Gilles) [21:25:36] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Fix memory leaks in Thumbor plugins - https://phabricator.wikimedia.org/T150757#2842760 (10Gilles) 05Open>03Resolved Time will tell if all the leaks are fixed, but the bulk of the problem seems to have been solved, since I see that all th...
[21:25:49] (03PS9) 10Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) [21:25:55] (03CR) 10Gergő Tisza: Set $wgSoftBlockRanges (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324215 (owner: 10Anomie) [21:27:00] (03CR) 10jenkins-bot: [V: 04-1] (WIP) services: create global service restart script [puppet] - 10https://gerrit.wikimedia.org/r/325039 (owner: 10Dzahn) [21:28:44] (03CR) 10jenkins-bot: [V: 04-1] Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [21:30:08] (03PS10) 10Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) [21:30:44] mutante could you run puppet compiler on ^^ please? [21:31:14] 06Operations, 06Performance-Team, 10Thumbor: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2842784 (10Gilles) [21:31:17] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor should know about the lossless thumbnail parameter - https://phabricator.wikimedia.org/T150758#2842783 (10Gilles) 05Open>03Resolved [21:32:25] (03PS2) 10Dzahn: (WIP) services: create global service restart script [puppet] - 10https://gerrit.wikimedia.org/r/325039 [21:32:31] paladox: in a minute [21:32:39] Ok thanks [21:33:19] (03CR) 10Paladox: [C: 031] "@Dzahn would we be able to merge this please?"
[puppet] - 10https://gerrit.wikimedia.org/r/324832 (https://phabricator.wikimedia.org/T152132) (owner: 10Paladox) [21:33:24] (03PS4) 10Paladox: Phabricator: Set domain for phab2001 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/324832 (https://phabricator.wikimedia.org/T152132) [21:34:51] (03CR) 10jenkins-bot: [V: 04-1] (WIP) services: create global service restart script [puppet] - 10https://gerrit.wikimedia.org/r/325039 (owner: 10Dzahn) [21:34:58] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [21:36:03] 06Operations, 06Performance-Team, 10Thumbor: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2842793 (10Gilles) [21:36:05] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor hangs on some TIFF files - https://phabricator.wikimedia.org/T151454#2842792 (10Gilles) 05Open>03Resolved [21:38:11] 06Operations, 10Mail, 10MediaWiki-Email, 10Wikimedia-General-or-Unknown, and 3 others: Email server's DMARC config prevents users from sending emails via Special:EmailUser - https://phabricator.wikimedia.org/T66795#2842797 (10Legoktm) [21:39:12] 06Operations, 06Performance-Team, 10Thumbor: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2842801 (10Gilles) [21:39:14] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor errors on some GIF files - https://phabricator.wikimedia.org/T151455#2842799 (10Gilles) 05Open>03Resolved Thumbor now doing what Mediawiki can't on those files: ``` Dec 2 21:38:24 ms-fe1001 proxy-server: HTTP status code mismatc... [21:39:18] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:40:05] (03CR) 10Krinkle: "Why is that weird?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/324918 (https://phabricator.wikimedia.org/T152201) (owner: 10Aude) [21:41:48] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor can't render a few SVGs that Mediawiki can - https://phabricator.wikimedia.org/T150754#2842809 (10Gilles) I can still trigger a 500 for http://upload.wikimedia.org/wikipedia/commons/thumb/e/e9/Northumberland_in_England.svg/62px-Northu... [21:42:54] 06Operations, 06Performance-Team, 10Thumbor: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2842812 (10Gilles) [21:42:58] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor errors when %0A is in the filename part of the request - https://phabricator.wikimedia.org/T151453#2842811 (10Gilles) 05Open>03Resolved [21:44:04] 06Operations, 06Performance-Team, 10Thumbor: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2640758 (10Gilles) [21:44:06] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor 404s when the original has a ? in its filename - https://phabricator.wikimedia.org/T150760#2842813 (10Gilles) 05Open>03Resolved [21:45:13] 06Operations, 06Performance-Team, 10Thumbor: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2842819 (10Gilles) [21:45:15] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Thumbor should reject some malformed thumbnail URLs - https://phabricator.wikimedia.org/T150749#2842818 (10Gilles) 05Open>03Resolved [21:45:19] 06Operations, 10Citoid, 10RESTBase, 10RESTBase-API, and 5 others: Set-up Citoid behind RESTBase - https://phabricator.wikimedia.org/T108646#2842822 (10kaldari) [21:47:03] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/4777/" [puppet] - 10https://gerrit.wikimedia.org/r/324832 (https://phabricator.wikimedia.org/T152132) (owner: 10Paladox) [21:47:38] (03CR) 10Paladox: "Thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/324832 (https://phabricator.wikimedia.org/T152132) (owner: 10Paladox) [21:47:58] (03CR) 10Paladox: "Should we also merge this too please?" [puppet] - 10https://gerrit.wikimedia.org/r/324797 (https://phabricator.wikimedia.org/T137928) (owner: 10Dzahn) [21:48:04] (03CR) 10Dzahn: "well, right after merge i noticed it says "wmflabs.org' but that's not what we are doing, but anyways, it does what was intended" [puppet] - 10https://gerrit.wikimedia.org/r/324832 (https://phabricator.wikimedia.org/T152132) (owner: 10Paladox) [21:49:37] paladox: works, i see on phab2001 several places where it gets fixed [21:49:44] Oh :) :) [21:49:57] apache rewrite rules, virtual host. taskcreation URL [21:50:06] well that last one, is not gonna be right [21:50:07] mutante https://gerrit.wikimedia.org/r/324797 that should make the domain work properly :) [21:50:21] lol [21:50:37] -taskcreation = task@phabricator.wikimedia.org [21:50:37] +taskcreation = task@phabricator-new.wikimedia.org [21:50:51] that may work [21:51:05] this one looks good: [21:51:06] -host = https://phabricator.wikimedia.org/api/ [21:51:06] +host = https://phabricator-new.wikimedia.org/api/ [21:51:07] as long as codfw can connect to the wikimedia email system [21:51:09] for the bot [21:51:09] :) [21:51:32] i really don't know about the taskbymail stuff [21:51:44] oh [21:51:45] mutante https://gerrit.wikimedia.org/r/324797 should allow us to access it :).
[21:56:39] (03PS11) 10Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) [21:56:54] (03CR) 10jenkins-bot: [V: 04-1] Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [21:57:12] (03PS1) 10Merlijn van Deen: Add tools hiera common.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/325041 [21:57:50] 06Operations, 06Commons, 06Multimedia, 10media-storage, 15User-Josve05a: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101#2842864 (10MarkTraceur) p:05High>03Normal I don't see this as "high" priority, but I'm willing to be co... [22:01:54] (03CR) 10Dzahn: Phabricator: rsync /srv/repos from iridium to phab2001 (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [22:02:46] (03CR) 10Dzahn: [C: 04-1] "please see inline comments on PS10. i want to use "role/common/phabricator/main.yaml:phabricator_active_server" setting to decide if rsync" [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [22:02:56] bblack: ema: who owns GeoIP lookup infrastructure? [22:02:58] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [22:05:41] (03PS1) 10Gerrit Patch Uploader: Fix up puppet-compiler for labs usage [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/325042 [22:05:43] (03CR) 10Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/325042 (owner: 10Gerrit Patch Uploader) [22:06:16] mutante: Got a sec? 
https://gerrit.wikimedia.org/r/#/c/324954/ is super trivial :) [22:07:18] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [22:08:02] (03PS2) 10Dzahn: Gerrit: No need for gerrit's private key to be writable [puppet] - 10https://gerrit.wikimedia.org/r/324954 (owner: 10Chad) [22:08:04] yep, ok [22:10:50] mutante: ty! [22:11:22] (03CR) 10Dzahn: [C: 032] Gerrit: No need for gerrit's private key to be writable [puppet] - 10https://gerrit.wikimedia.org/r/324954 (owner: 10Chad) [22:11:44] merged on master [22:15:29] It doesn't need a force run [22:15:32] It won't do anything really [22:16:51] (03PS12) 10Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) [22:17:10] (03PS13) 10Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) [22:17:14] (03PS14) 10Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) [22:18:18] (03CR) 10jenkins-bot: [V: 04-1] Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [22:18:55] (03PS15) 10Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) [22:19:21] !log restarting salt-minion on mw-canary [22:19:30] (03PS16) 10Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) [22:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:54] (03CR) 10Tim Landscheidt: Tools: Enable PHP module mcrypt on Trusty execution nodes 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/324957 (https://phabricator.wikimedia.org/T97857) (owner: 10Tim Landscheidt) [22:23:42] jouncebot: next [22:23:42] In 61 hour(s) and 36 minute(s): ElectronPdfService extension (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161205T1200) [22:24:22] !log restarting salt-minion on all appservers (via debdeploy -s all-mw) [22:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:15] AndyRussG: it really depends what you mean by that [22:26:42] bblack: I'm thinking mainly of the db backend (whatever that may be, I don't know) [22:27:01] AndyRussG: I assume you mean from the perspective of CN and the GeoIP cookie right? [22:27:05] We had another severe dip in banners this morning [22:27:07] Yep [22:27:34] (03CR) 10Paladox: "@BBlack would you be able to review this please? It needs a +1 from traffic." [puppet] - 10https://gerrit.wikimedia.org/r/324797 (https://phabricator.wikimedia.org/T137928) (owner: 10Dzahn) [22:27:35] that "database" is just a file. We pay a commercial vendor for routine updates to the db file, a library loads it up in memory, etc [22:28:07] (03CR) 10Dzahn: "this looks good now, that's what i meant, yea. this makes sure we don't have to edit anything here when the active server changes. just on" [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [22:28:10] the updates are infrequent, perhaps once a week at best? they wouldn't be a cause of a daily pattern I don't think, but I can look into details [22:28:25] but the whole file just gets replaced and reloaded at runtime [22:28:35] there's no "database" in the traditional sense like a DB server or mysql or anything [22:28:37] (03CR) 10Paladox: "Ok" [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [22:28:53] bblack: K... hmmm...
Yeah I'm just grasping at straws still :( [22:28:54] (03PS1) 10Merlijn van Deen: Puppet: refactor puppet-enc include [puppet] - 10https://gerrit.wikimedia.org/r/325046 [22:28:56] (03CR) 10Dzahn: "P.S. it's not really "rsync repos" it's just "allow rsyncing", it sounds like something automatic but it's not, it just makes it possible" [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [22:29:24] AndyRussG: looking at just one server, it looks like the last DB update was synced out at Nov 27 03:47 [22:29:48] Is it possible that something might happen at some DC but not others that could make that file not accessible, so that Varnish hosts can't read it properly for a time? [22:30:15] The file is copied onto hosts directly, I guess, and loaded into memory? [22:30:15] (03PS17) 10Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) [22:31:06] AndyRussG: no, it's a very simplistic process that syncs that file, and all of the servers last updated around that same time on Nov 27, +/- about 30 mins [22:31:37] bblack: so all the servers have a local copy of it? [22:31:50] AndyRussG: be careful of inferring DC patterns. I saw some of that traffic on the ticket, but usually that's because different DCs have a different blend of regional clients, rather than a per-DC fault on our end [22:32:07] AndyRussG: yes, it's just a file sent out to all servers once in a while. they all have it as a local file on disk [22:33:01] bblack: Hmmm K... So basically there is nothing that could happen that would make lots of servers not be able to read it properly all at once, right? [22:33:25] Also, not sure what u mean by, "different DCs have a different blend of regional clients, rather than a per-DC fault on our end"... Could you elaborate pls? 
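[editor's note] bblack's description above — the GeoIP "database" is just a file that a sync job periodically replaces on each server, and a library reloads it into memory at runtime — can be sketched as follows. This is an illustrative Python stand-in, not WMF's actual code: the real database is a MaxMind binary file read by a C library, whereas this sketch uses JSON; the class and file names are invented.

```python
import json
import os

class GeoFileCache:
    """Keep an on-disk lookup file loaded in memory; reload it whenever the
    file on disk has been replaced (detected via mtime). Mimics the pattern
    described in the log: no DB server, just a periodically-synced file."""

    def __init__(self, path):
        self.path = path
        self._mtime = None   # mtime of the copy currently in memory
        self._data = {}

    def _maybe_reload(self):
        mtime = os.stat(self.path).st_mtime
        if mtime != self._mtime:   # the sync job swapped the file out
            with open(self.path) as f:
                self._data = json.load(f)
            self._mtime = mtime

    def lookup(self, ip):
        self._maybe_reload()
        return self._data.get(ip)
```

Because the whole file is replaced atomically and reloaded, a failure mode where "lots of servers can't read it properly all at once" would require the sync itself to push a bad file everywhere, which is why bblack checks the last sync timestamps.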
[22:34:31] AndyRussG: what I mean is when you see different statistical results for, say, esams vs eqiad, it's most-often because different fractions of the globe commonly talk to esams vs eqiad. Not because of an actual technical glitch inside our servers that affects esams but not eqiad. [22:37:58] bblack: ah yes, indeed. But it may be a factor we could take into account to isolate the problem. That is, we haven't found any criteria (project, language, country) that has a complete blackout during this time. But if there were an issue at only one DC, I think what you just mentioned ^ could explain why some users did still get banners ('cause they're the minority visiting the CN-targeted [22:38:00] lang, project from a different DC than most). So maybe if we combine DC and some other criteria, we'll find where things went completely off, and that could be key in finding the issue.... [22:38:14] Tho again, I'm really grasping at pixelated straws [22:38:34] if we don't have a solid lead, it could be anything [22:40:23] bblack: yep! 
[22:40:46] but if there's any kind of subtle pattern anywhere (and there probably is, we just haven't found it yet), whatever that pattern is (region of the globe, types/versions of browser, client network speed, who knows) - any kind of pattern elsewhere in any other aspect of this, will *also* tend to show as a pattern of differing effects at different DCs of ours [22:41:06] because they all have distinct mixes of those things [22:41:26] so almost everything subtle we look at can seem like "a problem with one datacenter" when it usually isn't [22:41:26] Hmmm interesting yeah [22:41:35] K great point [22:41:53] !log restarting salt-minion on all analytics servers [22:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:02] AndyRussG: if we limit ourselves to thinking about the GeoIP part of the puzzle here, if anything the most likely problem would be something related to the cookies themselves, rather than the database [22:43:15] e.g. when/how they expire or are re-set, etc [22:43:39] bblack: I think they just last a year, no? [22:44:03] (03PS2) 10Filippo Giunchedi: Whitelist /home/pdfrender/.config in firejail profile [puppet] - 10https://gerrit.wikimedia.org/r/324976 (owner: 10GWicke) [22:44:26] AndyRussG: no, I don't think that's correct. They're session cookies. [22:44:38] (no actual fixed lifetime, but go away when browser closes) [22:46:09] again grasping at thin straws, but: we could have the kind of issue where on the very first request from a fresh browser, when we initially set the cookie, it's not in time for CN to see (under some conditions with some UAs?), and that tends to happen mostly at certain times of day when people are firing up new browser sessions for the day (your times are EU morning -ish) [22:46:17] bblack: ah hmm K... Yeah I thought I'd seen a different behaviour, but I must have not checked carefully.
Now I see they're session [22:46:31] (03PS1) 10Merlijn van Deen: Add fake clushuser keypair [labs/private] - 10https://gerrit.wikimedia.org/r/325050 [22:46:52] i.e. a race condition between initially setting the cookie in the browser from that first request to us, and it being availble for JS to read very shortly after [22:46:59] bblack: I dunno, it's a pretty sever dropoff... [22:47:08] (03PS18) 10Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) [22:47:13] interesting theory... [22:47:29] Also, it doesn't come back on gradually, but jumps up at the end of the hour [22:47:48] yeah [22:48:36] the rest of the machinery of this (RL and the CN JS, etc) is pretty opaque and complex to me though, I don't have much insight once we start looking at that level [22:49:08] nothing in the code that sets the cookies is sensitive to specific times of day, though [22:49:19] (03CR) 10Merlijn van Deen: "FWIW, puppet-compiler for the tools-puppetmaster only reports the file source change:" [puppet] - 10https://gerrit.wikimedia.org/r/325046 (owner: 10Merlijn van Deen) [22:49:35] (there's no standard automated thing that goes off at that time, like say an update of some kind to geoip or to varnish) [22:49:36] (03CR) 10Filippo Giunchedi: [C: 032] Whitelist /home/pdfrender/.config in firejail profile [puppet] - 10https://gerrit.wikimedia.org/r/324976 (owner: 10GWicke) [22:50:54] (03CR) 10Dzahn: "getting there, much better. to be perfect now we just need to get rid of the hardcoded IP address, it needs to change depending on which i" [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [22:51:19] bblack: ...K hmm... Yeah I'm leaning more towards some DB issue (i.e., the DB query that gets data about which campaigns and banners may be available to a user) [22:52:45] can you point me at wherever that code is? 
I'm curious (the stuff that fetches campaigns) [22:53:13] yea one sec :) [22:53:32] also related: maybe the URL that provides that list to the UA? (or is that deeply rolled up in some RL output)? [22:54:45] (03PS1) 10Gerrit Patch Uploader: puppet_compiler: include puppet-enc [puppet] - 10https://gerrit.wikimedia.org/r/325053 [22:54:47] (03CR) 10Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." [puppet] - 10https://gerrit.wikimedia.org/r/325053 (owner: 10Gerrit Patch Uploader) [22:55:44] https://github.com/wikimedia/mediawiki-extensions-CentralNotice/blob/65aee592c102cac6cd8c06b5368fb50c003670cd/includes/ChoiceDataProvider.php#L78 [22:56:01] bblack: ^ method that does the actual db query. In that class is the objectcache stuff too [22:56:21] (03PS2) 10Merlijn van Deen: puppet_compiler: include puppet-enc [puppet] - 10https://gerrit.wikimedia.org/r/325053 (owner: 10Gerrit Patch Uploader) [22:56:23] * AndyRussG stares at unused local variable [22:59:41] (03PS19) 10Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) [22:59:57] AndyRussG: [22:59:58] $start = $dbr->timestamp( time() + self::CACHE_TTL ); [22:59:58] $end = $dbr->timestamp(); [22:59:58] $conds = array( [22:59:58] 'notices.not_start <= ' . $dbr->addQuotes( $start ), [23:00:00] 'notices.not_end >= ' . 
$dbr->addQuotes( $end ), [23:00:30] (03CR) 10jenkins-bot: [V: 04-1] Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [23:00:35] unwrapping the above a bit mentally: CACHE_TTL is 1hour, and the conditions are on campaigns active to fetch [23:01:09] (03PS20) 10Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) [23:01:17] and it's basically saying "fetch campaigns whose end timestamp is >= now, and whose start timestamp is <= 1 hour from now" [23:01:37] (03PS21) 10Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) [23:01:39] bblack: yeah... [23:02:08] RECOVERY - Check systemd state on phab2001 is OK: OK - running: The system is fully operational [23:02:41] if you had a serial set of daily campaigns which were configured in the database separately in 1-day intervals, like: campaign1_start=20161201T0800,campaign1_end=20161202T0800,campsign2_start=20161202T0800,campaign2_end=20161203T0800, ... [23:02:46] (03CR) 10jenkins-bot: [V: 04-1] Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: 10Paladox) [23:03:02] that query would end up giving you a one hour gap every day, where none of the campaigns in that series are active [23:04:20] I think? [23:04:31] do we do daily serial campaigns for this stuff with aligned stop/start like that? [23:05:08] PROBLEM - Check systemd state on phab2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
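For readers following along, the pasted query conditions can be modeled in a short Python sketch (a model for reasoning about the window, not the extension's actual code; the dicts with "not_start"/"not_end" epoch seconds stand in for rows of the notices table):

```python
# Sketch of the quoted ChoiceDataProvider selection logic.
# CACHE_TTL is one hour, as discussed above.
CACHE_TTL = 3600

def fetch_active_campaigns(campaigns, now):
    """Select campaigns whose [not_start, not_end] window overlaps
    [now, now + CACHE_TTL]: campaigns active now, plus campaigns
    starting within the next hour."""
    start = now + CACHE_TTL  # mirrors $dbr->timestamp( time() + self::CACHE_TTL )
    end = now                # mirrors $dbr->timestamp()
    return [c for c in campaigns
            if c["not_start"] <= start and c["not_end"] >= end]
```

Note the asymmetry this models: a campaign is picked up as much as an hour before its start time, but is dropped as soon as its end time passes.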
[23:05:20] hmmm maybe I'm reading the logic wrong, now I'm questioning myself again
[23:05:48] (PS22) Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928)
[23:05:58] bblack: interesting thought...!! I'm afraid not, tho. No one on FR is a serial campaigner ;p
[23:06:28] still it's an interesting correlation that we're doing timestamp math there with a 1-hour constant, which is close to the window size of the daily problem...
[23:06:31] twentyafterfour mutante PROBLEM - Check systemd state on phab2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[23:06:42] Yeah I also have trouble sorting out this caching logic
[23:06:47] (CR) jenkins-bot: [V: -1] Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[23:06:49] Yeah totally a possible lead!
[23:07:11] maybe if not an outright problem with the query's timestamp logic, maybe something with the object cache after fetch and the expiry there?
[23:07:24] (PS23) Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928)
[23:08:24] (PS1) Krinkle: tests: Clean up PHPUnit tests [mediawiki-config] - https://gerrit.wikimedia.org/r/325054
[23:08:34] (CR) jenkins-bot: [V: -1] Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[23:09:00] (CR) jenkins-bot: [V: -1] tests: Clean up PHPUnit tests [mediawiki-config] - https://gerrit.wikimedia.org/r/325054 (owner: Krinkle)
[23:09:08] ok yeah the logic is right. right in the sense that it would give a 1-hour overlap to serial campaigns (at the boundary time, it would be fetching both campaigns)
[23:09:16] (PS24) Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928)
[23:09:21] yeah
[23:09:33] hmmmm
[23:10:04] But what you're saying about the object cache does sound right, i.e., that's a place to look for outages that might affect this
[23:10:08] (CR) jenkins-bot: [V: -1] Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[23:10:10] (PS2) Krinkle: tests: Clean up PHPUnit tests [mediawiki-config] - https://gerrit.wikimedia.org/r/325054
[23:10:29] I think I've asked before, but who is really on top of our prod objectcache setup?
[23:10:41] I have absolutely no idea on that
[23:11:09] maybe we're setting a TTL on those cached objects, and they expire out an hour before we're fetching replacements?
[23:11:12] or something like that?
[23:11:29] (PS25) Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928)
[23:12:14] AndyRussG: in any case, what I can look at: what is the URL the browser hits to get a list of campaigns (which would then drive BannerLoader calls). I could look at the size of outputs on that.
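The serial-campaign question settled above (gap or overlap?) can be checked with a small self-contained worked example, using bblack's hypothetical back-to-back daily campaigns (epoch-second stand-ins for the 0800-to-0800 boundaries; a model only):

```python
# Two serial campaigns: campaign1 ends exactly when campaign2 starts.
# Does the quoted window logic leave a daily gap, or an overlap?
HOUR, DAY = 3600, 86400
BOUNDARY = 10 * DAY  # the shared stop/start instant

def matches(not_start, not_end, now, ttl=HOUR):
    # The quoted conditions: not_start <= now + CACHE_TTL, not_end >= now
    return not_start <= now + ttl and not_end >= now

def active_at(now):
    c1 = matches(BOUNDARY - DAY, BOUNDARY, now)  # campaign1
    c2 = matches(BOUNDARY, BOUNDARY + DAY, now)  # campaign2
    return c1, c2
```

In the hour before the boundary both campaigns match (the 1-hour overlap); there is no instant at which neither matches, confirming the "overlap, not gap" reading at [23:09:08].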
[23:12:22] (PS1) Krinkle: build: require-dev phpunit in composer.json [mediawiki-config] - https://gerrit.wikimedia.org/r/325055
[23:12:26] (CR) jenkins-bot: [V: -1] Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[23:13:13] but if it's just bundled up with other things inside a giant RL blob, maybe not
[23:13:14] (CR) jenkins-bot: [V: -1] build: require-dev phpunit in composer.json [mediawiki-config] - https://gerrit.wikimedia.org/r/325055 (owner: Krinkle)
[23:14:21] (PS26) Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928)
[23:14:24] bblack: yeah exactly. If you have an easy way of checking that, that'd be fantastic!! But yes, it's all RL-blobbed up
[23:14:50] Though maybe there's 1 or 2 URLs with the most typical blob configuration, that we could sniff for changes in...
[23:15:15] (CR) jenkins-bot: [V: -1] Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[23:15:31] (PS27) Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928)
[23:16:23] (CR) jenkins-bot: [V: -1] Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[23:19:16] AndyRussG: just anecdotal, but if it hints somewhere... when I look at BannerLoader fetches in our 1/1000 data and compare 08:xx to 09:xx (bad vs normal)...
[23:19:16] (PS28) Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928)
[23:19:30] (CR) Paladox: "@Dzahn it keeps failing the tests, not sure why though." [puppet] - https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[23:20:20] (CR) jenkins-bot: [V: -1] Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[23:20:42] AndyRussG: it seems like the rates for banners CzechWikiCon and WMES_Wiki_Loves_Folk_2016 are relatively unaffected. Whereas the FR ones like C1617_en6C_dsk_FR (etc) are pretty heavily affected, and another affected one is WLAfrica+2016
[23:21:08] bblack: yeah.... all mystery
[23:21:40] is there something in common to the FR stuff and WLAfrica that they don't share with CzechWikiCon and the Folk one?
[23:23:02] (PS29) Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928)
[23:23:49] (CR) jenkins-bot: [V: -1] Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[23:25:26] (PS30) Paladox: Phabricator: rsync /srv/repos from iridium to phab2001 [puppet] - https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928)
[23:26:05] (CR) Hashar: "We have a composer.lock and a bunch of dependencies due to multiversion :/ See also Kunal attempt on https://gerrit.wikimedia.org/r/#/c/1" [mediawiki-config] - https://gerrit.wikimedia.org/r/325055 (owner: Krinkle)
[23:26:05] bblack: hmmm good point, lemme check. It's not geotargeting (already checked that) but it could be some other aspect of campaign config.....
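The per-banner, per-hour comparison described here (08:xx vs 09:xx in the 1/1000 sampled data) could be sketched roughly as follows. The log format is an assumption for illustration: an ISO timestamp somewhere on each line plus a `banner=<name>` query parameter on the BannerLoader URL.

```python
import re
from collections import Counter

def tally_by_hour(lines):
    """Hypothetical tally of sampled BannerLoader requests keyed by
    (hour-of-day, banner name). Field layout is assumed, not the real
    sampled-log schema."""
    counts = Counter()
    for line in lines:
        hour = re.search(r"T(\d{2}):\d{2}:\d{2}", line)
        # [\w+] because banner names like WLAfrica+2016 contain '+'
        banner = re.search(r"[?&]banner=([\w+]+)", line)
        if hour and banner:
            counts[(hour.group(1), banner.group(1))] += 1
    return counts
```

Comparing `counts[("08", name)]` against `counts[("09", name)]` for each banner would surface exactly the kind of split reported below (FR banners dropping while CzechWikiCon holds steady).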
[23:26:24] hashar: I'm working on making a new jenkins job that uses my composer-dev-fetch script.
[23:26:41] bblack: also BTW if you're interested in the RL side of this discussion, we're talking now in #wikimedia-fundraising... :)
[23:26:48] Krinkle: nice!!!
[23:26:57] Krinkle: pointed you to Kunal's changes in case you hadn't noticed it :d
[23:27:23] (CR) Krinkle: "No problem. composer-update works fine, but would overwrite vendor in Jenkins which means we're not testing it. This is the same as for wm" [mediawiki-config] - https://gerrit.wikimedia.org/r/325055 (owner: Krinkle)
[23:28:06] Krinkle: composer-dev-fetch sounds good yes
[23:28:34] feel free to amend the current job or create a new one in parallel. Or in short be bold :]
[23:29:20] off to sleep for real ! *wave*
[23:30:18] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:33:37] !log roll-restart pdfrender on scb after applying fonts.conf firejail whitelist
[23:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:34:01] gwicke: ^
[23:35:49] PROBLEM - pdfrender on scb1003 is CRITICAL: connect to address 10.64.32.153 and port 5252: Connection refused
[23:36:48] PROBLEM - pdfrender on scb2001 is CRITICAL: connect to address 10.192.32.132 and port 5252: Connection refused
[23:40:21] (CR) Alex Monk: "When Aude ran it, it produced a different result. This one includes the new wiki." [mediawiki-config] - https://gerrit.wikimedia.org/r/324918 (https://phabricator.wikimedia.org/T152201) (owner: Aude)
[23:45:15] bblack: so in answer to your question about "something in common to the FR stuff and WLAfrica that they don't share with CzechWikiCon and the Folk one?", it's not targeting anons vs. logged-ins.
[23:45:36] (which is a hard-to-spot param, but impacts user selection for campaigns)
[23:51:22] (PS31) Dzahn: Phabricator: allow rsyncing /srv/repos from active to passive server [puppet] - https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[23:51:48] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 264 bytes in 0.003 second response time
[23:53:04] ACKNOWLEDGEMENT - Check systemd state on phab2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn we need to skip this check on the non-active server
[23:54:40] (CR) Dzahn: [C: -1] "compiler says Error: Could not find class phabricator::rsync" [puppet] - https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[23:54:48] RECOVERY - pdfrender on scb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 264 bytes in 0.075 second response time
[23:55:58] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:55:59] (PS32) Paladox: Phabricator: allow rsyncing /srv/repos from active to passive server [puppet] - https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928)
[23:58:30] (CR) Dzahn: "http://puppet-compiler.wmflabs.org/4779/" [puppet] - https://gerrit.wikimedia.org/r/324796 (https://phabricator.wikimedia.org/T137928) (owner: Paladox)
[23:59:18] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures