[00:02:49] (03PS1) 10Andrew Bogott: labtest-recursor0 and labtest-ns0 [dns] - 10https://gerrit.wikimedia.org/r/264419 [00:03:53] (03PS2) 10Andrew Bogott: labtest-recursor0 and labtest-ns0 [dns] - 10https://gerrit.wikimedia.org/r/264419 [00:06:08] (03PS1) 10Andrew Bogott: Add dns to labtestservices2001 [puppet] - 10https://gerrit.wikimedia.org/r/264420 [00:06:20] (03CR) 10Yuvipanda: [C: 032] wikimetrics: Re-enable SSL redirection [puppet] - 10https://gerrit.wikimedia.org/r/264417 (owner: 10Madhuvishy) [00:06:58] (03CR) 10Andrew Bogott: [C: 032] labtest-recursor0 and labtest-ns0 [dns] - 10https://gerrit.wikimedia.org/r/264419 (owner: 10Andrew Bogott) [00:09:03] (03PS2) 10Andrew Bogott: Add dns to labtestservices2001 [puppet] - 10https://gerrit.wikimedia.org/r/264420 [00:13:52] (03CR) 10Andrew Bogott: [C: 032] Add dns to labtestservices2001 [puppet] - 10https://gerrit.wikimedia.org/r/264420 (owner: 10Andrew Bogott) [00:15:12] PROBLEM - puppet last run on mw2089 is CRITICAL: CRITICAL: puppet fail [00:17:47] (03PS1) 10Andrew Bogott: It probably doesn't work to include ldap client and server on the same host. [puppet] - 10https://gerrit.wikimedia.org/r/264423 [00:19:05] (03CR) 10Andrew Bogott: [C: 032] It probably doesn't work to include ldap client and server on the same host. [puppet] - 10https://gerrit.wikimedia.org/r/264423 (owner: 10Andrew Bogott) [00:20:33] Lots of api.php?action=mobileview exceptions happening right now [00:25:13] (03PS1) 10Yuvipanda: uwsgi: Restart daemon when config changes properly [puppet] - 10https://gerrit.wikimedia.org/r/264426 (https://phabricator.wikimedia.org/T123438) [00:25:53] (03PS2) 10Yuvipanda: uwsgi: Restart daemon when config changes properly [puppet] - 10https://gerrit.wikimedia.org/r/264426 (https://phabricator.wikimedia.org/T123438) [00:26:42] hoo: looking... "Invalid or virtual namespace -1 given." [00:26:49] yeah [00:27:31] maybe the apps somehow hitting special namespaces? 
[00:28:22] The one I'm looking at in logstash is for File:Osmosis_diagram.svg [00:28:31] which sound normal [00:28:53] unless we have some encoding bug or something.... [00:29:12] (03CR) 10Yuvipanda: [C: 032] uwsgi: Restart daemon when config changes properly [puppet] - 10https://gerrit.wikimedia.org/r/264426 (https://phabricator.wikimedia.org/T123438) (owner: 10Yuvipanda) [00:29:45] Here's a logstash search -- https://logstash.wikimedia.org/#dashboard/temp/AVJH1h_CptxhN1XaQONf [00:29:53] they are some lots of hosts and wikis [00:31:55] hoo: looks to have started when wmf10 hit wikipedias [00:32:40] 6operations, 6Parsing-Team, 10Parsoid, 6Services: Update ruthenium to Ubuntu 14.04 from Ubuntu 12.04 - https://phabricator.wikimedia.org/T122328#1938619 (10Dzahn) >>! In T122328#1938358, @ssastry wrote: > * When would you start this? I could start next Tuesday (19th) (or later) > * How long would it take?... [00:33:06] hm [00:33:55] start with that logstash query I linked and zoom out to 2 days [00:33:55] 6operations, 6Parsing-Team, 10Parsoid, 6Services: Update ruthenium to Debian jessie from Ubuntu 12.04 - https://phabricator.wikimedia.org/T122328#1938624 (10Dzahn) [00:34:49] also we get a *lot* of action=mobileview requests per unit time [00:35:11] `tail -f api.log |grep mobileview` makes my terminal lag trying to print them all [00:35:47] which also means the error is not an all the time thing. volume isn't high enough [00:39:22] PROBLEM - Auth DNS for labs pdns on labtest-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [00:39:23] spot checking the urls in the errors is looking transient. 
This one is in logstash but it's working for me -- https://en.wikipedia.org/w/api.php?action=mobileview&format=json&formatversion=2&prop=text|sections|languagecount|thumb|image|id|revision|description|lastmodified|normalizedtitle|displaytitle|protection|editable&onlyrequestedsections=1&sections=0&sectionprop=toclevel|line|anchor&noheadings=true&page=File%3AHubble%E2%80%99s_cross-se [00:39:23] ction_of_the_cosmos.jpg&thumbsize=640 [00:40:42] RECOVERY - puppet last run on mw2089 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [00:40:57] what would make Title::getNamespace() return a negative number? [00:42:08] a namespace that doesn't exist? [00:42:38] 6operations, 6Parsing-Team, 10Parsoid, 6Services: Update ruthenium to Debian jessie from Ubuntu 12.04 - https://phabricator.wikimedia.org/T122328#1938644 (10Dzahn) @ssastry the data partition on is on a software RAID across 2 physical disks. i'm afraid we'd have to copy all that data elsewhere. [00:42:40] Special Page? [00:44:01] the errors are coming from mobilefe's DOMParse hook [00:44:41] 7Puppet, 6operations, 5Patch-For-Review: uwsgi puppet module does not seem to trigger restart when config is updated - https://phabricator.wikimedia.org/T123438#1938651 (10yuvipanda) [00:44:52] 7Puppet, 6operations, 5Patch-For-Review: uwsgi puppet module does not seem to trigger restart when config is updated - https://phabricator.wikimedia.org/T123438#1938652 (10yuvipanda) 5Open>3Resolved a:3yuvipanda Was subscribed to wrong thing, fixed now. [00:45:56] which... hasn't changed for a year? [00:47:04] 6operations: setup YubiHSM and laptop at office - https://phabricator.wikimedia.org/T123818#1938666 (10Dzahn) a:3Dzahn [00:47:16] 6operations: setup YubiHSM and laptop at office - https://phabricator.wikimedia.org/T123818#1938657 (10Dzahn) [00:47:38] Why does TitleParser only have 1 method?
[00:48:06] Because we stopped working on TitleValue at some point [00:48:20] (5 minutes before it merged) [00:48:32] lol [00:49:04] $titleParser = self::getTitleParser() [00:49:04] $parts = $titleParser->splitTitleString( $dbkey, $this->getDefaultNamespace() ); [00:49:12] Type hint, return interface [00:49:17] Method used, not in interface [00:54:12] hoo: it looks ugly but not urgent. Have you filed a phab task yet? [00:54:27] Not yet [00:54:35] but someone else probably already has [00:54:49] I'll look and make one if I can't find it [00:59:32] "appInstallID" is in the url. iOS or Android? [01:01:07] The iOS app, yes [01:01:17] github search is damn convenient [01:03:24] yes, yes it it [01:03:30] *is [01:03:40] except when it's out of date [01:03:45] NO YOU HATE FREEDOM YOU SHOULD USE GERRIT [01:04:06] gerrit can't even search one repo [01:04:34] hoo: Filed as https://phabricator.wikimedia.org/T123820 [01:04:43] thanks [01:04:57] YuviPanda: I'd love to use git-grep but E_TOOMANYREPOS [01:05:17] yeah [01:06:03] for MediaWiki things I often just use the stuff that's on terbium/ tin [01:08:17] i use https://github.com/ggreer/the_silver_searcher [01:08:38] lies! nothing is faster than ack [01:08:41] * bd808 looks into it [01:08:49] nobody is using diffusion? [01:09:10] use phab's search. if it only found things [01:09:27] lol [01:09:30] what does phab use as backend? Elastic? [01:10:26] 6operations, 6Parsing-Team, 10Parsoid, 6Services: Update ruthenium to Debian jessie from Ubuntu 12.04 - https://phabricator.wikimedia.org/T122328#1938709 (10ssastry) @dzahn yes, T118778 is the puppetization task .. one of the last pieces to be reviewed and merged is https://gerrit.wikimedia.org/r/#/c/26403... [01:11:01] it can but I'm not sure that's what we are using? ostriches worked on it early after the switch and as I recall he stopped trying at some point [01:11:52] are you searching commit messages or code itself? 
[01:12:23] searching messages works for me on gerrit [01:12:45] I'm gernerally searching code [01:14:19] this looks simple and promising -- https://github.com/coryfklein/multi [01:14:34] git grep in a foreach wrapper [01:14:41] http://git.wikimedia.org/lucene/ :p [01:14:47] as long as gitblit is still alive [01:14:55] eh, it's not , nvm [01:15:04] but see the /lucene/ link [01:15:07] "none of your repositories are configured for Lucene indexing" [01:15:10] :p [01:15:45] but yes, that would be a github killer for me [01:15:51] Probably better... gitblit already is a little... troublesome [01:16:06] well, i tried: https://phabricator.wikimedia.org/T109004 [01:16:08] Having that in Phab would be awesome, though [01:16:26] There's upstream tasks about this for phab [01:18:02] Oh phabs elastic backend was pretty cruddy we tried it. Didn't have time to fix it. [01:34:50] 6operations, 6Parsing-Team, 10Parsoid, 6Services: Update ruthenium to Debian jessie from Ubuntu 12.04 - https://phabricator.wikimedia.org/T122328#1938728 (10GWicke) I think my home directory on ruthenium doesn't contain anything that can't be re-created, and I don't have access to this host any more anyway. [02:05:13] 6operations, 10Wikimedia-Mailing-lists: Upgrade Mailman to version 3 - https://phabricator.wikimedia.org/T52864#1938756 (10JanZerebecki) >>! In T52864#1938389, @RobLa-WMF wrote: > Update? Upstream created a new patch level release https://gitlab.com/mailman/mailman/blob/3.0.1/src/mailman/docs/NEWS.rst . No ne... 
[02:27:21] PROBLEM - puppet last run on mw2144 is CRITICAL: CRITICAL: puppet fail [02:27:52] PROBLEM - HHVM rendering on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:11] PROBLEM - Apache HTTP on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:29:02] PROBLEM - HHVM rendering on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:29:52] RECOVERY - HHVM rendering on mw1135 is OK: HTTP OK: HTTP/1.1 200 OK - 65816 bytes in 1.677 second response time [02:29:52] PROBLEM - salt-minion processes on mw1120 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:30:01] PROBLEM - nutcracker process on mw1120 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:30:11] PROBLEM - HHVM processes on mw1120 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:30:11] PROBLEM - Disk space on mw1120 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:30:21] PROBLEM - DPKG on mw1120 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:30:41] PROBLEM - Check size of conntrack table on mw1120 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:30:41] PROBLEM - puppet last run on mw1120 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:31:01] PROBLEM - RAID on mw1120 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:31:12] PROBLEM - dhclient process on mw1120 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:31:22] PROBLEM - SSH on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:31:42] PROBLEM - configured eth on mw1120 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:31:53] PROBLEM - nutcracker port on mw1120 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:34:22] RECOVERY - Disk space on mw1120 is OK: DISK OK [02:34:23] RECOVERY - HHVM processes on mw1120 is OK: PROCS OK: 6 processes with command name hhvm [02:34:41] RECOVERY - DPKG on mw1120 is OK: All packages OK [02:35:27] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 18m 55s) [02:35:32] RECOVERY - SSH on mw1120 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0) [02:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:35:52] RECOVERY - configured eth on mw1120 is OK: OK - interfaces up [02:35:53] RECOVERY - nutcracker port on mw1120 is OK: TCP OK - 0.000 second response time on port 11212 [02:36:12] RECOVERY - salt-minion processes on mw1120 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [02:36:22] RECOVERY - nutcracker process on mw1120 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [02:36:52] RECOVERY - Check size of conntrack table on mw1120 is OK: OK: nf_conntrack is 0 % full [02:37:11] RECOVERY - RAID on mw1120 is OK: OK: no RAID installed [02:37:22] RECOVERY - dhclient process on mw1120 is OK: PROCS OK: 0 processes with command name dhclient [02:54:41] RECOVERY - puppet last run on mw2144 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:55:25] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.10) (duration: 08m 35s) [02:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:57:42] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [03:02:21] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Jan 16 03:02:21 UTC 2016 (duration 6m 57s) [03:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:34:55] (03PS1) 10Anomie: Centralize and add rights and grants in preparation for grants moving into core [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/264437 [03:34:57] (03PS1) 10Anomie: Remove $wgMWOAuthGrantPermissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264438 [03:45:22] RECOVERY - Disk space on stat1002 is OK: DISK OK [04:06:03] PROBLEM - puppet last run on db2063 is CRITICAL: CRITICAL: puppet fail [04:11:23] (03CR) 10Gergő Tisza: Centralize and add rights and grants in preparation for grants moving into core (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264437 (owner: 10Anomie) [04:11:55] (03CR) 10Gergő Tisza: [C: 031] Remove $wgMWOAuthGrantPermissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264438 (owner: 10Anomie) [04:17:29] (03CR) 10Anomie: Centralize and add rights and grants in preparation for grants moving into core (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264437 (owner: 10Anomie) [04:17:36] (03PS2) 10Anomie: Remove $wgMWOAuthGrantPermissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264438 [04:17:38] (03PS2) 10Anomie: Centralize and add rights and grants in preparation for grants moving into core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264437 [04:18:28] (03CR) 10Gergő Tisza: "Now that grants are in core, shouldn't FlaggedRevs/TitleBlacklist/Checkuser/Wikibase set those grants themselves? 
(Although they would sti" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264437 (owner: 10Anomie) [04:20:17] (03CR) 10Gergő Tisza: [C: 031] Centralize and add rights and grants in preparation for grants moving into core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264437 (owner: 10Anomie) [04:33:31] RECOVERY - puppet last run on db2063 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:39:42] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: puppet fail [04:52:10] (03PS1) 10Anomie: Install libbytes-random-secure-perl on tool labs [puppet] - 10https://gerrit.wikimedia.org/r/264440 (https://phabricator.wikimedia.org/T123824) [05:07:02] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:28:51] PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: puppet fail [05:56:02] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:31:12] PROBLEM - puppet last run on mw1060 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:33] PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:51] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: puppet fail [06:32:31] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:42] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:52] PROBLEM - puppet last run on mw1112 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:02] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:42] PROBLEM - puppet last run on mw2191 is CRITICAL: CRITICAL: Puppet has 1 failures [06:45:38] (03PS5) 10ArielGlenn: dumps: set up but don't enable script for dumps to run from cron [puppet] - 10https://gerrit.wikimedia.org/r/263807 (https://phabricator.wikimedia.org/T107750) [06:55:51] RECOVERY - puppet last run on mw1112 is OK: 
OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:56:13] RECOVERY - puppet last run on mw1060 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:56:41] RECOVERY - puppet last run on wtp2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:52] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:57:32] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:51] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:58:03] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:02:01] RECOVERY - puppet last run on mw2191 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:43:44] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: puppet fail [09:09:03] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [09:14:32] PROBLEM - RAID on ms-be2015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:17:52] PROBLEM - very high load average likely xfs on ms-be2015 is CRITICAL: CRITICAL - load average: 167.10, 152.06, 75.34 [09:18:32] RECOVERY - RAID on ms-be2015 is OK: OK: optimal, 14 logical, 14 physical [09:22:02] PROBLEM - Disk space on ms-be2015 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdl1 is not accessible: Input/output error [09:22:02] RECOVERY - very high load average likely xfs on ms-be2015 is OK: OK - load average: 20.13, 72.32, 59.95 [09:35:52] PROBLEM - puppet last run on ms-be2015 is CRITICAL: CRITICAL: Puppet has 1 failures [10:02:12] RECOVERY - Disk space on ms-be2015 is OK: DISK OK [10:58:42] PROBLEM - puppet last run on mw2157 is CRITICAL: CRITICAL: puppet fail [11:19:28] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-Requests: language code change for Samogitian: "bat-smg" to "sgs" - https://phabricator.wikimedia.org/T27522#1938953 (10Zordsdavini) The best way it would be to rename project to sgs.wikipedia.org and still have redirect from bat-smg.wikipedia.org [11:24:20] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-Requests: language code change for Samogitian: "bat-smg" to "sgs" - https://phabricator.wikimedia.org/T27522#1938959 (10Zordsdavini) If there is very big problem to rename. Please add alias redirect temporary from sgs.wikipedia.org to bat-smg.wikiped... [11:26:42] RECOVERY - puppet last run on mw2157 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:41:43] 6operations, 10Continuous-Integration-Infrastructure, 7HHVM, 7WorkType-Maintenance: HHVM Jenkins job throw: Unable to set CoreFileSize to 8589934592: Operation not permitted (1) - https://phabricator.wikimedia.org/T78799#1938968 (10Nemo_bis) >>! In T78799#1923965, @hashar wrote: > We always had `hvm.debug.... 
[13:09:52] PROBLEM - puppet last run on mw1148 is CRITICAL: CRITICAL: Puppet has 45 failures [13:37:13] RECOVERY - puppet last run on mw1148 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [13:46:23] 6operations, 10ops-codfw: ms-be2015.codfw.wmnet: slot=11 dev=sdl failed - https://phabricator.wikimedia.org/T123830#1939033 (10Aklapper) [14:11:01] PROBLEM - puppet last run on mw1148 is CRITICAL: CRITICAL: Puppet has 3 failures [14:36:42] RECOVERY - puppet last run on mw1148 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [14:49:44] (03PS1) 10Pmlineditor: Add Category to $wgNamespacesWithSubpages on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264460 (https://phabricator.wikimedia.org/T121985) [14:58:07] (03PS1) 10Suriyaa Kudo: Correct HTML code for WMF image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264461 [14:59:28] (03CR) 10Luke081515: [C: 031] Add Category to $wgNamespacesWithSubpages on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264460 (https://phabricator.wikimedia.org/T121985) (owner: 10Pmlineditor) [14:59:47] (03PS2) 10Luke081515: Add Category to $wgNamespacesWithSubpages on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264460 (https://phabricator.wikimedia.org/T121985) (owner: 10Pmlineditor) [15:01:24] (03CR) 10Suriyaa Kudo: [C: 031] Correct HTML code for WMF image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264461 (owner: 10Suriyaa Kudo) [15:01:39] 6operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-Requests: language code change for Samogitian: "bat-smg" to "sgs" - https://phabricator.wikimedia.org/T27522#1939083 (10Krenair) There should definitely be redirects, yes. 
[15:06:01] (03PS2) 10Luke081515: Add namespace aliases for English Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264066 (https://phabricator.wikimedia.org/T123187) (owner: 10Pmlineditor) [15:06:11] (03CR) 10Luke081515: [C: 031] Add namespace aliases for English Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264066 (https://phabricator.wikimedia.org/T123187) (owner: 10Pmlineditor) [15:22:53] !log restarted mysql on bohrium because it had stopped working (probably due to piwik performance problems) [15:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:57:35] !log piwik is taking events on bohrium but the interface can't complete the queries to load because there's too much data. Mysql is maxing the CPU but it seems ok for now, will check again Monday. [15:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:21:23] (03Abandoned) 10Milimetric: Revert "Blacklist MobileWebSectionUsage from mysql" [puppet] - 10https://gerrit.wikimedia.org/r/263549 (owner: 10Milimetric) [16:23:10] (03PS1) 10Milimetric: Remove MobileWebSectionUsage from blacklist [puppet] - 10https://gerrit.wikimedia.org/r/264465 [16:35:21] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [16:39:32] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:50:11] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:58:41] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:07:12] (03CR) 10ArielGlenn: [C: 032] dumps multi stream bz2 files: fix corruption in writing end of BZ2 stream [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/264337 (https://phabricator.wikimedia.org/T121348) (owner: 10ArielGlenn) [17:07:12] PROBLEM - Misc 
HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [17:09:36] (03PS1) 10KartikMistry: Add missing entry to disable file upload on guwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264469 [17:11:31] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:32:41] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [17:43:02] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:59:52] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [18:12:32] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:18:51] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:39:22] PROBLEM - puppet last run on wtp2018 is CRITICAL: CRITICAL: puppet fail [18:40:31] PROBLEM - puppet last run on mw1148 is CRITICAL: CRITICAL: Puppet has 18 failures [19:05:11] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:06:42] RECOVERY - puppet last run on wtp2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:07:12] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:07:52] RECOVERY - puppet last run on mw1148 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:12:32] 6operations, 10Continuous-Integration-Infrastructure, 7HHVM, 7WorkType-Maintenance: HHVM Jenkins job throw: Unable to set CoreFileSize to 8589934592: Operation not permitted (1) - https://phabricator.wikimedia.org/T78799#1939173 (10bd808) >>! In T78799#1938968, @Nemo_bis wrote: >>>! 
In T78799#1923965, @has... [19:24:20] (03PS1) 10Andrew Bogott: Remove reference to labs_nova_controller_other [puppet] - 10https://gerrit.wikimedia.org/r/264473 (https://phabricator.wikimedia.org/T123790) [19:24:22] (03PS1) 10Andrew Bogott: Rename labcontrol2001 to labtestweb2001 [puppet] - 10https://gerrit.wikimedia.org/r/264474 (https://phabricator.wikimedia.org/T123790) [19:26:39] 6operations, 10ops-codfw, 6Labs, 5Patch-For-Review: Update tag and racktables for labcontrol2001: renamed to labtestweb2001 - https://phabricator.wikimedia.org/T123841#1939187 (10Andrew) 3NEW [19:29:17] (03CR) 10Andrew Bogott: [C: 032] Remove reference to labs_nova_controller_other [puppet] - 10https://gerrit.wikimedia.org/r/264473 (https://phabricator.wikimedia.org/T123790) (owner: 10Andrew Bogott) [19:29:24] (03PS1) 10Andrew Bogott: Remove labs-puppetmaster-codfw alias [dns] - 10https://gerrit.wikimedia.org/r/264476 [19:29:26] (03PS1) 10Andrew Bogott: Rename labcontrol2001 to labtestweb2001 [dns] - 10https://gerrit.wikimedia.org/r/264477 (https://phabricator.wikimedia.org/T123790) [19:34:42] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [19:35:40] (03CR) 10Andrew Bogott: [C: 032] Rename labcontrol2001 to labtestweb2001 [puppet] - 10https://gerrit.wikimedia.org/r/264474 (https://phabricator.wikimedia.org/T123790) (owner: 10Andrew Bogott) [19:36:18] (03PS2) 10Andrew Bogott: Remove labs-puppetmaster-codfw alias [dns] - 10https://gerrit.wikimedia.org/r/264476 [19:38:19] (03CR) 10Andrew Bogott: [C: 032] Remove labs-puppetmaster-codfw alias [dns] - 10https://gerrit.wikimedia.org/r/264476 (owner: 10Andrew Bogott) [19:38:32] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0] [19:38:57] (03PS2) 10Andrew Bogott: Rename labcontrol2001 to labtestweb2001 [dns] - 10https://gerrit.wikimedia.org/r/264477 (https://phabricator.wikimedia.org/T123790) 
[19:41:01] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:41:21] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: puppet fail [19:41:32] PROBLEM - puppet last run on mw1148 is CRITICAL: CRITICAL: Puppet has 1 failures [19:45:32] (03PS3) 10Andrew Bogott: Rename labcontrol2001 to labtestweb2001 [dns] - 10https://gerrit.wikimedia.org/r/264477 (https://phabricator.wikimedia.org/T123790) [19:46:36] (03CR) 10Andrew Bogott: [C: 032] Rename labcontrol2001 to labtestweb2001 [dns] - 10https://gerrit.wikimedia.org/r/264477 (https://phabricator.wikimedia.org/T123790) (owner: 10Andrew Bogott) [19:51:02] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [19:52:47] !log renaming and reimaging labcontrol2001 -> labtestweb2001 [19:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:53:32] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:55:41] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:04:16] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures [20:07:36] RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [20:09:36] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [20:10:56] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [20:11:17] PROBLEM - salt-minion processes on tin is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [20:16:07] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0] [20:19:08] RECOVERY - Misc HTTP 5xx 
reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:19:57] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:24:27] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [20:27:07] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:29:37] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [20:31:44] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [20:33:04] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:34:54] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:35:04] PROBLEM - HHVM rendering on mw1148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:36:05] PROBLEM - Apache HTTP on mw1148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:36:15] PROBLEM - Check size of conntrack table on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:36:35] PROBLEM - DPKG on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:36:54] PROBLEM - configured eth on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:37:14] PROBLEM - Disk space on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:37:35] PROBLEM - HHVM processes on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:37:44] PROBLEM - dhclient process on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:37:45] PROBLEM - RAID on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:38:05] RECOVERY - salt-minion processes on tin is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:38:05] PROBLEM - SSH on mw1148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:24] PROBLEM - nutcracker port on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:38:44] PROBLEM - nutcracker process on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:39:16] PROBLEM - salt-minion processes on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:44:30] 6operations, 10Continuous-Integration-Infrastructure, 7HHVM, 7WorkType-Maintenance: HHVM Jenkins job throw: Unable to set CoreFileSize to 8589934592: Operation not permitted (1) - https://phabricator.wikimedia.org/T78799#1939272 (10Nemo_bis) So in other words that's useless for anyone running unit tests ot... [20:49:35] RECOVERY - Disk space on mw1148 is OK: DISK OK [20:49:35] RECOVERY - salt-minion processes on mw1148 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:50:05] RECOVERY - HHVM processes on mw1148 is OK: PROCS OK: 6 processes with command name hhvm [20:50:05] RECOVERY - dhclient process on mw1148 is OK: PROCS OK: 0 processes with command name dhclient [20:50:15] RECOVERY - RAID on mw1148 is OK: OK: no RAID installed [20:50:34] RECOVERY - SSH on mw1148 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0) [20:50:45] RECOVERY - Check size of conntrack table on mw1148 is OK: OK: nf_conntrack is 0 % full [20:50:45] RECOVERY - nutcracker port on mw1148 is OK: TCP OK - 0.000 second response time on port 11212 [20:51:05] RECOVERY - nutcracker process on mw1148 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [20:51:05] RECOVERY - DPKG on mw1148 is OK: All packages OK [20:51:24] RECOVERY - configured eth on mw1148 is OK: OK - interfaces up [20:53:25] RECOVERY - puppet last run on mw1148 is OK: OK: Puppet is currently enabled, last run 1 
minute ago with 0 failures [21:52:04] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:56:05] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [21:57:28] (PS1) Nemo bis: [Planet Wikimedia] Add Andrew Gray and William Beutler [puppet] - https://gerrit.wikimedia.org/r/264543 [22:02:45] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:04:44] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [22:13:14] (PS1) Nemo bis: [Planet Wikimedia] Update Pau Giner domain [puppet] - https://gerrit.wikimedia.org/r/264544 [22:15:24] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [22:15:55] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [22:19:25] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:23:54] PROBLEM - restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:25:44] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [22:25:45] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:27:45] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [22:32:14] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:32:24] RECOVERY - restbase endpoints health on restbase1003 is OK: All endpoints are healthy [22:32:26] PROBLEM - restbase endpoints health on restbase1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:34:16] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:34:54] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:36:34] RECOVERY - restbase endpoints health on restbase1001 is OK: All endpoints are healthy [22:36:44] PROBLEM - restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:38:25] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:38:34] RECOVERY - restbase endpoints health on restbase1004 is OK: All endpoints are healthy [22:38:44] PROBLEM - restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:40:44] RECOVERY - restbase endpoints health on restbase1003 is OK: All endpoints are healthy [22:42:34] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [22:44:05] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:44:56] PROBLEM - restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:46:04] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [22:46:45] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [22:46:55] RECOVERY - restbase endpoints health on restbase1004 is OK: All endpoints are healthy [22:47:05] PROBLEM - restbase endpoints health on restbase1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:48:55] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:49:44] (PS1) Andrew Bogott: Turn on 'send_puppet_failure_emails' in testlabs [puppet] - https://gerrit.wikimedia.org/r/264545 [22:50:55] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [22:51:05] RECOVERY - restbase endpoints health on restbase1003 is OK: All endpoints are healthy [22:51:29] (CR) Andrew Bogott: [C: 2] Turn on 'send_puppet_failure_emails' in testlabs [puppet] - https://gerrit.wikimedia.org/r/264545 (owner: Andrew Bogott) [22:53:05] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:53:24] PROBLEM - restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:55:25] PROBLEM - restbase endpoints health on restbase1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:56:26] (CR) Aklapper: "@Faidon: Is this still wanted? If so, should this get merged? Asking as this has been rotting here for more than a year without a review.." [dns] - https://gerrit.wikimedia.org/r/143762 (owner: Faidon Liambotis) [22:56:35] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:58:34] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [23:01:54] RECOVERY - restbase endpoints health on restbase1001 is OK: All endpoints are healthy [23:05:06] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:07:05] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [23:07:55] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:11:15] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
[23:13:25] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:16:14] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [23:16:25] RECOVERY - restbase endpoints health on restbase1004 is OK: All endpoints are healthy [23:19:44] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [23:22:34] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [23:22:45] PROBLEM - restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:24:45] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:24:45] RECOVERY - restbase endpoints health on restbase1004 is OK: All endpoints are healthy [23:28:05] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:28:16] PROBLEM - puppet last run on mw2091 is CRITICAL: CRITICAL: puppet fail [23:31:15] PROBLEM - restbase endpoints health on restbase1003 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.159, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [23:32:05] PROBLEM - restbase endpoints health on restbase1005 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.48.99, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [23:32:35] PROBLEM - Restbase root url on restbase1003 is CRITICAL: Connection refused [23:33:14] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [23:33:54] PROBLEM - Restbase root url on restbase1005 is CRITICAL: Connection refused [23:36:44] RECOVERY - restbase endpoints health on restbase1009 is OK: All 
endpoints are healthy [23:36:54] RECOVERY - Restbase root url on restbase1003 is OK: HTTP OK: HTTP/1.1 200 - 15214 bytes in 0.006 second response time [23:37:35] RECOVERY - restbase endpoints health on restbase1003 is OK: All endpoints are healthy [23:43:04] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:44:54] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [23:50:34] RECOVERY - Restbase root url on restbase1005 is OK: HTTP OK: HTTP/1.1 200 - 15214 bytes in 0.004 second response time [23:50:54] RECOVERY - restbase endpoints health on restbase1005 is OK: All endpoints are healthy [23:52:06] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:53:35] RECOVERY - puppet last run on mw2091 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [23:54:06] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy