[00:01:34] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [00:02:09] 06Operations, 10ops-codfw, 10media-storage, 13Patch-For-Review: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T136630#2387712 (10Papaul) @fgiunchedi the other ms-be* systems have Trusty installed are we also installing Trusty on the new systems of Jessie? Thanks [00:02:14] PROBLEM - Apache HTTP on mw1133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:03:13] PROBLEM - HHVM rendering on mw1133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:03:20] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2387714 (10Danny_B) 05Open>03Resolved Rules are written. Deploying them is another task. [00:03:54] PROBLEM - nutcracker port on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:04:15] PROBLEM - nutcracker process on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:04:24] PROBLEM - Check size of conntrack table on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:04:33] PROBLEM - configured eth on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:04:43] PROBLEM - DPKG on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:04:53] PROBLEM - puppet last run on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:05:03] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2387718 (10Paladox) The above works. git.wmflabs.org [00:05:03] PROBLEM - Disk space on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:05:04] PROBLEM - salt-minion processes on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:05:34] PROBLEM - dhclient process on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:05:44] PROBLEM - SSH on mw1133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:06:03] PROBLEM - HHVM processes on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:09:43] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2387721 (10Paladox) [00:14:35] (03PS1) 10Papaul: DHCP: Add MAC address entries for ms-be202[2-7] Bug:T136630 [puppet] - 10https://gerrit.wikimedia.org/r/294863 (https://phabricator.wikimedia.org/T136630) [00:19:10] (03PS2) 10Papaul: DHCP: Add MAC address entries for ms-be202[2-7] Bug:T136630 [puppet] - 10https://gerrit.wikimedia.org/r/294863 (https://phabricator.wikimedia.org/T136630) [00:27:34] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:28:33] RECOVERY - DPKG on mw1133 is OK: All packages OK [00:28:34] RECOVERY - puppet last run on mw1133 is OK: OK: Puppet is currently enabled, last run 57 minutes ago with 0 failures [00:28:44] RECOVERY - Disk space on mw1133 is OK: DISK OK [00:28:44] RECOVERY - salt-minion processes on mw1133 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:29:13] (03PS1) 10Papaul: adding install params for ms-be202[2-7] Bug:T136630 [puppet] - 10https://gerrit.wikimedia.org/r/294866 (https://phabricator.wikimedia.org/T136630) [00:29:14] RECOVERY - dhclient process on mw1133 is OK: PROCS OK: 0 processes with command name dhclient [00:29:26] RECOVERY - SSH on mw1133 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0) [00:29:43] RECOVERY - HHVM processes on mw1133 is OK: PROCS OK: 6 processes with command name hhvm [00:29:45] RECOVERY - nutcracker port on mw1133 is OK: TCP OK - 0.000 second response time on port 11212 [00:30:14] 
RECOVERY - nutcracker process on mw1133 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [00:30:24] RECOVERY - Check size of conntrack table on mw1133 is OK: OK: nf_conntrack is 0 % full [00:30:24] RECOVERY - configured eth on mw1133 is OK: OK - interfaces up [00:32:47] 06Operations, 10ops-codfw, 10media-storage, 13Patch-For-Review: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T136630#2387770 (10Papaul) [00:37:24] PROBLEM - puppet last run on mw1133 is CRITICAL: CRITICAL: Puppet has 109 failures [01:01:58] (03PS1) 1020after4: Rewrite rules for git.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/294867 (https://phabricator.wikimedia.org/T137224) [01:03:37] (03PS2) 1020after4: Rewrite rules for git.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/294867 (https://phabricator.wikimedia.org/T137224) [01:20:27] (03PS2) 10Yuvipanda: Fix default 'type' behavior [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294844 [01:20:29] (03PS5) 10Yuvipanda: Bump deb version [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294741 [02:00:10] (03PS6) 10Yuvipanda: Bump deb version [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294741 [02:00:12] (03PS1) 10Yuvipanda: Cleanup backend: field too when killing webservices [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294872 [02:03:51] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [02:24:34] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.6) (duration: 09m 46s) [02:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:26:32] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [02:31:00] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Jun 17 02:31:00 UTC 2016 (duration 6m 26s) [02:31:04] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:05:47] PROBLEM - HHVM jobrunner on mw2248 is CRITICAL: Connection timed out [03:05:47] PROBLEM - HHVM jobrunner on mw2249 is CRITICAL: Connection timed out [03:05:47] PROBLEM - HHVM jobrunner on mw2250 is CRITICAL: Connection timed out [03:05:56] PROBLEM - nutcracker process on mw2250 is CRITICAL: Timeout while attempting connection [03:05:56] PROBLEM - nutcracker process on mw2248 is CRITICAL: Timeout while attempting connection [03:05:56] PROBLEM - nutcracker process on mw2249 is CRITICAL: Timeout while attempting connection [03:06:16] PROBLEM - puppet last run on mw2250 is CRITICAL: Timeout while attempting connection [03:06:16] PROBLEM - puppet last run on mw2249 is CRITICAL: Timeout while attempting connection [03:06:16] PROBLEM - puppet last run on mw2248 is CRITICAL: Timeout while attempting connection [03:06:46] PROBLEM - salt-minion processes on mw2249 is CRITICAL: Timeout while attempting connection [03:06:46] PROBLEM - salt-minion processes on mw2250 is CRITICAL: Timeout while attempting connection [03:06:46] PROBLEM - salt-minion processes on mw2248 is CRITICAL: Timeout while attempting connection [03:07:06] PROBLEM - Check size of conntrack table on mw2248 is CRITICAL: Timeout while attempting connection [03:07:06] PROBLEM - Check size of conntrack table on mw2249 is CRITICAL: Timeout while attempting connection [03:07:06] PROBLEM - Check size of conntrack table on mw2250 is CRITICAL: Timeout while attempting connection [03:07:16] PROBLEM - DPKG on mw2249 is CRITICAL: Timeout while attempting connection [03:07:16] PROBLEM - DPKG on mw2250 is CRITICAL: Timeout while attempting connection [03:07:16] PROBLEM - DPKG on mw2248 is CRITICAL: Timeout while attempting connection [03:07:28] PROBLEM - Disk space on mw2248 is CRITICAL: Timeout while attempting connection [03:07:28] PROBLEM - Disk space on mw2250 is CRITICAL: Timeout while attempting connection [03:07:28] PROBLEM - Disk space on 
mw2249 is CRITICAL: Timeout while attempting connection [03:08:08] PROBLEM - MD RAID on mw2248 is CRITICAL: Timeout while attempting connection [03:08:08] PROBLEM - MD RAID on mw2249 is CRITICAL: Timeout while attempting connection [03:08:08] PROBLEM - MD RAID on mw2250 is CRITICAL: Timeout while attempting connection [03:08:57] PROBLEM - configured eth on mw2250 is CRITICAL: Timeout while attempting connection [03:08:57] PROBLEM - configured eth on mw2248 is CRITICAL: Timeout while attempting connection [03:08:57] PROBLEM - configured eth on mw2249 is CRITICAL: Timeout while attempting connection [03:09:16] PROBLEM - dhclient process on mw2249 is CRITICAL: Timeout while attempting connection [03:09:16] PROBLEM - dhclient process on mw2248 is CRITICAL: Timeout while attempting connection [03:09:16] PROBLEM - dhclient process on mw2250 is CRITICAL: Timeout while attempting connection [03:09:26] PROBLEM - mediawiki-installation DSH group on mw2248 is CRITICAL: Host mw2248 is not in mediawiki-installation dsh group [03:09:26] PROBLEM - mediawiki-installation DSH group on mw2249 is CRITICAL: Host mw2249 is not in mediawiki-installation dsh group [03:09:26] PROBLEM - mediawiki-installation DSH group on mw2250 is CRITICAL: Host mw2250 is not in mediawiki-installation dsh group [03:09:47] PROBLEM - nutcracker port on mw2248 is CRITICAL: Timeout while attempting connection [03:09:47] PROBLEM - nutcracker port on mw2249 is CRITICAL: Timeout while attempting connection [03:09:47] PROBLEM - nutcracker port on mw2250 is CRITICAL: Timeout while attempting connection [03:19:20] (03PS1) 10KartikMistry: Deploy Compact Language Links as default (Stage 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294874 (https://phabricator.wikimedia.org/T136677) [03:30:26] RECOVERY - Disk space on mw2248 is OK: DISK OK [03:30:37] RECOVERY - nutcracker port on mw2248 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [03:30:56] RECOVERY - nutcracker process on mw2248 is 
OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [03:31:07] RECOVERY - MD RAID on mw2248 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [03:31:47] RECOVERY - salt-minion processes on mw2248 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:31:47] RECOVERY - salt-minion processes on mw2249 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:32:16] RECOVERY - configured eth on mw2249 is OK: OK - interfaces up [03:32:16] RECOVERY - configured eth on mw2248 is OK: OK - interfaces up [03:32:47] RECOVERY - dhclient process on mw2248 is OK: PROCS OK: 0 processes with command name dhclient [03:33:26] RECOVERY - Check size of conntrack table on mw2249 is OK: OK: nf_conntrack is 0 % full [03:33:36] RECOVERY - DPKG on mw2249 is OK: All packages OK [03:33:48] RECOVERY - Disk space on mw2249 is OK: DISK OK [03:34:26] RECOVERY - MD RAID on mw2249 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [03:34:37] RECOVERY - Check size of conntrack table on mw2248 is OK: OK: nf_conntrack is 0 % full [03:34:56] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2387881 (10KartikMistry) [03:34:56] RECOVERY - dhclient process on mw2249 is OK: PROCS OK: 0 processes with command name dhclient [03:34:56] RECOVERY - DPKG on mw2248 is OK: All packages OK [03:35:07] RECOVERY - nutcracker port on mw2249 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [03:35:27] RECOVERY - nutcracker process on mw2249 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [03:39:28] RECOVERY - HHVM jobrunner on mw2248 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.089 second response time [03:40:16] RECOVERY - HHVM jobrunner on mw2249 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.101 second 
response time [03:40:21] (03CR) 10Yuvipanda: [C: 032] Exit when given unsupported parameters [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294843 (owner: 10Yuvipanda) [03:40:33] (03CR) 10Yuvipanda: [C: 032] Fix default 'type' behavior [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294844 (owner: 10Yuvipanda) [03:40:46] (03CR) 10Yuvipanda: [C: 032] Cleanup backend: field too when killing webservices [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294872 (owner: 10Yuvipanda) [03:40:59] (03CR) 10Yuvipanda: [C: 032] Bump deb version [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294741 (owner: 10Yuvipanda) [03:41:58] PROBLEM - NTP on mw2248 is CRITICAL: NTP CRITICAL: Offset unknown [03:41:58] PROBLEM - NTP on mw2249 is CRITICAL: NTP CRITICAL: Offset unknown [03:42:07] PROBLEM - NTP on mw2250 is CRITICAL: NTP CRITICAL: Offset unknown [03:43:16] RECOVERY - MD RAID on mw2250 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [03:43:27] RECOVERY - configured eth on mw2250 is OK: OK - interfaces up [03:43:46] RECOVERY - dhclient process on mw2250 is OK: PROCS OK: 0 processes with command name dhclient [03:43:57] RECOVERY - Check size of conntrack table on mw2250 is OK: OK: nf_conntrack is 0 % full [03:44:07] RECOVERY - nutcracker port on mw2250 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [03:44:26] RECOVERY - nutcracker process on mw2250 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [03:44:28] RECOVERY - Disk space on mw2250 is OK: DISK OK [03:44:47] RECOVERY - salt-minion processes on mw2250 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:46:08] RECOVERY - NTP on mw2248 is OK: NTP OK: Offset 0.002550601959 secs [03:46:17] RECOVERY - DPKG on mw2250 is OK: All packages OK [03:48:17] RECOVERY - NTP on mw2249 is OK: NTP OK: Offset -6.330013275e-05 secs [03:51:07] RECOVERY - HHVM jobrunner on mw2250 is OK: HTTP OK: 
HTTP/1.1 200 OK - 202 bytes in 0.082 second response time [03:58:46] RECOVERY - NTP on mw2250 is OK: NTP OK: Offset -0.001456141472 secs [04:26:21] 06Operations, 10Android-app-feature-Feeds, 10Mobile-Content-Service, 10RESTBase, and 3 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2387918 (10ori) [04:30:37] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [04:57:08] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:57:50] 06Operations, 10Android-app-feature-Feeds, 10Mobile-Content-Service, 10RESTBase, and 3 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2387932 (10GWicke) Varnish -> RB GET p95 latency: {F4173733} [05:35:44] (03CR) 10Giuseppe Lavagetto: "@bblack it's my full intention to do that; I made this behave like the standard class if one doesn't specify a reason in enabling puppet." [puppet] - 10https://gerrit.wikimedia.org/r/294694 (owner: 10Giuseppe Lavagetto) [05:41:33] (03PS2) 10Giuseppe Lavagetto: conftool: add new jessie api appservers [puppet] - 10https://gerrit.wikimedia.org/r/294737 [05:43:34] (03PS1) 10Yuvipanda: Fix terrible typo in status check for restarts [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294877 [05:44:21] (03PS1) 10Yuvipanda: Bump deb version [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294878 [05:48:58] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:51:07] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[05:51:19] (03CR) 10Yuvipanda: [C: 032] Fix terrible typo in status check for restarts [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294877 (owner: 10Yuvipanda) [05:51:30] (03CR) 10Yuvipanda: [C: 032] Bump deb version [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/294878 (owner: 10Yuvipanda) [05:53:27] PROBLEM - zotero on sca1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:55:36] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool: add new jessie api appservers [puppet] - 10https://gerrit.wikimedia.org/r/294737 (owner: 10Giuseppe Lavagetto) [05:57:47] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api: /api (web site in alternative language) is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api: /api (Zot [05:58:31] !log root@palladium conftool action : set/pooled=yes:weight=20; selector: cluster=api_appserver,name=mw127.* [05:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:03:46] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api: /api (web site in alternative language) is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api: /api (Zot [06:04:01] !log root@palladium conftool action : set/weight=25; selector: cluster=api_appserver,name=mw127.* [06:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:07:26] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) 
is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api: /api (Zotero alive) is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api [06:08:05] <_joe_> uhm this is bad [06:08:45] <_joe_> but also not true [06:08:49] <_joe_> citoid is working [06:09:03] <_joe_> so this is a problem with neon I'd say [06:10:25] <_joe_> it's zotero that is down, apparently, looking [06:12:19] PROBLEM - LVS HTTP IPv4 on zotero.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:14:05] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api: /api (web site in alternative language) is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api: /api (Zot [06:14:20] RECOVERY - LVS HTTP IPv4 on zotero.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.0 200 OK - 62 bytes in 0.013 second response time [06:15:29] [JavaScript Error: "uncaught exception: out of memory"] [06:15:52] virt 28.319g, resident 0.027t [06:15:54] restarting it [06:16:02] <_joe_> yes [06:16:09] <_joe_> that's sca1002? 
[06:16:14] <_joe_> I just restarted it on 1001 [06:16:16] ok [06:16:19] yup [06:16:31] ah that's why it's fine on sca1001 and it recovered [06:16:40] !log restarted zotero on sca1002 [06:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:16:45] (03PS2) 10KartikMistry: Deploy Compact Language Links as default (Stage 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294874 (https://phabricator.wikimedia.org/T136677) [06:16:45] <_joe_> yes, I was about to !log [06:16:50] !log _joe_ restarted zotero on sca1001 [06:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:16:58] <_joe_> thanks [06:17:10] <_joe_> so the citoid LVS alert is actually working [06:17:16] RECOVERY - zotero on sca1002 is OK: HTTP OK: HTTP/1.0 200 OK - 62 bytes in 0.011 second response time [06:17:19] (03CR) 10jenkins-bot: [V: 04-1] Deploy Compact Language Links as default (Stage 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294874 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry) [06:17:25] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [06:17:25] depends [06:17:26] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [06:17:26] <_joe_> It caught the problem well before the zotero lvs did [06:17:36] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [06:17:43] it probably is that gov database again [06:17:45] <_joe_> ofc it's a symptom [06:17:51] <_joe_> akosiaris: an OOM? 
[06:18:13] <_joe_> btw I don't know why pybal didn't depool at least one of the zotero hosts [06:18:21] no, it never ooms [06:18:21] <_joe_> probably we don't do proper checks [06:18:52] oh, the check I've managed to do with zotero is quite simple [06:19:00] so https://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&c=Service+Cluster+A+eqiad&m=cpu_report&s=by+name&mc=2&g=mem_report [06:19:17] <_joe_> sigh [06:19:22] I am this close to putting a cronjob to restart zotero once a week [06:19:25] <_joe_> no I mean pybal checks [06:19:36] <_joe_> akosiaris: see what I did for jobrunners :) [06:19:41] heh [06:19:47] <_joe_> as far as cron restarts go [06:19:55] <_joe_> I am thinking of extending that to the API cluster [06:20:02] (03PS3) 10KartikMistry: Deploy Compact Language Links as default (Stage 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294874 (https://phabricator.wikimedia.org/T136677) [06:20:39] ah the zotero pybal check is IdleConnection [06:20:49] ProxyFetch is basically impossible IIRC [06:20:51] <_joe_> akosiaris: right [06:20:54] <_joe_> why? 
[06:20:55] zotero only accepts POSTs [06:21:02] <_joe_> well let me try :) [06:21:04] no GETs [06:21:25] <_joe_> and we don't do POST in ProxyFetch [06:21:31] <_joe_> well, we must add that [06:21:41] <_joe_> if anyone opens a ticket on that :P [06:22:39] $USER1$/check_http -I $HOSTADDRESS$ -H $ARG1$ -p $ARG2$ -P '[{"itemType":"journalArticle"}]' -T 'application/json' -u "$ARG3$" [06:22:43] that the zotero nagios check [06:22:49] that actually worked from what I see [06:23:09] <_joe_> yes [06:23:15] (08:53:27 πμ) icinga-wm: PROBLEM - zotero on sca1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:23:19] and very very early [06:23:23] <_joe_> yes [06:23:35] 20 whole minutes before LVS paged [06:24:01] <_joe_> and 5 before the functional test on citoid lvs paged [06:24:17] <_joe_> we should add more functional checks to LVSs [06:24:22] no that's the host check $USER1$/check_http -H $HOSTADDRESS$ -p $ARG1$ -P '[{"itemType":"journalArticle"}]' -T 'application/json' -u /export?format=wikipedia [06:24:24] <_joe_> and make them page [06:24:29] the other one I pasted is the LVS check [06:24:35] not that they differ a lot [06:24:48] <_joe_> nope [06:25:33] anyway, back to breakfast, bbl [06:26:37] <_joe_> heh, I should do breakfast as well [06:26:45] zotero again? [06:27:33] mobrovac: yup. 
out of memory and the xul engine logged but that's all it does about it [06:27:42] lol [06:28:01] "fyi i'm out of mem, but i'll continue" [06:30:57] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:27] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:47] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:47] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:56] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:16] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:26] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:30] 06Operations, 10Android-app-feature-Feeds, 10Mobile-Content-Service, 10RESTBase, and 3 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2386622 (10ori) From reading the schema, it looks like you're reporting an average for all requests made in the c... 
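Note on the zotero check quoted at 06:24:22: the host check is a check_http call that POSTs a small JSON body to /export?format=wikipedia, which is why it noticed the hang while the IdleConnection pybal check did not. What follows is only a rough sketch of the same kind of probe in Python, not the actual icinga or pybal code. The function names `build_probe` and `is_healthy` are made up for illustration, the 10 second timeout is taken from the "Socket timeout after 10 seconds" alerts, and treating only HTTP 200 as healthy is an assumption.

```python
import json
import urllib.error
import urllib.request

# Payload and path copied from the check_http command quoted in the log.
PROBE_BODY = [{"itemType": "journalArticle"}]
PROBE_PATH = "/export?format=wikipedia"


def build_probe(host, port, path=PROBE_PATH):
    """Build the POST request (hypothetical helper; host and port come from the caller)."""
    return urllib.request.Request(
        f"http://{host}:{port}{path}",
        data=json.dumps(PROBE_BODY).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_healthy(host, port, timeout=10.0):
    """Return True only if the POST is answered with HTTP 200 within `timeout` seconds."""
    try:
        with urllib.request.urlopen(build_probe(host, port), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Timeouts, refused connections and HTTP error statuses all count as unhealthy.
        return False
```

A functional probe like this exercises the request path rather than just the TCP connection, so a process that is still listening but stuck (as zotero was here) would likely fail it.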
[06:32:35] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:36] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:46] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 2 failures [06:33:46] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 2 failures [06:34:03] (03PS1) 10Mobrovac: Revert "Change Prop: Disable transclusion update rules" [puppet] - 10https://gerrit.wikimedia.org/r/294880 [06:40:19] !log installing apache update on palladium [06:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:44:52] PROBLEM - puppet last run on mw2140 is CRITICAL: CRITICAL: puppet fail [06:44:53] PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: puppet fail [06:44:53] PROBLEM - puppet last run on es1013 is CRITICAL: CRITICAL: Puppet has 29 failures [06:45:02] PROBLEM - puppet last run on acamar is CRITICAL: CRITICAL: Puppet has 48 failures [06:45:02] PROBLEM - puppet last run on cp1072 is CRITICAL: CRITICAL: puppet fail [06:45:03] PROBLEM - puppet last run on cp2007 is CRITICAL: CRITICAL: puppet fail [06:45:12] PROBLEM - puppet last run on mw2238 is CRITICAL: CRITICAL: Puppet has 5 failures [06:45:15] 06Operations, 06Collaboration-Team-Interested, 10DBA, 10Flow, 07WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610#2387980 (10jcrespo) Do do know when was flow enabled for the first time/what is the oldest content we will find? 
[06:45:22] PROBLEM - puppet last run on mw2139 is CRITICAL: CRITICAL: puppet fail [06:45:22] PROBLEM - puppet last run on restbase2005 is CRITICAL: CRITICAL: Puppet has 49 failures [06:45:23] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: puppet fail [06:45:32] PROBLEM - puppet last run on cp2021 is CRITICAL: CRITICAL: puppet fail [06:45:33] PROBLEM - puppet last run on rdb1002 is CRITICAL: CRITICAL: puppet fail [06:45:33] PROBLEM - puppet last run on mw2201 is CRITICAL: CRITICAL: puppet fail [06:45:33] PROBLEM - puppet last run on mx2001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:45:34] PROBLEM - puppet last run on restbase1010 is CRITICAL: CRITICAL: Puppet has 33 failures [06:45:42] PROBLEM - puppet last run on db2069 is CRITICAL: CRITICAL: puppet fail [06:45:52] PROBLEM - puppet last run on mw2183 is CRITICAL: CRITICAL: puppet fail [06:45:53] PROBLEM - puppet last run on mw2194 is CRITICAL: CRITICAL: Puppet has 53 failures [06:45:53] PROBLEM - puppet last run on wtp2007 is CRITICAL: CRITICAL: Puppet has 27 failures [06:45:53] PROBLEM - puppet last run on restbase2007 is CRITICAL: CRITICAL: Puppet has 32 failures [06:45:53] PROBLEM - puppet last run on elastic2022 is CRITICAL: CRITICAL: puppet fail [06:45:54] PROBLEM - puppet last run on mw1161 is CRITICAL: CRITICAL: Puppet has 52 failures [06:45:54] PROBLEM - puppet last run on mw2160 is CRITICAL: CRITICAL: puppet fail [06:45:54] PROBLEM - puppet last run on mw2065 is CRITICAL: CRITICAL: Puppet has 88 failures [06:46:02] PROBLEM - puppet last run on mw2089 is CRITICAL: CRITICAL: Puppet has 75 failures [06:46:02] PROBLEM - puppet last run on mw1146 is CRITICAL: CRITICAL: Puppet has 92 failures [06:46:03] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: puppet fail [06:46:03] PROBLEM - puppet last run on mw2170 is CRITICAL: CRITICAL: puppet fail [06:46:03] PROBLEM - puppet last run on mw1221 is CRITICAL: CRITICAL: Puppet has 71 failures [06:46:13] PROBLEM - puppet last run on 
mw2151 is CRITICAL: CRITICAL: Puppet has 60 failures [06:46:13] PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: puppet fail [06:46:22] PROBLEM - puppet last run on mc1007 is CRITICAL: CRITICAL: puppet fail [06:46:22] PROBLEM - puppet last run on mw2099 is CRITICAL: CRITICAL: Puppet has 35 failures [06:46:32] PROBLEM - puppet last run on mw1147 is CRITICAL: CRITICAL: Puppet has 90 failures [06:46:33] PROBLEM - puppet last run on mw2102 is CRITICAL: CRITICAL: puppet fail [06:46:33] PROBLEM - puppet last run on mw2225 is CRITICAL: CRITICAL: puppet fail [06:46:34] PROBLEM - puppet last run on wtp1015 is CRITICAL: CRITICAL: Puppet has 32 failures [06:46:43] PROBLEM - puppet last run on mw2189 is CRITICAL: CRITICAL: Puppet has 36 failures [06:46:43] PROBLEM - puppet last run on maps-test2002 is CRITICAL: CRITICAL: Puppet has 4 failures [06:46:43] PROBLEM - puppet last run on analytics1015 is CRITICAL: CRITICAL: Puppet has 38 failures [06:46:43] PROBLEM - puppet last run on ms-be2007 is CRITICAL: CRITICAL: Puppet has 35 failures [06:46:43] PROBLEM - puppet last run on mw1265 is CRITICAL: CRITICAL: Puppet has 136 failures [06:46:53] PROBLEM - puppet last run on mw2179 is CRITICAL: CRITICAL: Puppet has 37 failures [06:47:22] PROBLEM - puppet last run on mw2103 is CRITICAL: CRITICAL: Puppet has 60 failures [06:47:23] PROBLEM - puppet last run on mw2197 is CRITICAL: CRITICAL: Puppet has 61 failures [06:48:03] PROBLEM - puppet last run on mw1210 is CRITICAL: CRITICAL: Puppet has 33 failures [06:50:13] RECOVERY - puppet last run on mw1221 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [06:56:23] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:56:53] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:57:03] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 40 
seconds ago with 0 failures [06:57:03] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:03] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:13] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:57:23] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:57:42] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:57:53] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:02] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:02] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:01] (03CR) 10Giuseppe Lavagetto: [C: 032] Revert "Change Prop: Disable transclusion update rules" [puppet] - 10https://gerrit.wikimedia.org/r/294880 (owner: 10Mobrovac) [07:02:56] !log change-prop restarting it to apply https://gerrit.wikimedia.org/r/294880 [07:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:10:04] PROBLEM - DPKG on hafnium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:10:05] PROBLEM - DPKG on bohrium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:10:45] RECOVERY - puppet last run on mw1146 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [07:10:45] RECOVERY - puppet last run on restbase2007 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [07:10:46] RECOVERY - puppet last run on analytics1015 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [07:10:55] 
RECOVERY - puppet last run on maps-test2002 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [07:10:55] RECOVERY - puppet last run on mw1265 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [07:11:04] RECOVERY - puppet last run on mw2194 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [07:11:06] (03PS2) 10Giuseppe Lavagetto: Fix sorting of hostnames [dns] - 10https://gerrit.wikimedia.org/r/293282 [07:11:15] RECOVERY - puppet last run on restbase1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:11:15] RECOVERY - puppet last run on restbase2005 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [07:11:16] RECOVERY - puppet last run on mw2089 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [07:11:24] RECOVERY - puppet last run on es1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:11:25] RECOVERY - puppet last run on acamar is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [07:11:25] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [07:11:29] !log restbase started mobile-sections dump on restbase1009 for T136964 [07:11:30] T136964: Pre-generate/purge mobile-sections endpoints to fix page links inside image captions - https://phabricator.wikimedia.org/T136964 [07:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:11:35] RECOVERY - puppet last run on mw2151 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [07:11:35] RECOVERY - puppet last run on mw2238 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [07:11:36] RECOVERY - puppet last run on wtp2007 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [07:11:44] 
RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [07:11:45] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [07:11:55] RECOVERY - puppet last run on db2069 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [07:11:55] RECOVERY - puppet last run on mc1007 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [07:12:05] RECOVERY - puppet last run on mw2197 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [07:12:05] RECOVERY - DPKG on hafnium is OK: All packages OK [07:12:05] RECOVERY - puppet last run on mw2099 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [07:12:05] RECOVERY - DPKG on bohrium is OK: All packages OK [07:12:16] RECOVERY - puppet last run on mw1147 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:12:16] RECOVERY - puppet last run on cp2021 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [07:12:16] RECOVERY - puppet last run on cp2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:12:25] RECOVERY - puppet last run on mw2102 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [07:12:25] RECOVERY - puppet last run on mw2179 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:12:25] RECOVERY - puppet last run on rdb1002 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [07:12:26] RECOVERY - puppet last run on mx2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:12:26] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix sorting of hostnames [dns] - 10https://gerrit.wikimedia.org/r/293282 (owner: 10Giuseppe Lavagetto) [07:12:35] RECOVERY - puppet last run on cp1072 is OK: OK: Puppet is currently 
enabled, last run 1 minute ago with 0 failures [07:12:35] RECOVERY - puppet last run on wtp1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:12:36] RECOVERY - puppet last run on mw2225 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [07:12:44] RECOVERY - puppet last run on elastic2022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:12:45] RECOVERY - puppet last run on mw2065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:12:46] RECOVERY - puppet last run on mw2103 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [07:12:46] RECOVERY - puppet last run on mw2140 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [07:12:55] RECOVERY - puppet last run on mw2189 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [07:13:05] RECOVERY - puppet last run on ms-be2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:13:05] RECOVERY - puppet last run on mw2160 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [07:13:05] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:13:24] RECOVERY - puppet last run on mw2139 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [07:13:25] RECOVERY - puppet last run on mw2201 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [07:13:25] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:13:55] RECOVERY - puppet last run on mw2183 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [07:14:05] RECOVERY - puppet last run on mw1210 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:14:36] RECOVERY - puppet 
last run on mw2170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:14:49] (03PS1) 10Jcrespo: Depool db1072 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294887 [07:16:30] (03CR) 10Jcrespo: [C: 032] Depool db1072 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294887 (owner: 10Jcrespo) [07:18:42] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1072 for maintenance (duration: 00m 31s) [07:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:19:39] (03Abandoned) 10Giuseppe Lavagetto: puppet: install msgpack and allow switching it on/off [puppet] - 10https://gerrit.wikimedia.org/r/286141 (owner: 10Giuseppe Lavagetto) [07:19:45] PROBLEM - puppet last run on achernar is CRITICAL: CRITICAL: puppet fail [07:23:50] !log backuping and reimaging db1072 [07:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:25:32] (03CR) 10Giuseppe Lavagetto: "The reason this is not part of base::service_unit is that it's systemd only." 
[puppet] - 10https://gerrit.wikimedia.org/r/291949 (owner: 10Giuseppe Lavagetto) [07:26:28] (03PS4) 10Muehlenhoff: services firejail: make fs blacklist more obvious [puppet] - 10https://gerrit.wikimedia.org/r/293515 (owner: 10JanZerebecki) [07:26:42] (03PS3) 10Giuseppe Lavagetto: systemd: add systemd::sidekick [puppet] - 10https://gerrit.wikimedia.org/r/291949 [07:29:38] (03PS4) 10Giuseppe Lavagetto: systemd: add systemd::sidekick [puppet] - 10https://gerrit.wikimedia.org/r/291949 [07:30:15] (03CR) 10Nikerabbit: [C: 04-1] Deploy Compact Language Links as default (Stage 1) (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294874 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry) [07:32:05] (03PS4) 10Giuseppe Lavagetto: salt: add wmfpuppet module [puppet] - 10https://gerrit.wikimedia.org/r/294694 [07:33:05] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [07:34:35] 06Operations, 06Collaboration-Team-Interested, 10DBA, 10Flow, 07WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610#2388011 (10matthiasmullie) The very first Flow commit was Wed Jul 10 23:05:11 2013 +0100. It was enabled on labs o... [07:40:40] (03PS10) 10Yuvipanda: prometheus: add server support [puppet] - 10https://gerrit.wikimedia.org/r/280652 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [07:40:52] (03CR) 10Yuvipanda: [C: 032] "Gonna meeeerge!" [puppet] - 10https://gerrit.wikimedia.org/r/280652 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [07:41:01] (03CR) 10Yuvipanda: [V: 032] "Gonna meeeerge!" 
[puppet] - 10https://gerrit.wikimedia.org/r/280652 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [07:42:40] (03PS4) 10Yuvipanda: prometheus: add nginx reverse proxy [puppet] - 10https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [07:43:17] (03PS5) 10Giuseppe Lavagetto: salt: add wmfpuppet module [puppet] - 10https://gerrit.wikimedia.org/r/294694 [07:44:33] (03PS1) 10Yuvipanda: tools: Restart kube2proxy when it dies [puppet] - 10https://gerrit.wikimedia.org/r/294888 [07:45:00] (03PS4) 10KartikMistry: Deploy Compact Language Links as default (Stage 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294874 (https://phabricator.wikimedia.org/T136677) [07:45:35] RECOVERY - puppet last run on achernar is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [07:45:38] (03CR) 10KartikMistry: Deploy Compact Language Links as default (Stage 1) (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294874 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry) [07:46:36] (03CR) 10Giuseppe Lavagetto: [C: 032] salt: add wmfpuppet module [puppet] - 10https://gerrit.wikimedia.org/r/294694 (owner: 10Giuseppe Lavagetto) [07:46:40] (03PS2) 10Yuvipanda: tools: Restart kube2proxy when it dies [puppet] - 10https://gerrit.wikimedia.org/r/294888 [07:46:54] joe can you +1 ^? 
(just a Restart=always) [07:47:39] (03CR) 10Giuseppe Lavagetto: [C: 031] tools: Restart kube2proxy when it dies [puppet] - 10https://gerrit.wikimedia.org/r/294888 (owner: 10Yuvipanda) [07:47:51] joe thanks [07:48:02] (03PS3) 10Yuvipanda: tools: Restart kube2proxy when it dies [puppet] - 10https://gerrit.wikimedia.org/r/294888 [07:48:14] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Restart kube2proxy when it dies [puppet] - 10https://gerrit.wikimedia.org/r/294888 (owner: 10Yuvipanda) [07:55:31] 06Operations, 06Discovery, 06Services, 03Maps-Sprint, 13Patch-For-Review: Allow configuration of contact groups for monitoring of services - https://phabricator.wikimedia.org/T137891#2382756 (10mobrovac) I'm not a fan of this solution. The defaults come from hiera, so can't you just change the hiera valu... [07:59:42] (03PS1) 10Yuvipanda: Add a static-web [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/294889 [08:00:01] (03CR) 10jenkins-bot: [V: 04-1] Add a static-web [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/294889 (owner: 10Yuvipanda) [08:00:03] (03PS2) 10Yuvipanda: Add a static-web [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/294889 [08:10:01] (03CR) 10WMDE-Fisch: [C: 031] tools: Install jdk8 in trusty nodes [puppet] - 10https://gerrit.wikimedia.org/r/292960 (https://phabricator.wikimedia.org/T121279) (owner: 10Yuvipanda) [08:11:14] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2368961 (10Paladox) But I think this task is also about uploaded the rewrites and then remove the server behind git.w... [08:12:03] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:16:22] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [08:23:16] 06Operations: Create backup/restore scripts for etcd - https://phabricator.wikimedia.org/T135129#2388070 (10Joe) a:03Joe [08:24:54] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2388071 (10greg) 05Resolved>03Open >>! In T137224#2387714, @Danny_B wrote: > Rules are written. Deploying them is... [08:29:06] !log Restarting Jenkins on gallium. Web interface at least is deadlocked somehow [08:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:30:01] probably it is not jenkins but git [08:30:37] yeah I havee seen a defunct git child [08:30:49] and jstack did not produce anything helpful :( [08:33:36] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2388125 (10greg) These, I believe, should be merged for this to be done: * https://gerrit.wikimedia.org/r/#/c/293789/... [08:34:02] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 57 failures [08:36:32] 06Operations, 06Discovery, 06Maps, 07Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744#2388141 (10Gehel) Referrer check has been removed (T137848) which allow more experimentation to happen. I'd still like to validate our automation of setting up new nodes and the rel... [08:40:08] gerrit seems to be down [08:41:00] and labs [08:41:19] DNS issues [08:41:22] Amir1: seems to work for me... [08:42:06] Amir1: I just checked the gerrit web ui and pulling a repo, there might be other issues... [08:42:15] gehel: maybe that's a MENA issue [08:42:25] back up now [08:42:35] Amir1: MENA ? 
[08:42:37] Lydia_WMDE and I couldn't connet [08:42:40] * gehel is not good with acronyms [08:42:41] *connect [08:42:52] PROBLEM - Host mr1-esams.oob is DOWN: PING CRITICAL - Packet loss = 100% [08:42:57] Middle East and North Africa [08:43:14] also having problem from europe, no dns or slow loading [08:43:22] I'm in Berlin though now [08:44:09] hey [08:44:12] traceroutes? [08:44:16] and your IP [08:44:39] Amir1, Nikerabbit ^^ [08:44:56] <_joe_> no dns for what specifically? [08:45:34] doing it right now [08:45:42] Failed to resolve host: No address associated with hostname [08:45:47] $ mtr -w -c 20 wmflabs.org [08:45:51] but now is back [08:45:51] I need your IPs and traceroutes [08:46:04] https://www.irccloud.com/pastebin/bry86UjK/ [08:46:42] my IP: 80.153.119.142 [08:46:45] there's v4 icmp failures from atlas too, https://atlas.ripe.net/measurements/1790945/#!map https://atlas.ripe.net/measurements/1791307/#!map https://atlas.ripe.net/measurements/1791210/#!map [08:46:49] (WMDE office) [08:47:19] seems to work now, right? 
[08:47:34] sometime it does, soemtime it doesn't [08:47:42] let me check again [08:48:21] https://www.irccloud.com/pastebin/5Ig9h6EK/ [08:48:32] This one looks different ^ [08:48:44] ACKNOWLEDGEMENT - puppet last run on ms-be2012 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi sdk failed https://phabricator.wikimedia.org/T135975 [08:48:52] but I'm not super familiar with traceroute [08:49:13] RECOVERY - Host mr1-esams.oob is UP: PING WARNING - Packet loss = 50%, RTA = 81.76 ms [08:49:39] Everything looks okay now to me [08:50:02] nod [08:50:10] looks like issues with Telia [08:50:23] PROBLEM - puppet last run on mw1140 is CRITICAL: CRITICAL: Puppet has 14 failures [08:50:54] Hi could someone remove production branch from production Track Onlyin https://phabricator.wikimedia.org/diffusion/OPUP/manage/branches/ please [08:51:37] It will allow refs/changes/ to be processed [08:51:42] Please [08:51:57] Amir1: looks like it has recovered -- let me know if you experience any trouble again [08:52:12] PROBLEM - changeprop endpoints health on scb2002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.48.43, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [08:52:54] paravoid: sure, thank you :) [08:53:13] PROBLEM - test icmp reachability to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 26 probes of 388 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map [08:53:20] hey that's nice [08:53:33] a little late, but nice! 
[08:54:56] hehe indeed, that's how I noticed earlier, perhaps it wasn't critical for long enough to notify on irc, it was in the web interface tho [08:55:53] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [08:55:55] (03PS4) 10Filippo Giunchedi: DNS: Add mgmt DNS entries for ms-be2022 to ms-be2027 [dns] - 10https://gerrit.wikimedia.org/r/294543 (https://phabricator.wikimedia.org/T136630) (owner: 10Papaul) [08:56:02] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] DNS: Add mgmt DNS entries for ms-be2022 to ms-be2027 [dns] - 10https://gerrit.wikimedia.org/r/294543 (https://phabricator.wikimedia.org/T136630) (owner: 10Papaul) [08:56:52] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [08:56:58] sigh at our icinga dashboard :( [08:57:04] so many failures [08:57:25] (03PS3) 10Filippo Giunchedi: DHCP: Add MAC address entries for ms-be202[2-7] [puppet] - 10https://gerrit.wikimedia.org/r/294863 (https://phabricator.wikimedia.org/T136630) (owner: 10Papaul) [08:57:38] (03PS4) 10Filippo Giunchedi: DHCP: Add MAC address entries for ms-be202[2-7] [puppet] - 10https://gerrit.wikimedia.org/r/294863 (https://phabricator.wikimedia.org/T136630) (owner: 10Papaul) [08:57:46] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] DHCP: Add MAC address entries for ms-be202[2-7] [puppet] - 10https://gerrit.wikimedia.org/r/294863 (https://phabricator.wikimedia.org/T136630) (owner: 10Papaul) [08:58:33] godog: Icinga checks ripe-atlas-eqiad every 5 minutes. On the first two failures it sets it in SOFT state, which does not trigger a notification. 
On the third it is a HARD state, which does trigger a notification [08:58:39] godog: history https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=ripe-atlas-eqiad&service=test+icmp+reachability+to+eqiad [08:59:09] godog: with first encounter at 08:41 utc and HARD state reached at 08:53 matching the time of the IRC message [09:00:22] hashar: indeed, thanks! yeah the rest for codfw/ulsfo has been too brief to go in HARD state and notify [09:00:32] check states are described on http://docs.icinga.org/latest/en/statetypes.html [09:00:37] and that is often a source of confusion :( [09:00:50] (03PS1) 10Jcrespo: Install jessie on db1072 [puppet] - 10https://gerrit.wikimedia.org/r/294893 [09:01:32] (03PS2) 10Filippo Giunchedi: adding install params for ms-be202[2-7] Bug:T136630 [puppet] - 10https://gerrit.wikimedia.org/r/294866 (https://phabricator.wikimedia.org/T136630) (owner: 10Papaul) [09:01:42] (03PS3) 10Filippo Giunchedi: adding install params for ms-be202[2-7] [puppet] - 10https://gerrit.wikimedia.org/r/294866 (https://phabricator.wikimedia.org/T136630) (owner: 10Papaul) [09:01:56] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] adding install params for ms-be202[2-7] [puppet] - 10https://gerrit.wikimedia.org/r/294866 (https://phabricator.wikimedia.org/T136630) (owner: 10Papaul) [09:02:55] (03PS2) 10Jcrespo: Install jessie on db1072 [puppet] - 10https://gerrit.wikimedia.org/r/294893 [09:03:41] RECOVERY - test icmp reachability to eqiad on ripe-atlas-eqiad is OK: OK - failed 0 probes of 388 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map [09:03:54] 06Operations, 06Discovery, 06Maps, 07Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744#2388214 (10Gehel) [09:04:20] (03PS2) 10Gehel: Adding gehel (Guillaume Lederrey) to icinga sms group [puppet] - 10https://gerrit.wikimedia.org/r/294486 [09:06:19] (03CR) 10Jcrespo: [C: 032] Install jessie on db1072 [puppet] - 10https://gerrit.wikimedia.org/r/294893 (owner: 10Jcrespo) 
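Editorial sketch of the SOFT/HARD behaviour described in the 08:58 to 09:00 messages above: a failing check stays in a SOFT state until it has failed a configured number of times in a row, and only the switch to HARD sends a notification. This is a simplified model, not Icinga's implementation. The class name `ServiceCheck` and its `process` method are hypothetical, and the threshold of 3 attempts is taken from the "first two failures ... on the third" wording rather than from the actual service configuration.

```python
from dataclasses import dataclass


@dataclass
class ServiceCheck:
    """Toy model of SOFT/HARD check states (hypothetical, simplified)."""
    max_check_attempts: int = 3   # assumption: third consecutive failure goes HARD
    state: str = "OK"
    state_type: str = "HARD"
    attempt: int = 0

    def process(self, result: str) -> bool:
        """Feed one check result; return True if a notification would be sent."""
        was_hard_problem = self.state != "OK" and self.state_type == "HARD"
        if result == "OK":
            self.state, self.state_type, self.attempt = "OK", "HARD", 0
            # A recovery is only announced if the problem had reached HARD.
            return was_hard_problem
        self.state = result
        self.attempt = min(self.attempt + 1, self.max_check_attempts)
        if self.attempt < self.max_check_attempts:
            self.state_type = "SOFT"   # still retrying, stay quiet
            return False
        self.state_type = "HARD"
        # Notify once, when the problem first becomes HARD.
        return not was_hard_problem


def run(results):
    """Replay a sequence of check results and collect the notify decisions."""
    check = ServiceCheck()
    return [check.process(r) for r in results]
```

With this model, `run(["CRITICAL", "CRITICAL", "CRITICAL", "OK"])` notifies on the third failure and on the recovery, while a short blip such as `run(["CRITICAL", "OK"])` never notifies, which is consistent with the codfw/ulsfo checks staying silent on IRC.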
[09:06:44] (03CR) 10Gehel: [C: 032] Adding gehel (Guillaume Lederrey) to icinga sms group [puppet] - 10https://gerrit.wikimedia.org/r/294486 (owner: 10Gehel) [09:07:41] (03PS3) 10Gehel: Adding gehel (Guillaume Lederrey) to icinga sms group [puppet] - 10https://gerrit.wikimedia.org/r/294486 [09:07:42] RECOVERY - changeprop endpoints health on scb2002 is OK: All endpoints are healthy [09:08:45] 06Operations, 10ops-codfw, 10media-storage, 13Patch-For-Review: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T136630#2388217 (10fgiunchedi) >>! In T136630#2387712, @Papaul wrote: > @fgiunchedi the other ms-be* systems have Trusty installed are we also installing Trusty on the new sys... [09:08:51] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [09:15:38] PROBLEM - Apache HTTP on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:16:09] PROBLEM - HHVM rendering on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:17:37] PROBLEM - Check size of conntrack table on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:17:57] PROBLEM - DPKG on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:19:07] PROBLEM - configured eth on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:19:19] PROBLEM - dhclient process on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:19:37] PROBLEM - nutcracker port on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:20:08] PROBLEM - salt-minion processes on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:21:08] PROBLEM - nutcracker process on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:25:37] PROBLEM - HHVM processes on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:27:28] PROBLEM - SSH on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:27:28] PROBLEM - Disk space on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:30:08] PROBLEM - HHVM rendering on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:30:34] !log rolling reboot of mw1153,mw1155,mw1156 into new kernels [09:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:31:13] <_joe_> !log powercycling mw1140, OOMd [09:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:31:18] PROBLEM - Apache HTTP on mw1135 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.016 second response time [09:31:49] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [09:32:11] (03PS3) 10Alexandros Kosiaris: networks::constants: use slice_network_constants [puppet] - 10https://gerrit.wikimedia.org/r/291819 [09:32:13] (03PS29) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [09:32:56] RECOVERY - SSH on mw1140 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0) [09:33:15] RECOVERY - nutcracker port on mw1140 is OK: TCP OK - 0.000 second response time on port 11212 [09:33:17] RECOVERY - nutcracker process on mw1140 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [09:33:46] RECOVERY - HHVM rendering on mw1140 is OK: HTTP OK: HTTP/1.1 200 OK - 70860 bytes in 1.646 second response time [09:33:46] RECOVERY - salt-minion processes on mw1140 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:34:07] RECOVERY - configured eth on mw1140 is OK: OK - interfaces up [09:34:26] RECOVERY - dhclient process on mw1140 is OK: PROCS OK: 0 processes with command name dhclient 
[09:34:36] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.077 second response time [09:34:56] RECOVERY - Check size of conntrack table on mw1140 is OK: OK: nf_conntrack is 11 % full [09:35:15] RECOVERY - DPKG on mw1140 is OK: All packages OK [09:35:26] RECOVERY - Disk space on mw1140 is OK: DISK OK [09:35:46] RECOVERY - HHVM processes on mw1140 is OK: PROCS OK: 6 processes with command name hhvm [09:36:58] 06Operations, 10netops, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2388249 (10akosiaris) So, this is happening due to carbon having a public IP and somewhat relax firewall rules. The premise is that hosts with public IPs have that due to t... [09:37:26] RECOVERY - puppet last run on mw1140 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:41:52] 06Operations, 10netops, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2388256 (10Krenair) phab-01.phabricator.eqiad.wmflabs should no longer be doing it [09:50:07] (03CR) 10Paladox: [C: 04-1] "Rewrite rules need to be updated to https://phabricator.wikimedia.org/P3262 please." 
[puppet] - 10https://gerrit.wikimedia.org/r/294784 (https://phabricator.wikimedia.org/T137224) (owner: 10JanZerebecki) [09:53:13] (03PS3) 10JanZerebecki: Rewrite rules for git.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/294867 (https://phabricator.wikimedia.org/T137224) (owner: 1020after4) [09:53:16] (03Abandoned) 10JanZerebecki: Add gitblit compatibility apache vhost to phabricator [puppet] - 10https://gerrit.wikimedia.org/r/294784 (https://phabricator.wikimedia.org/T137224) (owner: 10JanZerebecki) [09:54:13] (03CR) 10JanZerebecki: "Merged in some pieces of https://gerrit.wikimedia.org/r/#/c/294867/" [puppet] - 10https://gerrit.wikimedia.org/r/294867 (https://phabricator.wikimedia.org/T137224) (owner: 1020after4) [09:55:48] (03CR) 10Paladox: [C: 031] "Thankyou." [puppet] - 10https://gerrit.wikimedia.org/r/294867 (https://phabricator.wikimedia.org/T137224) (owner: 1020after4) [10:02:02] (03CR) 10Danny B.: [C: 031] Rewrite rules for git.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/294867 (https://phabricator.wikimedia.org/T137224) (owner: 1020after4) [10:24:03] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2388330 (10mmodell) Greg: I guess https://gerrit.wikimedia.org/r/#/c/294867/ supersedes the others. We also need to... [10:32:54] 06Operations: setup syslog server in codfw - https://phabricator.wikimedia.org/T138073#2388340 (10fgiunchedi) [10:38:14] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2388379 (10Paladox) @mmodell I guess we can ask ops to do the dns bit please. 
[10:39:18] 06Operations, 10procurement: procure syslog hardware in codfw - https://phabricator.wikimedia.org/T138075#2388384 (10fgiunchedi) [10:39:53] (03CR) 10JanZerebecki: Rewrite rules for git.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/294867 (https://phabricator.wikimedia.org/T137224) (owner: 1020after4) [10:47:03] 06Operations, 10hardware-requests: procure syslog hardware in codfw - https://phabricator.wikimedia.org/T138075#2388405 (10Peachey88) [10:50:58] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2388406 (10JanZerebecki) >>! In T137224#2388330, @mmodell wrote: > Greg: I guess https://gerrit.wikimedia.org/r/#/c/2... [10:54:14] (03CR) 10JanZerebecki: "I have seen no one ask for it not being on iridium, so I think it is good that this is in the same patch. (If the tests were being execute" [puppet] - 10https://gerrit.wikimedia.org/r/293789 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn) [10:56:10] 06Operations, 10Dumps-Generation, 07HHVM, 13Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#2388427 (10ArielGlenn) https://github.com/facebook/hhvm/issues/7075 has just been closed after this commit: https://github.com/facebook/hhvm/commit/9d2be6c3... [10:59:05] (03CR) 10Paladox: [C: 031] varnish: git.wm.org to antimony, remove git-related config/tests [puppet] - 10https://gerrit.wikimedia.org/r/293789 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn) [11:00:11] 06Operations, 10Android-app-feature-Feeds, 10Mobile-Content-Service, 10RESTBase, and 3 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2388438 (10BBlack) I don't even understand the initial thing the ticket is complaining about. Can you explain to... 
[11:00:55] 06Operations: kvm on ganeti instances getting stuck - https://phabricator.wikimedia.org/T134242#2388442 (10akosiaris) [11:00:57] 06Operations, 10vm-requests: EQIAD: (1) VM request for url-downloader - https://phabricator.wikimedia.org/T134496#2388441 (10akosiaris) [11:05:42] (03CR) 10JanZerebecki: [C: 04-1] varnish: git.wm.org to antimony, remove git-related config/tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/293789 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn) [11:10:51] (03CR) 10JanZerebecki: varnish: git.wm.org to antimony, remove git-related config/tests (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/293789 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn) [11:13:26] (03CR) 10Paladox: "@JanZerebecki would you be able todo ^^ please since @dzahn is on holiday." [puppet] - 10https://gerrit.wikimedia.org/r/293789 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn) [11:14:06] !log stopping puppet on hosts using service::node (restbase, sca, scb, aqs) for step-by-step rollout of two puppet patches for firejail/service::node [11:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:16:59] (03PS1) 10Jcrespo: Install jessie on all db eqiad hosts > db1050 [puppet] - 10https://gerrit.wikimedia.org/r/294904 [11:17:03] (03PS5) 10Muehlenhoff: services firejail: make fs blacklist more obvious [puppet] - 10https://gerrit.wikimedia.org/r/293515 (owner: 10JanZerebecki) [11:19:24] (03PS1) 10Jcrespo: Repool db1072 with low weight, depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294905 [11:20:20] (03CR) 10Jcrespo: [C: 032] Install jessie on all db eqiad hosts > db1050 [puppet] - 10https://gerrit.wikimedia.org/r/294904 (owner: 10Jcrespo) [11:20:26] (03CR) 10Muehlenhoff: [C: 032 V: 032] services firejail: make fs blacklist more obvious [puppet] - 10https://gerrit.wikimedia.org/r/293515 (owner: 10JanZerebecki) [11:20:33] (03PS6) 10Muehlenhoff: 
services firejail: make fs blacklist more obvious [puppet] - 10https://gerrit.wikimedia.org/r/293515 (owner: 10JanZerebecki) [11:20:42] (03CR) 10Muehlenhoff: [V: 032] services firejail: make fs blacklist more obvious [puppet] - 10https://gerrit.wikimedia.org/r/293515 (owner: 10JanZerebecki) [11:21:48] I think there is a race condition here [11:22:06] mine is merged now, so should be resolved :-) [11:22:15] no, that is the bad thing [11:23:00] look https://phabricator.wikimedia.org/P3266 [11:23:20] it merges more than promised [11:23:57] and it doesn't warn there are 2 changes [11:24:09] weird, puppet-merge only displayed me my change [11:24:32] mmm [11:24:41] maybe you merged at the same time? [11:25:14] so running 2 is "transactional"? [11:25:15] PROBLEM - check_mysql on fdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2224 [11:25:28] could be given the timing our the gerrit-wm output [11:25:50] also, your merge did not complain [11:25:55] on gerrit [11:28:13] (03PS2) 10Muehlenhoff: Remove --tmpfs option in service::node and zotero [puppet] - 10https://gerrit.wikimedia.org/r/294483 (https://phabricator.wikimedia.org/T121756) [11:28:40] jynus: let's create a task, so that it can be re-checked with the new puppet next quarter? [11:30:35] (03CR) 10Muehlenhoff: [C: 032 V: 032] Remove --tmpfs option in service::node and zotero [puppet] - 10https://gerrit.wikimedia.org/r/294483 (https://phabricator.wikimedia.org/T121756) (owner: 10Muehlenhoff) [11:32:53] there's definitely no transactionality to what puppet-merge does [11:34:07] maybe add a .lock ? 
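Editorial sketch of the ".lock" idea floated above, combined with the guard proposed in the messages that follow (quit if the changes accepted differ from the ones about to merge). This is not how puppet-merge works; it only illustrates the two safeguards under stated assumptions. The names `guarded_merge`, `MergeAborted`, `list_pending` and `do_merge` are hypothetical, and the lock uses `fcntl.flock`, so it assumes a Unix host.

```python
import fcntl
from typing import Callable, List, Sequence


class MergeAborted(Exception):
    """Raised when the merge must not proceed."""


def guarded_merge(
    lock_path: str,
    reviewed: Sequence[str],
    list_pending: Callable[[], Sequence[str]],
    do_merge: Callable[[List[str]], None],
) -> List[str]:
    """Merge only if we hold the lock and the pending set matches what was reviewed.

    reviewed     -- change ids the operator looked at and accepted
    list_pending -- hypothetical callback returning the ids that would be merged now
    do_merge     -- hypothetical callback performing the actual merge
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: a second concurrent run fails fast.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            raise MergeAborted("another merge is in progress") from exc
        # Re-read the pending changes while holding the lock.
        pending = list(list_pending())
        if pending != list(reviewed):
            raise MergeAborted(
                f"reviewed {list(reviewed)} but would merge {pending}; aborting"
            )
        do_merge(pending)
        return pending
    # The lock is released when lock_file is closed on leaving the with-block.
```

In the situation pasted in P3266, where one change was shown and two were merged, the comparison would abort instead of merging the extra change silently.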
[11:34:37] but this issue is different, is accepting one change and merging 2 [11:35:10] so it is that, and if the changes accepted != the ones about to merge, quit [11:35:15] RECOVERY - check_mysql on fdb2001 is OK: Uptime: 307176 Threads: 1 Questions: 3741336 Slow queries: 2072 Opens: 651 Flush tables: 2 Open tables: 572 Queries per second avg: 12.179 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [11:37:36] (03CR) 10Alexandros Kosiaris: [C: 031] "I am seeing 24G for /srv, 80K for /var/lib/zuul, and /var/lib/jenkins is a 512M tempfs.." [puppet] - 10https://gerrit.wikimedia.org/r/293690 (https://phabricator.wikimedia.org/T80385) (owner: 10Muehlenhoff) [12:04:04] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Actually no, /var/lib/jenkins/tmpfs is a tmpfs... still trying to run the numbers..." [puppet] - 10https://gerrit.wikimedia.org/r/293690 (https://phabricator.wikimedia.org/T80385) (owner: 10Muehlenhoff) [12:08:05] PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: puppet fail [12:08:15] PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: Puppet has 1 failures [12:08:34] (03PS5) 10Nikerabbit: Deploy Compact Language Links as default (Stage 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294874 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry) [12:09:49] (03CR) 10Nikerabbit: [C: 031] "Code looks good, did not test." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/294874 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry) [12:14:29] (03PS1) 10Ctrochalakis: zookeeper::jmxtrans: Expose statsd & group_prefix parameters [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/294906 [12:15:18] (03CR) 10Alexandros Kosiaris: "so," [puppet] - 10https://gerrit.wikimedia.org/r/293690 (https://phabricator.wikimedia.org/T80385) (owner: 10Muehlenhoff) [12:16:38] 06Operations, 10Analytics-Cluster, 10Traffic: Respect X-Forwarded-For only from trustworthy sources - https://phabricator.wikimedia.org/T56783#2388494 (10BBlack) nice find in the old tickets! See also: T120121 [12:18:19] hasharAway: got a question for you in https://gerrit.wikimedia.org/r/293690 [12:27:27] RECOVERY - HHVM rendering on mw1135 is OK: HTTP OK: HTTP/1.1 200 OK - 70840 bytes in 9.964 second response time [12:27:36] RECOVERY - Apache HTTP on mw1133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.431 second response time [12:27:53] !log restarted hhvm on mw1133 and mw1135 [12:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:28:36] RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.092 second response time [12:29:45] RECOVERY - HHVM rendering on mw1133 is OK: HTTP OK: HTTP/1.1 200 OK - 71402 bytes in 0.904 second response time [12:31:55] RECOVERY - puppet last run on mw1133 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [12:32:46] RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:39:01] (03PS1) 10Catrope: Enable Flow beta feature on frwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294909 (https://phabricator.wikimedia.org/T138064) [12:47:56] (03CR) 10Jcrespo: [C: 032] Repool db1072 with low weight, depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294905 (owner: 
10Jcrespo) [12:49:04] !log rolling reboot of mw1157-mw1160 into new kernels [12:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:49:53] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1072 with low weight, depool db1073 (duration: 00m 27s) [12:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:54:38] (03PS1) 10Jcrespo: Increase db1072 weight after repooling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294910 [12:57:29] !log stopping, backuping and reimaging db1073 [12:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:59:08] 06Operations, 10Analytics-Cluster, 10Traffic: Respect X-Forwarded-For only from trustworthy sources - https://phabricator.wikimedia.org/T56783#2388543 (10Yurik) We should move all proxies out of the zerowiki and into metawiki. This will allow a much more transparent management of the proxy ips, and won't as... [13:02:20] 06Operations, 10Analytics-Cluster, 10Traffic: Respect X-Forwarded-For only from trustworthy sources - https://phabricator.wikimedia.org/T56783#2388560 (10BBlack) >>! In T56783#2388543, @Yurik wrote: > We should move all proxies out of the zerowiki and into metawiki. This will allow a much more transparent ma... 
[13:02:36] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [13:03:47] 06Operations, 06Discovery, 06Maps, 03Maps-Sprint: Rack/Setup 4 map servers in eqiad - https://phabricator.wikimedia.org/T135018#2388568 (10Gehel) [13:10:42] (03CR) 10Alexandros Kosiaris: [C: 031] systemd: add systemd::sidekick [puppet] - 10https://gerrit.wikimedia.org/r/291949 (owner: 10Giuseppe Lavagetto) [13:25:09] (03PS2) 10JanZerebecki: varnish: git.wm.org to iridium, remove related config/tests/monitoring [puppet] - 10https://gerrit.wikimedia.org/r/293789 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn) [13:26:06] (03CR) 10JanZerebecki: varnish: git.wm.org to iridium, remove related config/tests/monitoring (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/293789 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn) [13:27:25] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:42:47] (03PS1) 10Gehel: Configuration for new maps cluster in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/294914 [13:43:59] (03CR) 10jenkins-bot: [V: 04-1] Configuration for new maps cluster in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/294914 (owner: 10Gehel) [13:46:10] (03PS2) 10Gehel: Configuration for new maps cluster in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/294914 [13:47:24] (03CR) 10Paladox: [C: 031] "Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/293789 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn) [13:52:40] (03PS1) 10Giuseppe Lavagetto: etcd: perform backups to /srv/backups/etcd, bacula [puppet] - 10https://gerrit.wikimedia.org/r/294916 (https://phabricator.wikimedia.org/T135129) [13:53:54] (03CR) 10jenkins-bot: [V: 04-1] etcd: perform backups to /srv/backups/etcd, bacula [puppet] - 10https://gerrit.wikimedia.org/r/294916 (https://phabricator.wikimedia.org/T135129) (owner: 10Giuseppe Lavagetto) [13:57:08] (03PS2) 10Giuseppe 
Lavagetto: etcd: perform backups to /srv/backups/etcd, bacula [puppet] - 10https://gerrit.wikimedia.org/r/294916 (https://phabricator.wikimedia.org/T135129) [13:57:25] <_joe_> akosiaris: did I use the backup classes correctly? [13:57:26] 06Operations: Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991#2388637 (10MoritzMuehlenhoff) >>! In T135991#2377176, @ArielGlenn wrote: > I'm fine with a daily salt-minion restart but let's make sure that it doesn't leave a duplicate (old) salt-minion ru... [13:59:16] 06Operations: ntp restart sometimes unrealiable - https://phabricator.wikimedia.org/T126733#2388640 (10MoritzMuehlenhoff) [13:59:18] 06Operations: Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991#2388639 (10MoritzMuehlenhoff) [13:59:33] 06Operations: Restarts of ganglia-monitor are unreliable - https://phabricator.wikimedia.org/T135723#2388642 (10MoritzMuehlenhoff) [13:59:35] 06Operations: Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991#2318329 (10MoritzMuehlenhoff) [14:03:51] (03PS1) 10KartikMistry: apertium-eo-fr: New upstream release and Jessie rebuild [debs/contenttranslation/apertium-eo-fr] - 10https://gerrit.wikimedia.org/r/294917 (https://phabricator.wikimedia.org/T107306) [14:07:40] 06Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 4 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2388644 (10KartikMistry) [14:07:53] (03PS1) 10Gehel: Configuration for new elasticsearch servers in eqiad. 
[puppet] - 10https://gerrit.wikimedia.org/r/294918 [14:08:37] (03CR) 10Alexandros Kosiaris: [C: 04-1] etcd: perform backups to /srv/backups/etcd, bacula (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/294916 (https://phabricator.wikimedia.org/T135129) (owner: 10Giuseppe Lavagetto) [14:08:46] _joe_: I 've commented [14:10:04] (03CR) 10Gehel: Configuration for new elasticsearch servers in eqiad. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/294918 (owner: 10Gehel) [14:11:16] <_joe_> thanks :) [14:11:31] (03CR) 10Alexandros Kosiaris: [C: 031] "+1 with a small minor comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/294914 (owner: 10Gehel) [14:13:15] akosiaris: where did you find the LVS IP for maps? (and where can I find it myself next time). [14:13:24] (03PS1) 10Faidon Liambotis: ripeatlas: add IPv6 measurements as well [puppet] - 10https://gerrit.wikimedia.org/r/294919 [14:13:26] (03PS1) 10Faidon Liambotis: ripeatlas: lower intervals to 5/1 from 10/5 [puppet] - 10https://gerrit.wikimedia.org/r/294920 [14:14:40] (03CR) 10Faidon Liambotis: [C: 032] ripeatlas: add IPv6 measurements as well [puppet] - 10https://gerrit.wikimedia.org/r/294919 (owner: 10Faidon Liambotis) [14:14:43] (03PS3) 10Gehel: Configuration for new maps cluster in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/294914 [14:14:45] (03CR) 10jenkins-bot: [V: 04-1] ripeatlas: add IPv6 measurements as well [puppet] - 10https://gerrit.wikimedia.org/r/294919 (owner: 10Faidon Liambotis) [14:14:47] (03CR) 10Faidon Liambotis: [C: 032] ripeatlas: lower intervals to 5/1 from 10/5 [puppet] - 10https://gerrit.wikimedia.org/r/294920 (owner: 10Faidon Liambotis) [14:15:01] (03CR) 10Jcrespo: [C: 032] Increase db1072 weight after repooling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294910 (owner: 10Jcrespo) [14:15:02] <_joe_> jenkins doesn't like us [14:15:11] (03CR) 10jenkins-bot: [V: 04-1] ripeatlas: lower intervals to 5/1 from 10/5 [puppet] - 
10https://gerrit.wikimedia.org/r/294920 (owner: 10Faidon Liambotis) [14:15:14] (03PS4) 10Alexandros Kosiaris: networks::constants: use slice_network_constants [puppet] - 10https://gerrit.wikimedia.org/r/291819 [14:15:16] (03PS30) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [14:15:19] well it's right :) [14:15:38] <_joe_> paravoid: it's clearly biased [14:15:48] <_joe_> inconscious bias it is! [14:15:56] (03PS2) 10Faidon Liambotis: ripeatlas: lower intervals to 5/1 from 10/5 [puppet] - 10https://gerrit.wikimedia.org/r/294920 [14:15:58] (03PS2) 10Faidon Liambotis: ripeatlas: add IPv6 measurements as well [puppet] - 10https://gerrit.wikimedia.org/r/294919 [14:16:14] gehel: https://phabricator.wikimedia.org/diffusion/ODNS/browse/master/templates/wmnet$4213 and https://phabricator.wikimedia.org/diffusion/ODNS/browse/master/templates/10.in-addr.arpa$51 [14:16:22] I 've already reserved it for you :-) [14:16:32] akosiaris: thanks! 
[14:16:41] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Increase db1072 weight after repooling (duration: 00m 36s) [14:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:19:47] Could not start Service[isc-dhcp-server]: Execution of '/sbin/start isc-dhcp-server' returned 1 [14:20:57] (03CR) 10Giuseppe Lavagetto: etcd: perform backups to /srv/backups/etcd, bacula (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/294916 (https://phabricator.wikimedia.org/T135129) (owner: 10Giuseppe Lavagetto) [14:21:27] (03PS3) 10Giuseppe Lavagetto: etcd: perform backups to /srv/backups/etcd, bacula [puppet] - 10https://gerrit.wikimedia.org/r/294916 (https://phabricator.wikimedia.org/T135129) [14:23:58] I broke it [14:24:02] and I will fix it [14:24:26] (03CR) 10Faidon Liambotis: [C: 032] ripeatlas: add IPv6 measurements as well [puppet] - 10https://gerrit.wikimedia.org/r/294919 (owner: 10Faidon Liambotis) [14:24:32] (03CR) 10Faidon Liambotis: [C: 032] ripeatlas: lower intervals to 5/1 from 10/5 [puppet] - 10https://gerrit.wikimedia.org/r/294920 (owner: 10Faidon Liambotis) [14:26:03] this is the first time in a long time where logs were immediately useful [14:26:16] we should celebrate it! [14:28:23] (03PS1) 10Jcrespo: Fix missing '}' after entry [puppet] - 10https://gerrit.wikimedia.org/r/294921 [14:29:02] 06Operations, 10Android-app-feature-Feeds, 10Mobile-Content-Service, 10RESTBase, and 3 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2388656 (10Mholloway) @ori, thank you, those are excellent points and I've created a new task to address them.... 
[14:29:31] 06Operations, 06Revision-Scoring-As-A-Service, 06Services, 07service-deployment-requests: New Service Request: ORES - https://phabricator.wikimedia.org/T117560#2388657 (10akosiaris) 05Open>03Resolved a:03akosiaris Resolving since ORES has been in production for the past 2 weeks [14:30:26] 06Operations, 10Android-app-feature-Feeds, 10Mobile-Content-Service, 10RESTBase, and 3 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2388660 (10BBlack) @mholloway - any chance this has interactions with the lazy-images/lazy-refs experiments, whic... [14:30:57] (03PS5) 10Giuseppe Lavagetto: systemd: add systemd::sidekick [puppet] - 10https://gerrit.wikimedia.org/r/291949 [14:31:08] (03CR) 10Jcrespo: [C: 032] Fix missing '}' after entry [puppet] - 10https://gerrit.wikimedia.org/r/294921 (owner: 10Jcrespo) [14:32:36] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures [14:33:35] (03PS1) 10Mobrovac: Revert "Revert "Change Prop: Disable transclusion update rules"" [puppet] - 10https://gerrit.wikimedia.org/r/294922 [14:34:06] RECOVERY - puppet last run on carbon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:34:29] (03CR) 10Giuseppe Lavagetto: [C: 032] Revert "Revert "Change Prop: Disable transclusion update rules"" [puppet] - 10https://gerrit.wikimedia.org/r/294922 (owner: 10Mobrovac) [14:34:48] heh "puppet fail" flapping on restbase1007 is https://phabricator.wikimedia.org/T137952, I've silenced the alarm now but I'm not sure how to best fix it, update the cassandra version in puppet perhaps [14:35:32] godog: perhaps have puppet ensure the version that should applied on each node (2.1 vs 2.2)? 
[14:36:31] godog: oh, no, no no ensure, a custom 2.2 deb has been installed on rb1007 [14:37:29] mobrovac: ah yeah, now I get it, will fix in puppet [14:40:29] (03PS1) 10Filippo Giunchedi: cassandra: default to 2.2.6-wmf1 [puppet] - 10https://gerrit.wikimedia.org/r/294924 (https://phabricator.wikimedia.org/T137952) [14:46:51] (03CR) 10Eevans: "LGTM, but could we remove the cassandra::version overrides in hieradata/regex.yaml as part of the same changeset?" [puppet] - 10https://gerrit.wikimedia.org/r/294924 (https://phabricator.wikimedia.org/T137952) (owner: 10Filippo Giunchedi) [14:51:00] (03CR) 10Thcipriani: [C: 031] Deploy Compact Language Links as default (Stage 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294874 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry) [14:55:08] !log scb disabling puppet for stopping change-prop to clear transclusion queues [14:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:56:25] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [14:58:18] (03PS2) 10Alexandros Kosiaris: Remove nas1001-a.eqiad.wmnet, nas1001-b.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/267056 (https://phabricator.wikimedia.org/T124156) [14:59:34] (03CR) 10Alexandros Kosiaris: [C: 032] Remove nas1001-a.eqiad.wmnet, nas1001-b.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/267056 (https://phabricator.wikimedia.org/T124156) (owner: 10Alexandros Kosiaris) [15:00:04] 06Operations, 10ops-eqiad, 06DC-Ops, 13Patch-For-Review: decomission the netapps in EQIAD: nas1001-a, nas1001-b - https://phabricator.wikimedia.org/T124156#2388690 (10akosiaris) [15:00:06] 06Operations, 10Traffic: unix domain socket listening for varnish4 - https://phabricator.wikimedia.org/T138084#2388691 (10BBlack) [15:00:55] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: Generic error: Generic connection error: 
HTTPConnectionPool(host=10.64.0.16, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:03:31] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:03:31] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [15:07:32] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [15:07:41] (03CR) 10Hashar: "All of gallium is in puppet/configuration management BUT the Jenkins configuration. The whole point is thus to backup /var/lib/jenkins" [puppet] - 10https://gerrit.wikimedia.org/r/293690 (https://phabricator.wikimedia.org/T80385) (owner: 10Muehlenhoff) [15:08:03] akosiaris: I have replied about gallium backup. In short /var/lib/jenkins is what we want to backup, but we would want to exclude the builds history [15:22:25] (03PS1) 10Muehlenhoff: Restart exim daily on Monday to Friday [puppet] - 10https://gerrit.wikimedia.org/r/294929 (https://phabricator.wikimedia.org/T135991) [15:22:55] (03CR) 10Hashar: "The bacula fileset configuration is expanded from modules/bacula/templates/bacula-dir-fileset.erb . 
The Exclude are defined with File keyw" [puppet] - 10https://gerrit.wikimedia.org/r/293690 (https://phabricator.wikimedia.org/T80385) (owner: 10Muehlenhoff) [15:34:08] 06Operations, 06Collaboration-Team-Interested, 10DBA, 10Flow, 07WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610#2388783 (10Mattflaschen-WMF) By timestamp: Converted from UUID with: ``` Flow\Model\UUID::create( strtolower( ' 06Operations, 10Traffic, 06Community-Liaisons (Jul-Sep-2016): Help contact bot owners about the end of HTTP access to the API - https://phabricator.wikimedia.org/T136674#2388786 (10BBlack) New usernames in the past 24H: ``` Raboe001 ``` (I figure no point repeating the already-notified list every day, but c... [15:44:52] ACKNOWLEDGEMENT - puppet last run on mw1299 is CRITICAL: CRITICAL: Puppet has 2 failures Giuseppe Lavagetto Both jobrunner and jobchron are masked while I wait to activate the jessie jobrunner [15:54:05] hasharAway: yeah looking right now [15:54:58] !log Restarting Cassandra on xenon.eqiad.wmnet to enable large pages : T137419 [15:54:59] T137419: Investigate aberrant disk read throughput in Cassandra 2.2.6 - https://phabricator.wikimedia.org/T137419 [15:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:55:56] (03CR) 10Paladox: "There now done at https://gerrit.wikimedia.org/r/#/c/294867/" [puppet] - 10https://gerrit.wikimedia.org/r/293221 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn) [15:56:05] (03CR) 10Paladox: [C: 04-1] git.wikimedia.org -> Diffusion redirects [puppet] - 10https://gerrit.wikimedia.org/r/293221 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn) [15:56:18] 06Operations, 06Collaboration-Team-Interested, 10DBA, 10Flow, 07WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610#2388820 (10jcrespo) The question was because, unlike beta, 
the setup is a bit more complex: we have read-write and... [15:59:30] !log Starting html dumps from xenon.eqiad.wmnet and cerium.eqiad.wmnet : T137419 [15:59:31] T137419: Investigate aberrant disk read throughput in Cassandra 2.2.6 - https://phabricator.wikimedia.org/T137419 [15:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:59:51] (03CR) 10Alexandros Kosiaris: "@hashar: I see what you mean. I 'll look into enabling the wilddir option" [puppet] - 10https://gerrit.wikimedia.org/r/293690 (https://phabricator.wikimedia.org/T80385) (owner: 10Muehlenhoff) [16:00:59] (03PS2) 10Filippo Giunchedi: cassandra: default to 2.2.6-wmf1 [puppet] - 10https://gerrit.wikimedia.org/r/294924 (https://phabricator.wikimedia.org/T137952) [16:03:13] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/294924 (https://phabricator.wikimedia.org/T137952) (owner: 10Filippo Giunchedi) [16:06:30] 06Operations, 10ops-codfw, 10media-storage: codfw: rack/setup/deploy ms-be202[2-7] switch configuration - https://phabricator.wikimedia.org/T138052#2388842 (10RobH) [16:08:54] 06Operations, 10ops-codfw, 10media-storage: codfw: rack/setup/deploy ms-be202[2-7] switch configuration - https://phabricator.wikimedia.org/T138052#2388846 (10RobH) 05Open>03Resolved switch config updated [16:08:56] 06Operations, 10ops-codfw, 10media-storage, 13Patch-For-Review: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T136630#2388848 (10RobH) [16:09:12] (03PS1) 10Muehlenhoff: Move dataset ferm rules into the role [puppet] - 10https://gerrit.wikimedia.org/r/294930 [16:10:06] (03PS1) 10Jcrespo: Repool db1073 with low weight after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294931 [16:12:44] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] "PCC says yes: https://puppet-compiler.wmflabs.org/3142/" [puppet] - 10https://gerrit.wikimedia.org/r/294924 (https://phabricator.wikimedia.org/T137952) (owner: 10Filippo Giunchedi) 
[16:13:01] (03PS1) 10RobH: adding production dns entries for ms-be202[2-7] [dns] - 10https://gerrit.wikimedia.org/r/294932 [16:13:03] (03PS3) 10Filippo Giunchedi: cassandra: default to 2.2.6-wmf1 [puppet] - 10https://gerrit.wikimedia.org/r/294924 (https://phabricator.wikimedia.org/T137952) [16:13:09] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: default to 2.2.6-wmf1 [puppet] - 10https://gerrit.wikimedia.org/r/294924 (https://phabricator.wikimedia.org/T137952) (owner: 10Filippo Giunchedi) [16:14:20] (03CR) 10Muehlenhoff: "PCC: http://puppet-compiler.wmflabs.org/3143/" [puppet] - 10https://gerrit.wikimedia.org/r/294930 (owner: 10Muehlenhoff) [16:14:34] (03CR) 10RobH: [C: 032] adding production dns entries for ms-be202[2-7] [dns] - 10https://gerrit.wikimedia.org/r/294932 (owner: 10RobH) [16:16:29] 06Operations, 13Patch-For-Review: "puppet fail" flapping on restbase1007 - https://phabricator.wikimedia.org/T137952#2388869 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi fixed in https://gerrit.wikimedia.org/r/294924 [16:17:41] 06Operations, 10ops-codfw, 10media-storage, 13Patch-For-Review: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T136630#2388873 (10RobH) So these are HP with HW raid controllers. We'll need @papaul to setup the raid10 of the primary spinning disks for these before we can proceed with i... [16:21:00] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2388901 (10Paladox) I guess we remove https://github.com/wikimedia/operations-puppet/blob/52737634512bf43f8f98b757be4... [16:21:22] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2388902 (10Paladox) But we will need to update git.wikimedia.org ip to use iridium. 
[16:22:51] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures [16:25:23] 06Operations, 10ops-codfw, 10media-storage, 13Patch-For-Review: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T136630#2388907 (10fgiunchedi) @robh, yeah same as other the others, namely raid configuration for ms-be is all disks in raid0 from the hw controller, similarly to https://pha... [16:27:16] 06Operations, 10ops-codfw, 10media-storage, 13Patch-For-Review: rack/setup/deploy ms-be202[2-7] - https://phabricator.wikimedia.org/T136630#2388911 (10RobH) So I was wrong, no hw raid setup other than they need to present as individual disks to the OS. [16:29:31] !log installing squid security updates on carbon [16:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:31:52] (03PS4) 10Gehel: Configuration for new maps cluster in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/294914 (https://phabricator.wikimedia.org/T135018) [16:33:52] (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/291710 (owner: 10Filippo Giunchedi) [16:34:07] (03PS1) 10Dereckson: Allow sysops to add to/remove from confirmed on ca.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294933 (https://phabricator.wikimedia.org/T138069) [16:38:37] 06Operations, 06Discovery, 06Maps: Configure new maps servers in eqiad - https://phabricator.wikimedia.org/T138092#2388933 (10Gehel) [16:39:04] 06Operations, 06Discovery, 06Maps, 03Maps-Sprint, 13Patch-For-Review: Rack/Setup 4 map servers in eqiad - https://phabricator.wikimedia.org/T135018#2285911 (10Gehel) Closing this. The configuration is tracked in T138092. 
[16:39:27] (03PS5) 10Gehel: Configuration for new maps cluster in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/294914 (https://phabricator.wikimedia.org/T138092) [16:44:55] 06Operations, 10MediaWiki-General-or-Unknown, 06Services, 10Traffic: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093#2388961 (10BBlack) [16:46:54] (03PS5) 10Filippo Giunchedi: prometheus: add nginx reverse proxy [puppet] - 10https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) [16:47:03] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:47:26] (03CR) 10jenkins-bot: [V: 04-1] prometheus: add nginx reverse proxy [puppet] - 10https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [16:47:48] (03PS6) 10Filippo Giunchedi: prometheus: add nginx reverse proxy [puppet] - 10https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) [16:49:15] 06Operations, 10MediaWiki-General-or-Unknown, 06Services, 10Traffic: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093#2388975 (10BBlack) For reference, here's the raw data at the end of some munging of a 5-minute sample of received URLs in a single cache... [16:49:58] 06Operations, 06Collaboration-Team-Interested, 10DBA, 10Flow, 07WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610#2388976 (10Mattflaschen-WMF) Yeah. I checked what the first use of External Store was just now: Seems it was use... 
[16:50:19] (03PS4) 10Filippo Giunchedi: prometheus: add tools role [puppet] - 10https://gerrit.wikimedia.org/r/291710 [16:56:58] 06Operations, 06Collaboration-Team-Interested, 10DBA, 10Flow, 07WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610#2388983 (10jcrespo) That is good news, if I understand it correctly it means there is no revisions in cluster23 or... [17:08:04] (03CR) 10Luke081515: [C: 031] Allow sysops to add to/remove from confirmed on ca.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294933 (https://phabricator.wikimedia.org/T138069) (owner: 10Dereckson) [17:08:18] (03CR) 1020after4: [C: 031] "the lists should probably be generated but this is fine for now." [puppet] - 10https://gerrit.wikimedia.org/r/294742 (https://phabricator.wikimedia.org/T110068) (owner: 10Thcipriani) [17:09:27] (03CR) 10Ori.livneh: prometheus: add nginx reverse proxy (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [17:12:43] (03CR) 10Ori.livneh: "actually could you get away with getting rid of the erb file altogether by using a $title-agnostic regexp, like location ~ /(\-|debug) ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [17:14:58] 06Operations, 07Blocked-on-RelEng, 05Gitblit-Deprecate, 13Patch-For-Review: Phase out antimony.wikimedia.org (git.wikimedia.org / gitblit) - https://phabricator.wikimedia.org/T123718#2389056 (10JanZerebecki) [17:18:48] (03CR) 1020after4: [C: 031] varnish: git.wm.org to iridium, remove related config/tests/monitoring [puppet] - 10https://gerrit.wikimedia.org/r/293789 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn) [17:19:48] (03CR) 1020after4: "this is good to go as soon as I67ad308f9e6373e5234cb2d83006457d6f467bf8 is deployed" [puppet] - 10https://gerrit.wikimedia.org/r/293789 (https://phabricator.wikimedia.org/T137224) (owner: 10Dzahn) [17:22:01] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2389079 (10mmodell) So we need these two to merge in the listed order: 1. https://gerrit.wikimedia.org/r/#/c/293789/... [17:23:05] 06Operations, 06Collaboration-Team-Interested, 10DBA, 10Flow, 07WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610#2389080 (10matthiasmullie) Indeed: ``` mysql:wikiadmin@10.64.16.18 [flowdb]> SELECT DISTINCT SUBSTR(rev_content,... 
[17:24:32] PROBLEM - puppet last run on mw1169 is CRITICAL: CRITICAL: Puppet has 1 failures [17:34:45] (03Abandoned) 10Faidon Liambotis: interface: disable IPv6 autoconf on precise hosts [puppet] - 10https://gerrit.wikimedia.org/r/217317 (owner: 10Faidon Liambotis) [17:38:42] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2389143 (10Paladox) @mmodell do we send an email out on wikitech-I saying git.wikimedia.org will be redirected soon. [17:40:14] (03CR) 10Faidon Liambotis: [C: 031] "Nice! I'd very much prefer to ditch these two variables but until we do so, this will do." [puppet] - 10https://gerrit.wikimedia.org/r/291819 (owner: 10Alexandros Kosiaris) [17:42:10] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2389149 (10Aklapper) If that was a question the sentence has to end with a question mark. Always. [17:42:56] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2389150 (10Paladox) @Aklapper ok sorry, done. 
[17:43:19] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2389151 (10mmodell) lol [17:50:12] RECOVERY - puppet last run on mw1169 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [17:53:46] (03CR) 10Faidon Liambotis: [C: 04-1] network: add $production_networks (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [18:00:41] (03PS2) 10Paladox: Block access to jice.ddns.net instead of ip [puppet] - 10https://gerrit.wikimedia.org/r/277904 [18:04:04] (03PS1) 10Faidon Liambotis: otrs: add check_procs for clamd/freshclam [puppet] - 10https://gerrit.wikimedia.org/r/294939 (https://phabricator.wikimedia.org/T137188) [18:07:38] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2389616 (10greg) Yeah, we should announce before it happens. [18:13:25] 06Operations, 13Patch-For-Review: Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991#2318329 (10faidon) A few of these could be quite problematic, unfortunately. I'd proceed with caution. - ntpd: we have Icinga checks for that I think, so they might trip... 
[18:30:25] (03CR) 10Jcrespo: [C: 032] Repool db1073 with low weight after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294931 (owner: 10Jcrespo) [18:32:06] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1073 with low weight after reimage (duration: 00m 35s) [18:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:53:20] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate, 13Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2390465 (10Paladox) We also need to remove https://github.com/wikimedia/operations-puppet/blob/bdd27ef834044a25c5a4e6... [18:56:35] !log Restarting Cassandra on xenon.eqiad.wmnet with -XX:+PreserveFramePointer : T137419 [18:56:36] T137419: Investigate aberrant disk read throughput in Cassandra 2.2.6 - https://phabricator.wikimedia.org/T137419 [18:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:29:14] (03PS1) 1020after4: Phabricator: remove remote 'origin' from system-wide gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/294945 [19:29:37] (03PS2) 1020after4: Phabricator: remove remote 'origin' from system-wide gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/294945 [19:31:24] (03CR) 1020after4: [C: 031] "Phabricator has since fixed the issue that necessitated adding this in the first place so it's safe to remove." 
[puppet] - https://gerrit.wikimedia.org/r/294945 (owner: 20after4) [19:33:50] (CR) Paladox: [C: 1] ":)" [puppet] - https://gerrit.wikimedia.org/r/294945 (owner: 20after4) [19:35:08] (PS3) 20after4: Phabricator: remove remote 'origin' from system-wide gitconfig [puppet] - https://gerrit.wikimedia.org/r/294945 (https://phabricator.wikimedia.org/T137819) [19:47:44] Operations, netops: Notify transits of new esams prefixes - https://phabricator.wikimedia.org/T81989#2390595 (faidon) Open>Resolved a:faidon I verified via looking glasses that those routes are being successfully propagated by all of our Amsterdam transits but one. I've mailed them for comple... [20:15:11] PROBLEM - check_puppetrun on rigel is CRITICAL: CRITICAL: puppet fail [20:20:04] Operations, Android-app-feature-Feeds, Mobile-Content-Service, RESTBase, and 3 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2390668 (Mholloway) Mystery solved: due to a session logging schema change, the Android app began sending sessi... 
[20:20:11] RECOVERY - check_puppetrun on rigel is OK: OK: Puppet is currently enabled, last run 272 seconds ago with 0 failures [20:20:21] Operations, Android-app-feature-Feeds, Mobile-Content-Service, RESTBase, and 3 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2390669 (Mholloway) Open>Resolved [20:23:27] !log maxsem@tin Synchronized php-1.28.0-wmf.6/extensions/WikimediaEvents/: https://gerrit.wikimedia.org/r/#/c/294958/ (duration: 00m 33s) [20:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:27:41] PROBLEM - puppet last run on mw2118 is CRITICAL: CRITICAL: puppet fail [20:35:51] !log Disabling puppet on xenon.eqiad.wmnet : T137419 [20:35:52] T137419: Investigate aberrant disk read throughput in Cassandra 2.2.6 - https://phabricator.wikimedia.org/T137419 [20:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:38:04] Operations, Labs, Patch-For-Review: Phase out the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#2390675 (hashar) [20:39:11] !log Restarting Cassandra on xenon.eqiad.wmnet to apply -XX:+PreserveFramePointer : T137419 [20:39:12] T137419: Investigate aberrant disk read throughput in Cassandra 2.2.6 - https://phabricator.wikimedia.org/T137419 [20:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:47:45] !sal [20:47:45] https://wikitech.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need. 
[20:48:36] (PS1) BBlack: tlsproxy: drop ssl cache size back to 1G [puppet] - https://gerrit.wikimedia.org/r/295006 [20:49:04] MaxSem: the schema.geoFormat issue is gone for me ( https://phabricator.wikimedia.org/T138078 ) [20:49:15] yeah [20:49:16] MaxSem: probably want to reply a nice thing to the bug filler and close the task :) [20:49:32] (CR) BBlack: [C: 2 V: 2] tlsproxy: drop ssl cache size back to 1G [puppet] - https://gerrit.wikimedia.org/r/295006 (owner: BBlack) [20:49:33] MaxSem: I guess thank you for the patch [20:49:36] "I suck" :P [20:52:02] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 5 failures [20:57:09] (PS1) BBlack: cache_upload: experiment with 4h fe ttl cap [puppet] - https://gerrit.wikimedia.org/r/295007 (https://phabricator.wikimedia.org/T124954) [20:57:32] RECOVERY - puppet last run on mw2118 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:57:41] (CR) BBlack: [C: 2 V: 2] cache_upload: experiment with 4h fe ttl cap [puppet] - https://gerrit.wikimedia.org/r/295007 (https://phabricator.wikimedia.org/T124954) (owner: BBlack) [21:00:03] PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:00:27] MaxSem: I disagree! You have noticed the task, figured out it was an easy fix, asked to deploy, get a review and got it pushed :) [21:01:10] get a nice word / summary and it is all set! [21:02:52] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:03:58] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [21:06:46] MaxSem: sleep well :) [21:08:09] PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:11:39] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:12:19] PROBLEM - puppet last run on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:16:08] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [21:21:45] !log Reenabling puppet and resetting configuration on xenon.eqiad.wmnet : T137419 [21:21:46] T137419: Investigate aberrant disk read throughput in Cassandra 2.2.6 - https://phabricator.wikimedia.org/T137419 [21:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:24:08] RECOVERY - restbase endpoints health on xenon is OK: All endpoints are healthy [21:24:49] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 51 minutes ago with 0 failures [21:30:21] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2390792 (Paladox) @greg do you know who would send the email out? [21:45:30] (PS1) Paladox: Only mirror refs/heads/ and refs/tags/ for mw core and operations/puppet [puppet] - https://gerrit.wikimedia.org/r/295011 [21:47:44] (CR) Paladox: "Were switching mirrors on and this is required for the biggest repo's." 
[puppet] - https://gerrit.wikimedia.org/r/295011 (owner: Paladox) [21:52:08] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures [22:00:08] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 610 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5297808 keys - replication_delay is 610 [22:04:29] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5244813 keys - replication_delay is 0 [22:05:58] Operations, ops-eqiad, Patch-For-Review: Rack and Set up new application servers mw1284-1306 - https://phabricator.wikimedia.org/T134309#2390881 (Southparkfan) @Joe mw1306 seems to have the same IP as mw1091 (although that one shows up as mw1091 in Ganglia, whereas mw1090 shows up as mw1305...). So I... [22:10:14] Operations, Android-app-feature-Feeds, Mobile-Content-Service, RESTBase, and 3 others: Investigate Android app API request latency regression - https://phabricator.wikimedia.org/T138010#2390885 (GWicke) This is the [updated latency graph](https://docs.google.com/spreadsheets/d/1ZcaXdxMhaAEFMferqi... [22:11:56] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2390889 (greg) One of me, Mukunda, or Chad, probably. [22:18:05] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2390924 (Paladox) Ok Thanks for replying. 
[22:45:29] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [23:29:55] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave] [23:29:55] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/3/1: down - Transit: Zayo (IPYX/125449/003/ZYO) {#11542} [10Gbps] [23:30:26] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave] [23:38:35] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [23:38:35] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [23:39:06] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0