[00:20:58] <mobrovac>	 !log restbase deploying 7c753fe6
[00:21:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:03] <icinga-wm>	 PROBLEM - Check systemd state on graphite1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:32:13] <icinga-wm>	 PROBLEM - carbon-cache@c service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@c is failed
[00:32:30] * volans looking
[00:33:23] <volans>	 oom-killer
[00:36:03] <icinga-wm>	 RECOVERY - Check systemd state on graphite1003 is OK: OK - running: The system is fully operational
[00:36:13] <icinga-wm>	 RECOVERY - carbon-cache@c service on graphite1003 is OK: OK - carbon-cache@c is active
[00:36:26] <volans>	 !log restarted carbon-cache@c on graphite1003 (was killed by oom-killer)
[00:36:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:23] <icinga-wm>	 PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[00:42:01] <wikibugs>	 Operations, Graphite, Monitoring: Fix permissions for systemd file - https://phabricator.wikimedia.org/T155869#2957345 (Volans)
[00:42:17] <wikibugs>	 Operations, Monitoring: Fix permissions for systemd file - https://phabricator.wikimedia.org/T155869#2957358 (Volans)
[00:59:54] <wikibugs>	 Operations, Parsoid, Patch-For-Review, User-mobrovac: Parsoid: fix logrotate - https://phabricator.wikimedia.org/T155768#2957367 (mobrovac) >>! In T155768#2955291, @Joe wrote: > We might need to amend service-runner to be able to rotate logs better.  Sending a SIGHUP to a service-runner master pr...
[01:06:23] <icinga-wm>	 PROBLEM - Host labstore1004 is DOWN: PING CRITICAL - Packet loss = 100%
[01:06:33] <icinga-wm>	 PROBLEM - Host analytics1031 is DOWN: PING CRITICAL - Packet loss = 100%
[01:06:34] <icinga-wm>	 PROBLEM - Host analytics1029 is DOWN: PING CRITICAL - Packet loss = 100%
[01:06:34] <icinga-wm>	 PROBLEM - Host analytics1028 is DOWN: PING CRITICAL - Packet loss = 100%
[01:06:34] <icinga-wm>	 PROBLEM - Host analytics1030 is DOWN: PING CRITICAL - Packet loss = 100%
[01:06:43] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 228, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/0/2: down - Core: asw-c-eqiad:xe-2/0/0 {#3458} [10Gbps DF]
[01:06:43] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 208, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/0/2: down - Core: asw-c-eqiad:xe-2/1/2 {#3464} [10Gbps DF]
[01:06:43] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: Generic error: paths
[01:06:53] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 504 (expecting: 200): /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 504 (expecting: 200): /pag
[01:06:53] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 504 (expecting: 200)
[01:06:53] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303): /page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 504 (expecting: 200)
[01:06:53] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 504 (expecting: 200): /data/citation/{format}/{query} (Get citation for Darth Vader) is WARNING: Test Get citation for Darth Vader responds with unexpected body: [0]/encyclopediaT
[01:06:53] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:06:53] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:06:53] <icinga-wm>	 PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:08:06] <icinga-wm>	 RECOVERY - configured eth on lvs1003 is OK: OK - interfaces up
[01:08:06] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy
[01:08:23] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy
[01:08:43] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy
[01:08:43] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[01:08:53] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy
[01:08:53] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy
[01:08:53] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy
[01:09:03] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
[01:10:04] <icinga-wm>	 PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[01:10:13] <icinga-wm>	 PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[01:12:14] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[01:12:28] <ema>	 looking, it seems like a transient network issue in eqiad ^
[01:13:18] <mobrovac>	 cr1 and cr2 are alerting ^
[01:14:03] <icinga-wm>	 RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:17:03] <icinga-wm>	 PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[01:18:13] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:19:13] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:20:50] <logmsgbot>	 !log mobrovac@tin Starting deploy [changeprop/deploy@eb27062]: (no message)
[01:20:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:21:27] <Krenair>	 mobrovac, this is not the best timing...
[01:21:54] <logmsgbot>	 !log mobrovac@tin Finished deploy [changeprop/deploy@eb27062]: (no message) (duration: 01m 03s)
[01:21:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:22:03] <icinga-wm>	 RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:22:13] <icinga-wm>	 RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:31:43] <icinga-wm>	 PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:38:28] <wikibugs>	 Operations, Graphite, Monitoring: graphite1003 short of available RAM - https://phabricator.wikimedia.org/T155872#2957436 (Volans)
[01:52:33] <icinga-wm>	 PROBLEM - configured eth on labstore1004 is CRITICAL: eth1 reporting no carrier.
[01:54:15] <volans>	 expected, ops is working on it
[01:54:33] <icinga-wm>	 PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[01:59:43] <icinga-wm>	 RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[02:01:33] <icinga-wm>	 RECOVERY - configured eth on labstore1004 is OK: OK - interfaces up
[02:02:01] <logmsgbot>	 !log l10nupdate@tin LocalisationUpdate failed: git pull of extensions failed
[02:02:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:22:53] <icinga-wm>	 RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[02:35:23] <icinga-wm>	 PROBLEM - puppet last run on kraz is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:03:23] <icinga-wm>	 RECOVERY - puppet last run on kraz is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[03:06:16] <wikibugs>	 Operations, Puppet, Labs: Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589#2757207 (scfc) +1.  The last time this bugged me I thought maybe the Puppet roles were re-read from the filesystem each time, looked at the source (`modules/openstack/files/liberty/horizon/puppet...
[03:28:03] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 791.23 seconds
[03:35:03] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 165.06 seconds
[03:54:39] <wikibugs>	 Operations, Labs, netops: asw-c2-eqiad reboots & fdb_mac_entry_mc_set() issues - https://phabricator.wikimedia.org/T155875#2957555 (faidon)
[03:56:35] <wikibugs>	 (CR) Tim Landscheidt: "This seems to be working only on Jessie (python-keystoneclient 2.3.1-3~bpo8+1, pulled from http://apt.wikimedia.org/wikimedia/), but not T" [puppet] - https://gerrit.wikimedia.org/r/329021 (https://phabricator.wikimedia.org/T104575) (owner: Alex Monk)
[04:00:13] <icinga-wm>	 PROBLEM - Check systemd state on graphite1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:02:13] <icinga-wm>	 PROBLEM - carbon-cache@c service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@c is failed
[04:02:34] <ema>	 carbon-cache@c killed by OOM killer again ^
[04:03:48] <ema>	 !log graphite1003: carbon-cache@c restarted, it's been killed by OOM killer again
[04:03:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:04:13] <icinga-wm>	 RECOVERY - Check systemd state on graphite1003 is OK: OK - running: The system is fully operational
[04:04:13] <icinga-wm>	 RECOVERY - carbon-cache@c service on graphite1003 is OK: OK - carbon-cache@c is active
[04:13:39] <wikibugs>	 (CR) Tim Landscheidt: "Sorry, I got confused.  On toolsbeta-puppetmaster7, expand_path is set to common for hiera, thus setting openstack::version gets set to "l" [puppet] - https://gerrit.wikimedia.org/r/329021 (https://phabricator.wikimedia.org/T104575) (owner: Alex Monk)
[04:29:03] <icinga-wm>	 PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:29:13] <icinga-wm>	 PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed
[04:32:36] <wikibugs>	 Operations, Graphite: Increased load on graphite1003, carbon-cache not autorestarting when killed by OOM - https://phabricator.wikimedia.org/T155876#2957609 (ema) p:Triage>High
[04:33:58] <wikibugs>	 Operations, Graphite: Increased load on graphite1003, carbon-cache not autorestarting when killed by OOM - https://phabricator.wikimedia.org/T155876#2957597 (ema)
[04:39:03] <icinga-wm>	 RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational
[04:39:13] <icinga-wm>	 RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active
[04:49:45] <wikibugs>	 (CR) Tim Landscheidt: "Actually: No.  After monkey-patching $::openstack::version (it is also set in modules/openstack/manifests/init.pp), Puppet fails because i" [puppet] - https://gerrit.wikimedia.org/r/329021 (https://phabricator.wikimedia.org/T104575) (owner: Alex Monk)
[04:53:32] <wikibugs>	 (PS3) Tim Landscheidt: Tools: Make tools-clush-generator project-agnostic [puppet] - https://gerrit.wikimedia.org/r/326892
[04:54:51] <wikibugs>	 (CR) Tim Landscheidt: [C: -1] "(Haven't tested after the change.)" [puppet] - https://gerrit.wikimedia.org/r/326892 (owner: Tim Landscheidt)
[04:55:23] <icinga-wm>	 PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:15:43] <icinga-wm>	 PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:22:53] <wikibugs>	 (CR) Andrew Bogott: "Are you maybe missing an apt-get update run between puppet runs?  I tested quite a lot of this on Trusty and tested it again just now.  I " [puppet] - https://gerrit.wikimedia.org/r/329021 (https://phabricator.wikimedia.org/T104575) (owner: Alex Monk)
[05:24:23] <icinga-wm>	 RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[05:27:48] <wikibugs>	 (CR) Andrew Bogott: "(Oh, said instance is packagetest.testlabs.eqiad.wmflabs, you probably have access.)" [puppet] - https://gerrit.wikimedia.org/r/329021 (https://phabricator.wikimedia.org/T104575) (owner: Alex Monk)
[05:33:42] <wikibugs>	 Operations, Graphite: Increased load on graphite1003, carbon-cache not autorestarting when killed by OOM - https://phabricator.wikimedia.org/T155876#2957616 (Volans) See also T155872
[05:41:00] <wikibugs>	 Operations, Graphite, Monitoring: Increased load on graphite1003, carbon-cache not autorestarting when killed by OOM - https://phabricator.wikimedia.org/T155876#2957618 (ema)
[05:44:43] <icinga-wm>	 RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:01:43] <icinga-wm>	 PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[06:24:33] <icinga-wm>	 PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[06:29:43] <icinga-wm>	 RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:33:43] <icinga-wm>	 PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py]
[06:35:45] <wikibugs>	 (PS1) Madhuvishy: nfs: Move backups to secondary DC to different times [puppet] - https://gerrit.wikimedia.org/r/333327
[06:37:23] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1259 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:39:23] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:41:23] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:43:33] <icinga-wm>	 PROBLEM - puppet last run on ms-be2007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata]
[06:48:33] <icinga-wm>	 PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:52:33] <icinga-wm>	 RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[07:02:43] <icinga-wm>	 RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[07:10:33] <icinga-wm>	 RECOVERY - puppet last run on ms-be2007 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[07:19:33] <icinga-wm>	 RECOVERY - Check HHVM threads for leakage on mw1260 is OK: OK
[07:32:23] <icinga-wm>	 RECOVERY - Check HHVM threads for leakage on mw1169 is OK: OK
[08:00:23] <icinga-wm>	 RECOVERY - Check HHVM threads for leakage on mw1168 is OK: OK
[08:31:15] <wikibugs>	 (CR) Hashar: "I wasn't aware of -x which is:" [puppet] - https://gerrit.wikimedia.org/r/333230 (https://phabricator.wikimedia.org/T155820) (owner: Hashar)
[09:22:23] <icinga-wm>	 RECOVERY - Check HHVM threads for leakage on mw1259 is OK: OK
[10:00:13] <icinga-wm>	 PROBLEM - puppet last run on elastic1038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:14:52] <wikibugs>	 (PS8) Juniorsys: geowiki module: Lint changes + modes/umask quoting [puppet] - https://gerrit.wikimedia.org/r/332101 (https://phabricator.wikimedia.org/T93645)
[10:15:07] <wikibugs>	 (PS7) Juniorsys: mediawiki module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332103 (https://phabricator.wikimedia.org/T93645)
[10:15:17] <wikibugs>	 (PS7) Juniorsys: postgresql module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332104 (https://phabricator.wikimedia.org/T93645)
[10:15:43] <wikibugs>	 (PS7) Juniorsys: toollabs role modules: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332110 (https://phabricator.wikimedia.org/T93645)
[10:15:54] <wikibugs>	 (PS7) Juniorsys: toollabs module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332111
[10:19:49] <wikibugs>	 (CR) Juniorsys: [C: +1] kartotherian: optional parameter listed before required [puppet] - https://gerrit.wikimedia.org/r/332956 (owner: Dzahn)
[10:24:34] <wikibugs>	 (CR) Juniorsys: [C: +1] nfs: Move backups to secondary DC to different times [puppet] - https://gerrit.wikimedia.org/r/333327 (owner: Madhuvishy)
[10:28:13] <icinga-wm>	 RECOVERY - puppet last run on elastic1038 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[10:28:45] <wikibugs>	 (CR) Juniorsys: [C: +1] Move some production apache config files to templates [puppet] - https://gerrit.wikimedia.org/r/322602 (https://phabricator.wikimedia.org/T1256) (owner: Alex Monk)
[10:52:23] <icinga-wm>	 PROBLEM - puppet last run on labsdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:21:23] <icinga-wm>	 RECOVERY - puppet last run on labsdb1001 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[12:00:53] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:01:53] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy
[13:00:43] <icinga-wm>	 PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:29:43] <icinga-wm>	 RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[14:40:23] <icinga-wm>	 PROBLEM - puppet last run on lvs1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:08:23] <icinga-wm>	 RECOVERY - puppet last run on lvs1006 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[16:03:33] <icinga-wm>	 PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.111 second response time
[16:13:13] <icinga-wm>	 PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:23:56] <wikibugs>	 (Draft1) Paladox: Phabricator: Fix phd init script, also use systemd script if the os is cable of it [puppet] - https://gerrit.wikimedia.org/r/333358
[16:24:00] <wikibugs>	 (PS2) Paladox: Phabricator: Fix phd init script, also use systemd script if the os is cable of it [puppet] - https://gerrit.wikimedia.org/r/333358
[16:24:47] <wikibugs>	 (CR) jerkins-bot: [V: -1] Phabricator: Fix phd init script, also use systemd script if the os is cable of it [puppet] - https://gerrit.wikimedia.org/r/333358 (owner: Paladox)
[16:25:20] <wikibugs>	 (PS3) Paladox: Phabricator: Fix phd init script, also use systemd script if the os is cable of it [puppet] - https://gerrit.wikimedia.org/r/333358
[16:30:33] <icinga-wm>	 RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.126 second response time
[16:42:13] <icinga-wm>	 RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[16:53:54] <wikibugs>	 (PS4) Paladox: Phabricator: Fix phd init script, also use systemd script if the os is cable of it [puppet] - https://gerrit.wikimedia.org/r/333358
[16:55:44] <wikibugs>	 (CR) Paladox: [C: +1] "Tested and the init script works." [puppet] - https://gerrit.wikimedia.org/r/333358 (owner: Paladox)
[17:05:23] <icinga-wm>	 PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[17:05:43] <icinga-wm>	 PROBLEM - puppet last run on maps1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:30:43] <icinga-wm>	 PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[17:32:12] <wikibugs>	 (CR) Dereckson: [C: +1] Add *.finds.org.uk to wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/333294 (https://phabricator.wikimedia.org/T155844) (owner: Urbanecm)
[17:32:23] <icinga-wm>	 RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[17:33:43] <icinga-wm>	 RECOVERY - puppet last run on maps1003 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[17:59:43] <icinga-wm>	 RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[18:02:44] <wikibugs>	 (CR) Paladox: [C: +1] "I tested like" [puppet] - https://gerrit.wikimedia.org/r/333358 (owner: Paladox)
[18:09:02] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -2] "This patch in its current form is not only wrong, but also avoids using" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/333358 (owner: Paladox)
[18:09:19] <_joe_>	 how can it work paladox ?
[18:09:27] <_joe_>	 it has a series of clear-cut errors
[18:09:47] <_joe_>	 oh you mean the init script
[18:10:00] <paladox>	 _joe_, i just realised that sudo service wont actually work with scripts, though i could not test that as the systemd one was installed.
[18:10:12] <paladox>	 i tested using ./phd start which worked.
[18:12:16] <paladox>	 _joe_ how does this base::service_unit work?
[18:27:18] <paladox>	 _joe_ I'm wondering, could you help with service_unit please? It looks like it may not work with phd, as we are running Phabricator's phd from /srv/phab/phabricator/bin/phd, same for systemd
[19:18:57] <wikibugs>	 (PS1) Urbanecm: Add n, n:es and n:fr as import sources in test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/333362 (https://phabricator.wikimedia.org/T155906)
[19:37:53] <_joe_>	 paladox: sorry, had to go afk
[19:38:03] <icinga-wm>	 PROBLEM - Disk space on ms-be1013 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sda1 is not accessible: Input/output error
[19:38:13] <paladox>	 oh
[19:38:29] <_joe_>	 I'd need to take a look at phd, to understand how to do that properly, but
[19:38:49] <_joe_>	 why do you need a sysv init script?
[19:38:57] <_joe_>	 isn't phabricator running on debian jessie?
[19:39:12] <paladox>	 _joe_ no
[19:39:13] <paladox>	 iridium is ubuntu trusty
[19:39:28] <_joe_>	 oh ok so you want an upstart script, probably
[19:39:33] <paladox>	 we are in the process of migrating it to a new server that runs debian, but currently i have no idea on the status of that.
[19:39:35] <paladox>	 oh yep
[19:40:21] <_joe_>	 yeah if I have time I can give a look, but probably I'll just ask mutante about that. I don't really have much time for this, sorry :/
[19:40:54] <paladox>	 Ok
[19:52:17] <_joe_>	 I mean I'll ask him what needs to be done
[19:53:43] <icinga-wm>	 PROBLEM - MegaRAID on ms-be1013 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline)
[19:53:54] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on ms-be1013 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T155907
[19:53:58] <wikibugs>	 Operations, ops-eqiad: Degraded RAID on ms-be1013 - https://phabricator.wikimedia.org/T155907#2958424 (ops-monitoring-bot)
[19:55:24] <icinga-wm>	 PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 609 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3080673 keys, up 82 days 11 hours - replication_delay is 609
[19:57:24] <icinga-wm>	 RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3077580 keys, up 82 days 11 hours - replication_delay is 0
[20:00:32] <logmsgbot>	 !log legoktm@tin Synchronized php-1.29.0-wmf.8/includes: Revert "Added reason suggestion in block/delete/protect forms" (1/2) - T34950 (duration: 01m 31s)
[20:00:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:38] <stashbot>	 T34950: Use jQuery.suggestions to add reason suggestions to block/delete/protect forms - https://phabricator.wikimedia.org/T34950
[20:01:30] <logmsgbot>	 !log legoktm@tin Synchronized php-1.29.0-wmf.8/resources: Revert "Added reason suggestion in block/delete/protect forms" (1/2) - T34950 (duration: 00m 39s)
[20:01:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:01:43] <legoktm>	 oops, that should have said 2/2
[20:01:48] <wikibugs>	 (PS2) Urbanecm: Add n, n:es and n:fr as import sources in test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/333362 (https://phabricator.wikimedia.org/T155906)
[20:02:04] <icinga-wm>	 RECOVERY - Disk space on ms-be1013 is OK: DISK OK
[20:02:47] <logmsgbot>	 !log legoktm@tin Synchronized php-1.29.0-wmf.8/RELEASE-NOTES-1.29: for completeness (duration: 00m 39s)
[20:02:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:03:14] <icinga-wm>	 PROBLEM - puppet last run on ms-be1013 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sda1]
[20:25:24] <icinga-wm>	 PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 619 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3079087 keys, up 82 days 12 hours - replication_delay is 619
[20:25:44] <icinga-wm>	 PROBLEM - Redis status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 634 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3079272 keys, up 82 days 11 hours - replication_delay is 634
[20:28:12] <wikibugs>	 Operations, Traffic, HTTPS: Add CAA records to our domains - https://phabricator.wikimedia.org/T155806#2958453 (ema)
[20:30:44] <icinga-wm>	 RECOVERY - Redis status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3078795 keys, up 82 days 12 hours - replication_delay is 0
[20:31:24] <icinga-wm>	 RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3078753 keys, up 82 days 12 hours - replication_delay is 0
[20:41:44] <icinga-wm>	 PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.218 second response time
[20:42:44] <icinga-wm>	 RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.649 second response time
[20:48:14] <icinga-wm>	 PROBLEM - puppet last run on oresrdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:16:14] <icinga-wm>	 RECOVERY - puppet last run on oresrdb1001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[21:45:24] <icinga-wm>	 PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[21:46:24] <icinga-wm>	 RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3079312 keys, up 82 days 13 hours - replication_delay is 0
[21:54:34] <icinga-wm>	 PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:56:53] <logmsgbot>	 !log mobrovac@tin Starting deploy [changeprop/deploy@2b980fa]: (no message)
[21:56:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:57:47] <logmsgbot>	 !log mobrovac@tin Finished deploy [changeprop/deploy@2b980fa]: (no message) (duration: 00m 54s)
[21:57:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:22:34] <icinga-wm>	 RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[22:29:04] <paladox>	 _joe_ hi, we can install systemd on trusty, as I managed to do it with sudo apt-get install systemd
[22:29:19] <paladox>	 it installed from ubuntu as far as i can see so no third party.
[22:36:58] <hashar>	 paladox: na not going to happen.  Trusty uses upstart
[22:37:08] <hashar>	 if you really need systemd, use Debian Jessie :D
[22:38:39] <paladox>	 hashar i just installed systemd and it worked
[22:38:53] <paladox>	 hashar this is for phabricator
[22:39:12] <bd808>	 paladox: replacing the default init system in a distro is not "easy"
[22:39:20] <bd808>	 nor is it really smart
[22:39:50] <bd808>	 is there some specific benefit of systemd that you think will make phab better?
[22:40:45] <paladox>	 bd808 just i am using the one for gerrit, and it doesn't work, so i am thinking it won't work for phabricator
[22:41:04] <bd808>	 "the one" what?
[22:41:53] <paladox>	 bd808 https://phabricator.wikimedia.org/diffusion/ODDX/browse/master/debian/gerrit.init
[22:42:24] <icinga-wm>	 PROBLEM - puppet last run on elastic1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:42:52] <bd808>	 that's a sysv init script, not a systemd unit
[22:43:09] <paladox>	 yep, that's why i am trying to test systemd.
[22:43:23] <bd808>	 these words... don't make sense
[22:43:48] <bd808>	 you are trying to write a systemd unit for gerrit?
[22:43:57] <bd808>	 or doing something for phabricator?
[22:43:58] <paladox>	 bd808, running gerrit's init script as a sysvinit service fails, i try to stop it and it won't stop.
[22:44:19] <paladox>	 it would be similar for phabricator.
[22:44:34] <paladox>	 as they are running external scripts.
[22:45:28] <bd808>	 how so? phabricator is a php application; gerrit is java. phab uses its own hoe grown daemon system (phd). apples and watermelons as far as comparison goes
[22:45:37] <bd808>	 *home grown
[22:46:58] <bd808>	 if `start-stop-daemon -K` isn't killing gerrit's jvm process that probably has something to do with a custom signal handler their java code installs
[22:47:06] <paladox>	 oh
[22:47:32] <paladox>	 but it kills it if i run the script without sudo service. so doing ./gerrit works.
[22:48:53] <paladox>	 it also runs rm -f "$GERRIT_PID" "$GERRIT_RUN"
[22:50:08] <wikibugs>	 (CR) Alex Monk: [C: 1] Tools: Make tools-clush-generator project-agnostic [puppet] - https://gerrit.wikimedia.org/r/326892 (owner: Tim Landscheidt)
[22:50:28] <Zppix>	 paladox, do you know why Gerrit.wikimedia.org doesn't usually remember me (i use firefox on windows 7 btw) even if i click remember me? P.S. i wish i had some $GERRIT_RUN in me. :P
[22:50:54] <paladox>	 Zppix gerrit uses a cookie
[22:50:59] <paladox>	 do you use private browsing?
[23:08:28] <wikibugs>	 Operations, Puppet, Beta-Cluster-Infrastructure: Make deployment-prep puppetmaster more similar to Production puppetmaster - https://phabricator.wikimedia.org/T146627#2958585 (Krenair) #2 is now done too.
[23:11:24] <icinga-wm>	 RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[23:21:41] <wikibugs>	 Puppet, Beta-Cluster-Infrastructure: Set up puppet exported resources to collect ssh host keys for beta - https://phabricator.wikimedia.org/T72792#2958592 (Krenair) a:Krenair
[23:28:37] <wikibugs>	 Operations, Puppet, Horizon, Labs: Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589#2958596 (bd808)
[23:28:40] <wikibugs>	 Puppet, Beta-Cluster-Infrastructure: Set up puppet exported resources to collect ssh host keys for beta - https://phabricator.wikimedia.org/T72792#2958597 (Krenair) I have a commit on -puppetmaster02 that does this, and it seems to mostly work. It seems to only include ecdsa keys, but it excludes hosts t...
[23:30:49] <p858snake>	 robh: just having a look at random old but open operations tasks, did someone forget to close out? https://phabricator.wikimedia.org/T86541
[23:39:24] <icinga-wm>	 PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:39:54] <wikibugs>	 Operations, Discovery, Traffic, Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#2958613 (Ricordisamoa) >>! In T153563#2949967, @Smalyshev wrote: > @Ricordisamoa I don't think reparsing object identifiers is a good id...