[00:20:58] !log restbase deploying 7c753fe6
[00:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:03] PROBLEM - Check systemd state on graphite1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:32:13] PROBLEM - carbon-cache@c service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@c is failed
[00:32:30] * volans looking
[00:33:23] oom-killer
[00:36:03] RECOVERY - Check systemd state on graphite1003 is OK: OK - running: The system is fully operational
[00:36:13] RECOVERY - carbon-cache@c service on graphite1003 is OK: OK - carbon-cache@c is active
[00:36:26] !log restarted carbon-cache@c on graphite1003 (was killed by oom-killer)
[00:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:23] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[00:42:01] Operations, Graphite, Monitoring: Fix permissions for systemd file - https://phabricator.wikimedia.org/T155869#2957345 (Volans)
[00:42:17] Operations, Monitoring: Fix permissions for systemd file - https://phabricator.wikimedia.org/T155869#2957358 (Volans)
[00:59:54] Operations, Parsoid, Patch-For-Review, User-mobrovac: Parsoid: fix logrotate - https://phabricator.wikimedia.org/T155768#2957367 (mobrovac) >>! In T155768#2955291, @Joe wrote: > We might need to amend service-runner to be able to rotate logs better. Sending a SIGHUP to a service-runner master pr...
[01:06:23] PROBLEM - Host labstore1004 is DOWN: PING CRITICAL - Packet loss = 100%
[01:06:33] PROBLEM - Host analytics1031 is DOWN: PING CRITICAL - Packet loss = 100%
[01:06:34] PROBLEM - Host analytics1029 is DOWN: PING CRITICAL - Packet loss = 100%
[01:06:34] PROBLEM - Host analytics1028 is DOWN: PING CRITICAL - Packet loss = 100%
[01:06:34] PROBLEM - Host analytics1030 is DOWN: PING CRITICAL - Packet loss = 100%
[01:06:43] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 228, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/0/2: down - Core: asw-c-eqiad:xe-2/0/0 {#3458} [10Gbps DF]
[01:06:43] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 208, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/0/2: down - Core: asw-c-eqiad:xe-2/1/2 {#3464} [10Gbps DF]
[01:06:43] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: Generic error: paths
[01:06:53] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 504 (expecting: 200): /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 504 (expecting: 200): /pag
[01:06:53] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 504 (expecting: 200)
[01:06:53] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303): /page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 504 (expecting: 200)
[01:06:53] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 504 (expecting: 200): /data/citation/{format}/{query} (Get citation for Darth Vader) is WARNING: Test Get citation for Darth Vader responds with unexpected body: [0]/encyclopediaT
[01:06:53] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:06:53] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:06:53] PROBLEM - restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:08:06] RECOVERY - configured eth on lvs1003 is OK: OK - interfaces up
[01:08:06] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy
[01:08:23] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy
[01:08:43] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy
[01:08:43] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[01:08:53] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy
[01:08:53] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy
[01:08:53] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy
[01:09:03] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
[01:10:04] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[01:10:13] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[01:12:14] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[01:12:28] looking, it seems like a transient network issue in eqiad ^
[01:13:18] cr1 and cr2 are alerting ^
[01:14:03] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:17:03] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[01:18:13] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:19:13] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:20:50] !log mobrovac@tin Starting deploy [changeprop/deploy@eb27062]: (no message)
[01:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:21:27] mobrovac, this is not the best timing...
[01:21:54] !log mobrovac@tin Finished deploy [changeprop/deploy@eb27062]: (no message) (duration: 01m 03s)
[01:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:22:03] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:22:13] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:31:43] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:38:28] Operations, Graphite, Monitoring: graphite1003 short of available RAM - https://phabricator.wikimedia.org/T155872#2957436 (Volans)
[01:52:33] PROBLEM - configured eth on labstore1004 is CRITICAL: eth1 reporting no carrier.
[01:54:15] expected, ops is working on it
[01:54:33] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[01:59:43] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[02:01:33] RECOVERY - configured eth on labstore1004 is OK: OK - interfaces up
[02:02:01] !log l10nupdate@tin LocalisationUpdate failed: git pull of extensions failed
[02:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:22:53] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[02:35:23] PROBLEM - puppet last run on kraz is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:03:23] RECOVERY - puppet last run on kraz is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[03:06:16] Operations, Puppet, Labs: Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589#2757207 (scfc) +1. The last time this bugged me I thought maybe the Puppet roles were re-read from the filesystem each time, looked at the source (`modules/openstack/files/liberty/horizon/puppet...
[03:28:03] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 791.23 seconds
[03:35:03] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 165.06 seconds
[03:54:39] Operations, Labs, netops: asw-c2-eqiad reboots & fdb_mac_entry_mc_set() issues - https://phabricator.wikimedia.org/T155875#2957555 (faidon)
[03:56:35] (CR) Tim Landscheidt: "This seems to be working only on Jessie (python-keystoneclient 2.3.1-3~bpo8+1, pulled from http://apt.wikimedia.org/wikimedia/), but not T" [puppet] - https://gerrit.wikimedia.org/r/329021 (https://phabricator.wikimedia.org/T104575) (owner: Alex Monk)
[04:00:13] PROBLEM - Check systemd state on graphite1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:02:13] PROBLEM - carbon-cache@c service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@c is failed
[04:02:34] carbon-cache@c killed by OOM killer again ^
[04:03:48] !log graphite1003: carbon-cache@c restarted, it's been killed by OOM killer again
[04:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:04:13] RECOVERY - Check systemd state on graphite1003 is OK: OK - running: The system is fully operational
[04:04:13] RECOVERY - carbon-cache@c service on graphite1003 is OK: OK - carbon-cache@c is active
[04:13:39] (CR) Tim Landscheidt: "Sorry, I got confused. On toolsbeta-puppetmaster7, expand_path is set to common for hiera, thus setting openstack::version gets set to "l" [puppet] - https://gerrit.wikimedia.org/r/329021 (https://phabricator.wikimedia.org/T104575) (owner: Alex Monk)
[04:29:03] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:29:13] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed
[04:32:36] Operations, Graphite: Increased load on graphite1003, carbon-cache not autorestarting when killed by OOM - https://phabricator.wikimedia.org/T155876#2957609 (ema) p:Triage>High
[04:33:58] Operations, Graphite: Increased load on graphite1003, carbon-cache not autorestarting when killed by OOM - https://phabricator.wikimedia.org/T155876#2957597 (ema)
[04:39:03] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational
[04:39:13] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active
[04:49:45] (CR) Tim Landscheidt: "Actually: No. After monkey-patching $::openstack::version (it is also set in modules/openstack/manifests/init.pp), Puppet fails because i" [puppet] - https://gerrit.wikimedia.org/r/329021 (https://phabricator.wikimedia.org/T104575) (owner: Alex Monk)
[04:53:32] (PS3) Tim Landscheidt: Tools: Make tools-clush-generator project-agnostic [puppet] - https://gerrit.wikimedia.org/r/326892
[04:54:51] (CR) Tim Landscheidt: [C: -1] "(Haven't tested after the change.)" [puppet] - https://gerrit.wikimedia.org/r/326892 (owner: Tim Landscheidt)
[04:55:23] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:15:43] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:22:53] (CR) Andrew Bogott: "Are you maybe missing an apt-get update run between puppet runs? I tested quite a lot of this on Trusty and tested it again just now. I " [puppet] - https://gerrit.wikimedia.org/r/329021 (https://phabricator.wikimedia.org/T104575) (owner: Alex Monk)
[05:24:23] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[05:27:48] (CR) Andrew Bogott: "(Oh, said instance is packagetest.testlabs.eqiad.wmflabs, you probably have access.)" [puppet] - https://gerrit.wikimedia.org/r/329021 (https://phabricator.wikimedia.org/T104575) (owner: Alex Monk)
[05:33:42] Operations, Graphite: Increased load on graphite1003, carbon-cache not autorestarting when killed by OOM - https://phabricator.wikimedia.org/T155876#2957616 (Volans) See also T155872
[05:41:00] Operations, Graphite, Monitoring: Increased load on graphite1003, carbon-cache not autorestarting when killed by OOM - https://phabricator.wikimedia.org/T155876#2957618 (ema)
[05:44:43] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:01:43] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[06:24:33] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[06:29:43] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:33:43] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py]
[06:35:45] (PS1) Madhuvishy: nfs: Move backups to secondary DC to different times [puppet] - https://gerrit.wikimedia.org/r/333327
[06:37:23] PROBLEM - Check HHVM threads for leakage on mw1259 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:39:23] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:41:23] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:43:33] PROBLEM - puppet last run on ms-be2007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata]
[06:48:33] PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[06:52:33] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[07:02:43] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[07:10:33] RECOVERY - puppet last run on ms-be2007 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[07:19:33] RECOVERY - Check HHVM threads for leakage on mw1260 is OK: OK
[07:32:23] RECOVERY - Check HHVM threads for leakage on mw1169 is OK: OK
[08:00:23] RECOVERY - Check HHVM threads for leakage on mw1168 is OK: OK
[08:31:15] (CR) Hashar: "I wasn't aware of -x which is:" [puppet] - https://gerrit.wikimedia.org/r/333230 (https://phabricator.wikimedia.org/T155820) (owner: Hashar)
[09:22:23] RECOVERY - Check HHVM threads for leakage on mw1259 is OK: OK
[10:00:13] PROBLEM - puppet last run on elastic1038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:14:52] (PS8) Juniorsys: geowiki module: Lint changes + modes/umask quoting [puppet] - https://gerrit.wikimedia.org/r/332101 (https://phabricator.wikimedia.org/T93645)
[10:15:07] (PS7) Juniorsys: mediawiki module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332103 (https://phabricator.wikimedia.org/T93645)
[10:15:17] (PS7) Juniorsys: postgresql module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332104 (https://phabricator.wikimedia.org/T93645)
[10:15:43] (PS7) Juniorsys: toollabs role modules: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332110 (https://phabricator.wikimedia.org/T93645)
[10:15:54] (PS7) Juniorsys: toollabs module: Linting changes [puppet] - https://gerrit.wikimedia.org/r/332111
[10:19:49] (CR) Juniorsys: [C: 1] kartotherian: optional parameter listed before required [puppet] - https://gerrit.wikimedia.org/r/332956 (owner: Dzahn)
[10:24:34] (CR) Juniorsys: [C: 1] nfs: Move backups to secondary DC to different times [puppet] - https://gerrit.wikimedia.org/r/333327 (owner: Madhuvishy)
[10:28:13] RECOVERY - puppet last run on elastic1038 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[10:28:45] (CR) Juniorsys: [C: 1] Move some production apache config files to templates [puppet] - https://gerrit.wikimedia.org/r/322602 (https://phabricator.wikimedia.org/T1256) (owner: Alex Monk)
[10:52:23] PROBLEM - puppet last run on labsdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:21:23] RECOVERY - puppet last run on labsdb1001 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[12:00:53] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:01:53] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy
[13:00:43] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:29:43] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[14:40:23] PROBLEM - puppet last run on lvs1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:08:23] RECOVERY - puppet last run on lvs1006 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[16:03:33] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.111 second response time
[16:13:13] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:23:56] (Draft1) Paladox: Phabricator: Fix phd init script, also use systemd script if the os is cable of it [puppet] - https://gerrit.wikimedia.org/r/333358
[16:24:00] (PS2) Paladox: Phabricator: Fix phd init script, also use systemd script if the os is cable of it [puppet] - https://gerrit.wikimedia.org/r/333358
[16:24:47] (CR) jerkins-bot: [V: -1] Phabricator: Fix phd init script, also use systemd script if the os is cable of it [puppet] - https://gerrit.wikimedia.org/r/333358 (owner: Paladox)
[16:25:20] (PS3) Paladox: Phabricator: Fix phd init script, also use systemd script if the os is cable of it [puppet] - https://gerrit.wikimedia.org/r/333358
[16:30:33] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.126 second response time
[16:42:13] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[16:53:54] (PS4) Paladox: Phabricator: Fix phd init script, also use systemd script if the os is cable of it [puppet] - https://gerrit.wikimedia.org/r/333358
[16:55:44] (CR) Paladox: [C: 1] "Tested and the init script works." [puppet] - https://gerrit.wikimedia.org/r/333358 (owner: Paladox)
[17:05:23] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[17:05:43] PROBLEM - puppet last run on maps1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:30:43] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[zotero/translators],Package[zotero/translation-server],Exec[chown /srv/deployment/zotero for deploy-service]
[17:32:12] (CR) Dereckson: [C: 1] Add *.finds.org.uk to wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/333294 (https://phabricator.wikimedia.org/T155844) (owner: Urbanecm)
[17:32:23] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[17:33:43] RECOVERY - puppet last run on maps1003 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[17:59:43] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[18:02:44] (CR) Paladox: [C: 1] "I tested like" [puppet] - https://gerrit.wikimedia.org/r/333358 (owner: Paladox)
[18:09:02] (CR) Giuseppe Lavagetto: [C: -2] "This patch in its current form is not only wrong, but also avoids using" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/333358 (owner: Paladox)
[18:09:19] <_joe_> how can it work paladox ?
[18:09:27] <_joe_> it has a series of clear-cut errors
[18:09:47] <_joe_> oh you mean the init script
[18:10:00] _joe_, i just realised that sudo service wont actually work with scripts, though i could not test that as the systemd one was installed.
[18:10:12] i tested using ./phd start which worked.
[18:12:16] _joe_ how does this base::service_unit work?
[18:27:18] _joe_ im wondering could you help with service_unit please? As it looks like it may not work with phd, as we are running phabricators phd from /srv/phab/phabricator/bin/phd same for sytemd
[19:18:57] (PS1) Urbanecm: Add n, n:es and n:fr as import sources in test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/333362 (https://phabricator.wikimedia.org/T155906)
[19:37:53] <_joe_> paladox: sorry, had to go afk
[19:38:03] PROBLEM - Disk space on ms-be1013 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sda1 is not accessible: Input/output error
[19:38:13] oh
[19:38:29] <_joe_> I'd need to take a look at phd, to understand how to do that properly, but
[19:38:49] <_joe_> why do you need a sysv init script?
[19:38:57] <_joe_> isn't phabricator running on debian jessie?
[19:39:12] _joe_ no
[19:39:13] iridium is ubuntu trusty
[19:39:28] <_joe_> oh ok so you want an upstart script, probably
[19:39:33] we are in the process of migrating it to a new server that runs debian, but currently i have no idea on the status of that.
[19:39:35] oh yep
[19:40:21] <_joe_> yeah if I have time I can give a look, but probably I'll just ask mutante about that. I don't really have much time for this, sorry :/
[19:40:54] Ok
[19:52:17] <_joe_> I mean I'll ask him what needs to be done
[19:53:43] PROBLEM - MegaRAID on ms-be1013 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline)
[19:53:54] ACKNOWLEDGEMENT - MegaRAID on ms-be1013 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T155907
[19:53:58] Operations, ops-eqiad: Degraded RAID on ms-be1013 - https://phabricator.wikimedia.org/T155907#2958424 (ops-monitoring-bot)
[19:55:24] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 609 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3080673 keys, up 82 days 11 hours - replication_delay is 609
[19:57:24] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3077580 keys, up 82 days 11 hours - replication_delay is 0
[20:00:32] !log legoktm@tin Synchronized php-1.29.0-wmf.8/includes: Revert "Added reason suggestion in block/delete/protect forms" (1/2) - T34950 (duration: 01m 31s)
[20:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:38] T34950: Use jQuery.suggestions to add reason suggestions to block/delete/protect forms - https://phabricator.wikimedia.org/T34950
[20:01:30] !log legoktm@tin Synchronized php-1.29.0-wmf.8/resources: Revert "Added reason suggestion in block/delete/protect forms" (1/2) - T34950 (duration: 00m 39s)
[20:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:01:43] oops, that should have said 2/2
[20:01:48] (PS2) Urbanecm: Add n, n:es and n:fr as import sources in test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/333362 (https://phabricator.wikimedia.org/T155906)
[20:02:04] RECOVERY - Disk space on ms-be1013 is OK: DISK OK
[20:02:47] !log legoktm@tin Synchronized php-1.29.0-wmf.8/RELEASE-NOTES-1.29: for completeness (duration: 00m 39s)
[20:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:03:14] PROBLEM - puppet last run on ms-be1013 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sda1]
[20:25:24] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 619 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3079087 keys, up 82 days 12 hours - replication_delay is 619
[20:25:44] PROBLEM - Redis status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 634 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3079272 keys, up 82 days 11 hours - replication_delay is 634
[20:28:12] Operations, Traffic, HTTPS: Add CAA records to our domains - https://phabricator.wikimedia.org/T155806#2958453 (ema)
[20:30:44] RECOVERY - Redis status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3078795 keys, up 82 days 12 hours - replication_delay is 0
[20:31:24] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3078753 keys, up 82 days 12 hours - replication_delay is 0
[20:41:44] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/precise - 185 bytes in 0.218 second response time
[20:42:44] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.649 second response time
[20:48:14] PROBLEM - puppet last run on oresrdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:16:14] RECOVERY - puppet last run on oresrdb1001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[21:45:24] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[21:46:24] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3079312 keys, up 82 days 13 hours - replication_delay is 0
[21:54:34] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:56:53] !log mobrovac@tin Starting deploy [changeprop/deploy@2b980fa]: (no message)
[21:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:57:47] !log mobrovac@tin Finished deploy [changeprop/deploy@2b980fa]: (no message) (duration: 00m 54s)
[21:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:22:34] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[22:29:04] _joe_ hi, we can install sytemd on trusty, as i managed to do it as sudo apt-get install systemd
[22:29:19] it installed from ubuntu as far as i can see so no third party.
[22:36:58] paladox: na not going to happen. Trusty uses upstart
[22:37:08] if you really need systemd, use Debian Jessie :D
[22:38:39] hashar i just installed systemd and it worked
[22:38:53] hashar this is for phabricator
[22:39:12] paladox: replacing the default init system in a distro is not "easy"
[22:39:20] nor is it really smart
[22:39:50] is there some specific benefit of systemd that you think will make phab better?
[22:40:45] bd808 just i am using the one for gerrit, and it dosent work, so i am thinking it wont work for phabricator
[22:41:04] "the one" what?
[22:41:53] bd808 https://phabricator.wikimedia.org/diffusion/ODDX/browse/master/debian/gerrit.init
[22:42:24] PROBLEM - puppet last run on elastic1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:42:52] that's a sysv init script, not a systemd unit
[22:43:09] yep, thats why i am trying to test systemd.
[22:43:23] these words... don't make sense
[22:43:48] you are trying to write a systemd unit for gerrit?
[22:43:57] or doing something for phabricator?
[22:43:58] bd808, running gerrit's init script as a sysvinit service fails, i try to stop it and it wont stop.
[22:44:19] it would be similar for phabricator.
[22:44:34] as they are running external scripts.
[22:45:28] how so? phabricator is a php application; gerrit is java. phab uses it's own hoe grown demon system (phd). apples and watermelons as far as comparison goes
[22:45:37] *home grown
[22:46:58] if `start-stop-daemon -K` isn't killing gerrit's jvm process that probably has something to do with a custom signal handler their java code installs
[22:47:06] oh
[22:47:32] but it kills it if i run the script without sudo service. so doing ./gerrit works.
[22:48:53] it also runs rm -f "$GERRIT_PID" "$GERRIT_RUN"
[22:50:08] (CR) Alex Monk: [C: 1] Tools: Make tools-clush-generator project-agnostic [puppet] - https://gerrit.wikimedia.org/r/326892 (owner: Tim Landscheidt)
[22:50:28] paladox, do you know why Gerrit.wikimedia.org doesn't usually remember me (i use firefox on windows 7 btw) even if i click remember me? P.S. i wish i had some $GERRIT_RUN in me. :P
[22:50:54] Zppix gerrit uses a cookie
[22:50:59] do you use private browsing?
[23:08:28] Operations, Puppet, Beta-Cluster-Infrastructure: Make deployment-prep puppetmaster more similar to Production puppetmaster - https://phabricator.wikimedia.org/T146627#2958585 (Krenair) #2 is now done too.
[23:11:24] RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[23:21:41] Puppet, Beta-Cluster-Infrastructure: Set up puppet exported resources to collect ssh host keys for beta - https://phabricator.wikimedia.org/T72792#2958592 (Krenair) a:Krenair
[23:28:37] Operations, Puppet, Horizon, Labs: Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589#2958596 (bd808)
[23:28:40] Puppet, Beta-Cluster-Infrastructure: Set up puppet exported resources to collect ssh host keys for beta - https://phabricator.wikimedia.org/T72792#2958597 (Krenair) I have a commit on -puppetmaster02 that does this, and it seems to mostly work. It seems to only include ecdsa keys, but it excludes hosts t...
[23:30:49] robh: just having a look at random old but open operations tasks, did someone forget to close out? https://phabricator.wikimedia.org/T86541
[23:39:24] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:39:54] Operations, Discovery, Traffic, Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#2958613 (Ricordisamoa) >>! In T153563#2949967, @Smalyshev wrote: > @Ricordisamoa I don't think reparsing object identifiers is a good id...