[00:00:05] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170907T0000). Please do the needful.
[00:00:24] ve, no way
[00:00:38] ok :)
[00:00:53] ACKNOWLEDGEMENT - Check systemd state on kafka-jumbo1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. ottomata ACK, will check on these tomorrow. Likely soemthing wrong with jmxtrans configs and new kafka
[00:00:53] ACKNOWLEDGEMENT - jmxtrans on kafka-jumbo1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar ottomata ACK, will check on these tomorrow. Likely soemthing wrong with jmxtrans configs and new kafka
[00:00:53] ACKNOWLEDGEMENT - salt-minion processes on kafka-jumbo1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion ottomata ACK, will check on these tomorrow. Likely soemthing wrong with jmxtrans configs and new kafka
[00:00:53] ACKNOWLEDGEMENT - jmxtrans on kafka-jumbo1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar ottomata ACK, will check on these tomorrow. Likely soemthing wrong with jmxtrans configs and new kafka
[00:00:53] ACKNOWLEDGEMENT - jmxtrans on kafka-jumbo1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar ottomata ACK, will check on these tomorrow. Likely soemthing wrong with jmxtrans configs and new kafka
[00:00:53] ACKNOWLEDGEMENT - jmxtrans on kafka-jumbo1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar ottomata ACK, will check on these tomorrow. Likely soemthing wrong with jmxtrans configs and new kafka
[00:00:53] ACKNOWLEDGEMENT - jmxtrans on kafka-jumbo1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar ottomata ACK, will check on these tomorrow. Likely soemthing wrong with jmxtrans configs and new kafka
[00:00:54] ACKNOWLEDGEMENT - jmxtrans on kafka-jumbo1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar ottomata ACK, will check on these tomorrow. Likely soemthing wrong with jmxtrans configs and new kafka
[00:01:13] thanks
[00:02:30] Should I wait to do the phabricator deployment after you've got this straightened out?
[00:02:43] (03CR) 10Chad: "Um, I'm not sure I like this pattern. We don't include things like this into InitialiseSettings. And considering the size of the included " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374655 (https://phabricator.wikimedia.org/T174106) (owner: 10EBernhardson)
[00:05:32] twentyafterfour: no, you should just go as usual
[00:06:01] ok
[00:06:13] 10Operations, 10Performance-Team, 10hardware-requests, 10Patch-For-Review: Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3587032 (10Dzahn)
[00:07:03] !log deploying phabricator update 2017-09-06 https://phabricator.wikimedia.org/project/view/2980/
[00:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:07:27] 10Operations, 10Performance-Team, 10hardware-requests, 10Patch-For-Review: Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3582551 (10Dzahn) it is using role spare:;system now and the only remnants are partman/DHCP and they should usually stay until the end. Giving task back t...
[00:07:38] 10Operations, 10Performance-Team, 10hardware-requests, 10Patch-For-Review: Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3587034 (10Dzahn) a:05Dzahn>03None
[00:07:50] 10Operations, 10ops-eqiad, 10Performance-Team, 10hardware-requests, 10Patch-For-Review: Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3582551 (10Dzahn)
[00:09:04] 10Operations, 10ops-eqiad, 10Performance-Team, 10hardware-requests, 10Patch-For-Review: Decommission osmium.eqiad.wmnet - https://phabricator.wikimedia.org/T175093#3582551 (10Dzahn) HW warranty expiration: **2017-03-23** https://racktables.wikimedia.org/index.php?page=object&tab=default&object_id=2160...
[00:11:52] (03PS2) 10Dzahn: Phabricator: Block vandalism IP that repeatedly added comments / uploaded files [puppet] - 10https://gerrit.wikimedia.org/r/370630 (owner: 10Aklapper)
[00:11:55] !log twentyafterfour@tin Started deploy [phabricator/deployment@c265c1a]: (no justification provided)
[00:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:19] !log twentyafterfour@tin Finished deploy [phabricator/deployment@c265c1a]: (no justification provided) (duration: 00m 24s)
[00:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:43] (03PS3) 10Dzahn: Phabricator: Block vandalism IP that repeatedly added comments / uploaded files [puppet] - 10https://gerrit.wikimedia.org/r/370630 (owner: 10Aklapper)
[00:14:00] (03CR) 10Dzahn: [C: 032] Phabricator: Block vandalism IP that repeatedly added comments / uploaded files [puppet] - 10https://gerrit.wikimedia.org/r/370630 (owner: 10Aklapper)
[00:17:45] (03PS2) 10Dzahn: contint: include mediawiki::packages::php5 [puppet] - 10https://gerrit.wikimedia.org/r/374999 (owner: 10Hashar)
[00:18:51] (03CR) 10Dzahn: [C: 032] contint: include mediawiki::packages::php5 [puppet] - 10https://gerrit.wikimedia.org/r/374999 (owner: 10Hashar)
[00:20:26] twentyafterfour: Phabricator seems down?
[00:21:02] normal maintenance I think...
[00:21:11] scheduled maintenance
[00:21:24] Got it. Are these announced somewhere?
[00:21:35] on the deploy calendar
[00:22:08] Aha. I should add that calendar.
[00:22:35] its handy :)
[00:22:51] search gcal for "WMF Deployments"
[00:23:16] (03PS5) 10Dzahn: Gerrit: Set autoMigrate so that changes are stored in both reviewdb and notedb [puppet] - 10https://gerrit.wikimedia.org/r/373520 (owner: 10Paladox)
[00:23:26] No luck. I think I need the weird address from officewiki.
[00:24:04] Niharika: wikimedia.org_rudis09ii2mm5fk4hgdjeh1u64@group.calendar.google.com
[00:24:32] bd808: Thanks!
[00:24:33] jouncebot: now
[00:24:33] For the next 0 hour(s) and 35 minute(s): Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170907T0000)
[00:24:53] !log phabricator update deployed
[00:25:02] it should be back up now
[00:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:25:12] wfm :)
[00:25:30] (03CR) 10Dzahn: [C: 032] Gerrit: Set autoMigrate so that changes are stored in both reviewdb and notedb [puppet] - 10https://gerrit.wikimedia.org/r/373520 (owner: 10Paladox)
[00:25:52] (03CR) 10Dzahn: "per "doesnt do anything until we upgrade to 2.15.x"" [puppet] - 10https://gerrit.wikimedia.org/r/373520 (owner: 10Paladox)
[00:28:22] (03CR) 10Dzahn: [C: 04-1] "i think it should use systemctl and not the "service" command anymore" [puppet] - 10https://gerrit.wikimedia.org/r/370622 (https://phabricator.wikimedia.org/T172847) (owner: 1020after4)
[00:29:23] 10Operations, 10Cloud-Services, 10wikitech.wikimedia.org, 10HHVM: Move wikitech (silver) to HHVM - https://phabricator.wikimedia.org/T98813#3587052 (10bd808) >>! In T98813#3585990, @Jdforrester-WMF wrote: > Does that mean we can Just Do It™ now? See T168470. Cloud Services has 2 new physical servers that...
[00:30:22] 10Operations, 10Cloud-Services, 10wikitech.wikimedia.org, 10HHVM: Move wikitech (silver) to HHVM - https://phabricator.wikimedia.org/T98813#3587066 (10Jdforrester-WMF) Brilliant. :-)
[00:30:37] (03CR) 10Dzahn: [C: 031] "lgtm, but are the dependencies still going to happen? see comments on the 2 related changes" [puppet] - 10https://gerrit.wikimedia.org/r/374813 (owner: 10Hashar)
[00:31:21] 10Operations, 10Cloud-Services, 10wikitech.wikimedia.org: Determine whether wikitech should really depend on production search cluster - https://phabricator.wikimedia.org/T110987#1591503 (10bd808) >>! In T110987#1610756, @chasemp wrote: > We could run a local instance of elasticsearch? Could we, probably. S...
[00:33:00] (03CR) 10Dzahn: "let us know with a +1 once that happened" [puppet] - 10https://gerrit.wikimedia.org/r/360420 (https://phabricator.wikimedia.org/T161086) (owner: 10Hashar)
[00:55:53] (03CR) 10Krinkle: "Nope, good to go :)" [puppet] - 10https://gerrit.wikimedia.org/r/376151 (https://phabricator.wikimedia.org/T175093) (owner: 10Krinkle)
[01:00:23] (03PS2) 10Chad: Remove Extension:RelatedSites from zhwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375971 (https://phabricator.wikimedia.org/T174979) (owner: 10Jayprakash12345)
[01:03:03] (03PS3) 10Chad: Remove Extension:RelatedSites from zhwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375971 (https://phabricator.wikimedia.org/T174979) (owner: 10Jayprakash12345)
[01:39:36] 10Operations, 10Cloud-Services, 10wikitech.wikimedia.org: Determine whether wikitech should really depend on production search cluster - https://phabricator.wikimedia.org/T110987#3587170 (10Dzahn) The reasons to keep it self-contained (information available if other stuff is down) are probably also mitigate...
[02:18:18] (03PS6) 10Ottomata: [WIP] Initial commit of certpy [software/certpy] - 10https://gerrit.wikimedia.org/r/359960 (https://phabricator.wikimedia.org/T166167)
[02:18:56] (03PS7) 10Ottomata: [WIP] Initial commit of certpy [software/certpy] - 10https://gerrit.wikimedia.org/r/359960 (https://phabricator.wikimedia.org/T166167)
[02:24:54] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.16) (duration: 07m 23s)
[02:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:44:50] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.17) (duration: 06m 15s)
[02:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:52:05] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Sep 7 02:52:05 UTC 2017 (duration 7m 15s)
[02:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:28:58] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 679.00 seconds
[04:00:08] (03Abandoned) 10KartikMistry: Matxin MT service for ContentTranslation [puppet] - 10https://gerrit.wikimedia.org/r/374706 (owner: 10KartikMistry)
[04:34:48] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 22 probes of 291 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map
[04:36:58] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 279.19 seconds
[04:39:48] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 291 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map
[05:01:53] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3587254 (10Johnuniq) My understanding of the situati...
[05:09:17] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3463343 (10Legoktm) AIUI the fix for this bug is cur...
[05:10:18] PROBLEM - MegaRAID on db2010 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[05:10:20] ACKNOWLEDGEMENT - MegaRAID on db2010 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T175228
[05:10:37] 10Operations, 10ops-codfw: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3587259 (10ops-monitoring-bot)
[05:19:55] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3587269 (10Marostegui) @Papaul please change the disk whenever you can. Thanks!
[06:17:48] PROBLEM - Check Varnish expiry mailbox lag on cp1063 is CRITICAL: CRITICAL: expiry mailbox lag is 2034916
[06:27:51] (03PS1) 10DCausse: Upgrade plugins to elastic 5.5.2 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/376477 (https://phabricator.wikimedia.org/T175159)
[06:28:09] PROBLEM - graphite.wikimedia.org on graphite1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time
[06:29:08] RECOVERY - graphite.wikimedia.org on graphite1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.011 second response time
[06:43:35] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3587376 (10Multichill) >>! In T171392#3587254, @John...
[06:46:54] (03PS9) 10Mobrovac: JobQueue: Add the RunSingleJob.php script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370004
[06:46:59] RECOVERY - salt-minion processes on kafka-jumbo1001 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:50:08] PROBLEM - salt-minion processes on kafka-jumbo1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:14:47] (03CR) 10Giuseppe Lavagetto: [C: 032] JobQueue: Add the RunSingleJob.php script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370004 (owner: 10Mobrovac)
[07:16:24] (03Merged) 10jenkins-bot: JobQueue: Add the RunSingleJob.php script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370004 (owner: 10Mobrovac)
[07:16:52] (03CR) 10Mobrovac: JobQueue: Add the RunSingleJob.php script (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370004 (owner: 10Mobrovac)
[07:20:13] !log oblivian@tin Synchronized rpc/RunSingleJob.php: Adding the RunSingleJob.php rpc endpoint for jobrunners (duration: 00m 56s)
[07:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:58] !log mobrovac@tin Started deploy [restbase/deploy@b51c58c] (staging): Use MCS for the summary endpoint
[07:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:41] !log mobrovac@tin Finished deploy [restbase/deploy@b51c58c] (staging): Use MCS for the summary endpoint (duration: 01m 43s)
[07:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:16] !log mobrovac@tin Started deploy [restbase/deploy@b51c58c]: Use MCS for the summary endpoint - T168848
[07:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:29] T168848: Bootstrap an initial version of the Page Summary API in MCS - https://phabricator.wikimedia.org/T168848
[07:40:18] PROBLEM - MariaDB Slave Lag: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 180469.98 seconds
[07:43:17] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Add db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376481 (https://phabricator.wikimedia.org/T172679)
[07:43:33] (03PS2) 10Marostegui: db-eqiad,db-codfw.php: Add db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376481 (https://phabricator.wikimedia.org/T172679)
[07:45:52] !log mobrovac@tin Finished deploy [restbase/deploy@b51c58c]: Use MCS for the summary endpoint - T168848 (duration: 06m 36s)
[07:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:05] T168848: Bootstrap an initial version of the Page Summary API in MCS - https://phabricator.wikimedia.org/T168848
[07:50:49] (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw.php: Add db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376481 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui)
[07:52:26] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Add db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376481 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui)
[07:53:58] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Add db1100 - T172679 (duration: 00m 49s)
[07:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:10] T172679: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679
[07:54:39] RECOVERY - Disk space on stat1005 is OK: DISK OK
[07:54:52] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add db1100 - T172679 (duration: 00m 48s)
[07:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:45] !log force re-mount of /mnt/hdfs on stat1005
[07:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:49] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0
[08:05:58] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[08:08:18] RECOVERY - salt-minion processes on kafka-jumbo1001 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:09:48] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0
[08:09:59] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0
[08:10:59] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[08:11:58] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0
[08:15:44] (03CR) 10Jayprakash12345: "@Chad Thank you" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375971 (https://phabricator.wikimedia.org/T174979) (owner: 10Jayprakash12345)
[08:27:47] (03PS3) 10Addshore: Use the same name for group an user in wmde stats [puppet] - 10https://gerrit.wikimedia.org/r/375811
[08:29:50] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3587467 (10jcrespo) @Marostegui, are we sure we want this done, and not get rid of the host directly?- it is a very old host and we have its replacements setup. I would ask how many spare disks we have left,...
[08:31:46] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3587473 (10Marostegui) >>! In T175228#3587467, @jcrespo wrote: > @Marostegui, are we sure we want this done, and not get rid of the host directly?- it is a very old host and we have its replacements setup. I...
[08:34:34] 10Operations, 10Performance-Team, 10monitoring: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3587487 (10fgiunchedi) >>! In T158837#3584259, @faidon wrote: > Am I right to understand that the current plan is 2 VMs? If so, yeah, that sounds absolutely fine :)...
[08:38:44] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3587490 (10jcrespo) @Papaul Do you have plenty of old 300GB disks that would not be used otherwise or should we speed up the decomissioning (it will happen eventually, but right now we have other priorities).
[08:39:42] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3587492 (10Marostegui) a:03Papaul
[08:40:08] (03PS4) 10Elukey: statistics::wmde: use the same name for group and user in wmde stats [puppet] - 10https://gerrit.wikimedia.org/r/375811 (owner: 10Addshore)
[08:40:35] (03CR) 10Elukey: [C: 032] statistics::wmde: use the same name for group and user in wmde stats [puppet] - 10https://gerrit.wikimedia.org/r/375811 (owner: 10Addshore)
[08:41:09] (03CR) 10jenkins-bot: Don't bother polluting function namespace, just use an anonymous one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374663 (owner: 10Chad)
[08:41:11] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376291 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[08:41:13] (03CR) 10jenkins-bot: Fix $wgPropertySuggesterDeprecatedIds for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376317 (https://phabricator.wikimedia.org/T174962) (owner: 10Hoo man)
[08:41:15] (03CR) 10jenkins-bot: labs: Remove OAuth setting duplicating prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370291 (owner: 10MaxSem)
[08:41:17] (03CR) 10jenkins-bot: Flow settings: wmg -> wg migration, part 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370292 (owner: 10MaxSem)
[08:44:17] !log restart varnish backend on cp1063 - mailbox expiry lag
[08:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:51] 10Operations, 10Goal, 10Kubernetes, 10Services (watching): Standardize on the "default" pod setup - https://phabricator.wikimedia.org/T170120#3587521 (10mobrovac) +1 on decoupling these concerns from the the running services. This model would allow developers to concentrate solely on their service's functi...
[08:45:52] varnishlog -n cp1063 -g request looks good
[08:46:53] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: WMF LDAP group access for pivot.wikimedia.org - https://phabricator.wikimedia.org/T175156#3587523 (10schoenbaechler) Thanks Daniel (@Dzahn), works as a charm! 👍 Have a nice Thursday!
[08:47:48] RECOVERY - Check Varnish expiry mailbox lag on cp1063 is OK: OK: expiry mailbox lag is 0
[08:49:45] (03CR) 10jenkins-bot: JobQueue: Add the RunSingleJob.php script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370004 (owner: 10Mobrovac)
[08:54:21] (03CR) 10jenkins-bot: Try fixing ACW setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376443 (owner: 10MaxSem)
[08:55:05] (03PS1) 10Hashar: Build for php5.5 on jessie-wikimedia [debs/pkg-php/php-redis] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/376483 (https://phabricator.wikimedia.org/T174972)
[08:57:47] (03CR) 10ArielGlenn: [C: 032] add 'general' to the list of properties retrieved for siteinfo dumps [dumps] - 10https://gerrit.wikimedia.org/r/375982 (https://phabricator.wikimedia.org/T171400) (owner: 10ArielGlenn)
[08:58:58] !log ariel@tin Started deploy [dumps/dumps@da05e9f]: add general info to siteinfo dumps
[08:59:00] !log ariel@tin Finished deploy [dumps/dumps@da05e9f]: add general info to siteinfo dumps (duration: 00m 02s)
[08:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:11] (03PS2) 10Alexandros Kosiaris: package_builder: test -nt differs in bash vs dash [puppet] - 10https://gerrit.wikimedia.org/r/376378 (owner: 10Hashar)
[08:59:16] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] package_builder: test -nt differs in bash vs dash [puppet] - 10https://gerrit.wikimedia.org/r/376378 (owner: 10Hashar)
[08:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:34] (03PS2) 10Alexandros Kosiaris: package_builder: typo: s/output/result/ directory [puppet] - 10https://gerrit.wikimedia.org/r/376322 (owner: 10Hashar)
[09:02:39] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] package_builder: typo: s/output/result/ directory [puppet] - 10https://gerrit.wikimedia.org/r/376322 (owner: 10Hashar)
[09:03:02] (03CR) 10jenkins-bot: Flow settings: wmg -> wg migration, part 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370293 (owner: 10MaxSem)
[09:03:04] (03CR) 10jenkins-bot: Revert "Flow settings: wmg -> wg migration, part 2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376325 (owner: 10MaxSem)
[09:03:06] (03CR) 10jenkins-bot: Revert "Flow settings: wmg -> wg migration, part 1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376326 (owner: 10MaxSem)
[09:03:08] (03CR) 10jenkins-bot: Configure ACW for Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376431 (https://phabricator.wikimedia.org/T175054) (owner: 10Niharika29)
[09:03:10] (03CR) 10jenkins-bot: Configure CirrusSearch human relevance survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374655 (https://phabricator.wikimedia.org/T174106) (owner: 10EBernhardson)
[09:03:12] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Add db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376481 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui)
[09:03:14] (03CR) 10jenkins-bot: Revert "mariadb: Depool db2040 for reboot and upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376220 (owner: 10Jcrespo)
[09:05:04] 10Operations, 10Performance-Team, 10monitoring: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3587553 (10fgiunchedi) >>! In T158837#3584304, @Ottomata wrote: >> Any downtime permanently affects the graphs. > > Just an uninformed idea: If you produce direct...
[09:09:03] (03PS4) 10ArielGlenn: convert "other" dump crons to use script to grabconfig settings [puppet] - 10https://gerrit.wikimedia.org/r/375791 (https://phabricator.wikimedia.org/T174929)
[09:09:06] (03PS1) 10Marostegui: db-eqiad.php: Add db1100 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376484 (https://phabricator.wikimedia.org/T172679)
[09:10:04] (03CR) 10Hashar: [C: 04-1] "The redis.so seems to load fine with php5.5. It is missing /etc/php5/mods-available/redis.ini though :(" [debs/pkg-php/php-redis] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/376483 (https://phabricator.wikimedia.org/T174972) (owner: 10Hashar)
[09:10:22] (03PS2) 10Marostegui: db-eqiad.php: Add db1100 to s5 depooled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376484 (https://phabricator.wikimedia.org/T172679)
[09:12:09] (03CR) 10Hashar: [C: 04-1] "Packages at https://people.wikimedia.org/~hashar/debs/php5.5-redis-jessie/" [debs/pkg-php/php-redis] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/376483 (https://phabricator.wikimedia.org/T174972) (owner: 10Hashar)
[09:13:17] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Add db1100 to s5 depooled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376484 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui)
[09:13:57] (03PS5) 10ArielGlenn: convert "other" dump crons to use script to grabconfig settings [puppet] - 10https://gerrit.wikimedia.org/r/375791 (https://phabricator.wikimedia.org/T174929)
[09:14:41] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3587560 (10mobrovac) p:05Normal>03High
[09:14:52] (03Merged) 10jenkins-bot: db-eqiad.php: Add db1100 to s5 depooled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376484 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui)
[09:14:53] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work): setup/install logstash100[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T175045#3587566 (10Gehel) I'm taking over this task for service implementation.
[09:15:58] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add db1100 depooled to s5 array - T172679 (duration: 00m 49s)
[09:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:11] T172679: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679
[09:16:42] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work): all log producers need to use the logstash LVS endpoint - https://phabricator.wikimedia.org/T175242#3587571 (10Gehel)
[09:17:57] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work): all log producers need to use the logstash LVS endpoint - https://phabricator.wikimedia.org/T175242#3587589 (10Gehel)
[09:19:57] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3587592 (10mobrovac)
[09:24:43] (03CR) 10jenkins-bot: db-eqiad.php: Add db1100 to s5 depooled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376484 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui)
[09:26:02] (03PS1) 10Gehel: logstash - configure new logstash100[7-9] nodes [puppet] - 10https://gerrit.wikimedia.org/r/376488 (https://phabricator.wikimedia.org/T175045)
[09:26:02] 10Operations, 10Cassandra, 10Epic, 10Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3587601 (10fgiunchedi) The cassandra 3 in production is indeed up now, a couple of followups left to do in Puppet: * Disable `cassandra-metrics-co...
[09:26:34] (03CR) 10jerkins-bot: [V: 04-1] logstash - configure new logstash100[7-9] nodes [puppet] - 10https://gerrit.wikimedia.org/r/376488 (https://phabricator.wikimedia.org/T175045) (owner: 10Gehel)
[09:27:59] (03PS2) 10Gehel: logstash - configure new logstash100[7-9] nodes [puppet] - 10https://gerrit.wikimedia.org/r/376488 (https://phabricator.wikimedia.org/T175045)
[09:30:41] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3587610 (10Verdy_p) In addition I think the major pr...
[09:44:07] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3587629 (10Verdy_p) >>! In T171392#3585563, @Anomie...
[09:46:47] (03PS2) 10DCausse: Upgrade plugins to elastic 5.5.2 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/376477 (https://phabricator.wikimedia.org/T175159)
[09:51:08] RECOVERY - puppet last run on ms-be2023 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[09:51:19] (03PS2) 10Hashar: Build for php5.5 on jessie-wikimedia [debs/pkg-php/php-redis] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/376483 (https://phabricator.wikimedia.org/T174972)
[09:54:36] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T174777#3587683 (10fgiunchedi) a:05fgiunchedi>03Papaul Thanks @Papaul ! Disk rebuilding, kicking back to you in case you need to followup with the return before resolving.
[09:56:36] (03CR) 10Hashar: [C: 04-1] "Had to rename a file to match the new binary package name:" [debs/pkg-php/php-redis] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/376483 (https://phabricator.wikimedia.org/T174972) (owner: 10Hashar)
[10:26:06] Hi ops-team
[10:26:14] I'm experiencing a problem with Wikitech
[10:26:28] I can't login: [b214bbd759784acae77fa366] /w/index.php?title=Special:UserLogin&returnto=Main+Page&returntoquery=action%3Dedit UnderflowException from line 413 of /srv/mediawiki/w/extensions/ConfirmEdit/FancyCaptcha/FancyCaptcha.class.php: Ran out of captcha images
[10:28:36] (03CR) 10Hoo man: [C: 031] convert "other" dump crons to use script to grabconfig settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/375791 (https://phabricator.wikimedia.org/T174929) (owner: 10ArielGlenn)
[10:31:09] (03PS1) 10Ladsgroup: Add English Wiktionary as a client of Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376495 (https://phabricator.wikimedia.org/T159316)
[10:37:45] (03PS6) 10ArielGlenn: convert "other" dump crons to use script to grab config settings [puppet] - 10https://gerrit.wikimedia.org/r/375791 (https://phabricator.wikimedia.org/T174929)
[10:40:26] (03PS3) 10Hashar: Build for php5.5 on jessie-wikimedia [debs/pkg-php/php-redis] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/376483 (https://phabricator.wikimedia.org/T174972)
[10:41:41] (03CR) 10Hashar: "Hacked a symlink with dh_link to enable redis in the CLI via /etc/php/5.5/cli/conf.d/20-redis.ini" [debs/pkg-php/php-redis] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/376483 (https://phabricator.wikimedia.org/T174972) (owner: 10Hashar)
[10:42:18] PROBLEM - puppet last run on eventlog1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:42:42] (03CR) 10ArielGlenn: convert "other" dump crons to use script to grab config settings (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/375791 (https://phabricator.wikimedia.org/T174929) (owner: 10ArielGlenn)
[11:07:59] (03CR) 10Aude: [C: 04-1] "wmgWikibaseEnableData is still set to false for all wiktionary sites including enwiktionary." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376495 (https://phabricator.wikimedia.org/T159316) (owner: 10Ladsgroup)
[11:09:48] RECOVERY - puppet last run on eventlog1001 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[11:10:29] joal: still seeing the issue?
[11:11:03] elukey: I've been logged-in, so no more issues
[11:11:18] Will try to logout and back in
[11:11:28] just tried and it works
[11:11:42] (03PS2) 10Ladsgroup: Add English Wiktionary as a client of Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376495 (https://phabricator.wikimedia.org/T159316)
[11:11:55] (03CR) 10Ladsgroup: "Thanks. Added." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376495 (https://phabricator.wikimedia.org/T159316) (owner: 10Ladsgroup)
[11:12:08] elukey: Worked for me as well - Sorry for the noise, I don't know what happened
[11:14:47] joal: you are a fancy bot! :P
[11:15:34] elukey: Turing-test would probably say you're correct :)
[11:18:01] (03CR) 10Aude: Add English Wiktionary as a client of Wikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376495 (https://phabricator.wikimedia.org/T159316) (owner: 10Ladsgroup)
[11:20:09] 10Operations, 10OCG-General, 10Reading-Community-Engagement, 10Epic, and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#3587830 (10ovasileva)
[11:20:16] (03PS1) 10Marostegui: db-eqiad.php: Pool db1100 with weight 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376499 (https://phabricator.wikimedia.org/T172679)
[11:21:11] (03CR) 10Ladsgroup: Add English Wiktionary as a client of Wikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376495 (https://phabricator.wikimedia.org/T159316) (owner: 10Ladsgroup)
[11:21:40] (03CR) 10Aude: Add English Wiktionary as a client of Wikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376495 (https://phabricator.wikimedia.org/T159316) (owner: 10Ladsgroup)
[11:21:55] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Pool db1100 with weight 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376499 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui)
[11:23:30] (03Merged) 10jenkins-bot: db-eqiad.php: Pool db1100 with weight 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376499 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui)
[11:23:41] (03CR) 10jenkins-bot: db-eqiad.php: Pool db1100 with weight 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376499 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui)
[11:24:38] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Pool db1100 with 0 weight - T172679 (duration: 00m 49s)
[11:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:51] T172679: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679
[11:37:07] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work): all log producers need to use the logstash LVS endpoint - https://phabricator.wikimedia.org/T175242#3587571 (10Ladsgroup) It seems striker, maps and aqs need fixing.
[11:42:49] PROBLEM - MD RAID on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:42:58] ACKNOWLEDGEMENT - MD RAID on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T175252
[11:43:01] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T175252#3587864 (10ops-monitoring-bot)
[11:43:19] PROBLEM - swift-container-replicator on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:43:58] PROBLEM - swift-object-replicator on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:44:08] PROBLEM - Check size of conntrack table on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:44:18] PROBLEM - swift-account-replicator on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:44:18] PROBLEM - swift-object-server on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:44:19] PROBLEM - Check systemd state on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:44:29] PROBLEM - dhclient process on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:44:29] PROBLEM - Disk space on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:44:39] PROBLEM - HP RAID on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:44:50] ACKNOWLEDGEMENT - HP RAID on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T175253
[11:44:50] PROBLEM - swift-object-auditor on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:44:50] PROBLEM - swift-object-updater on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:44:54] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T175253#3587869 (10ops-monitoring-bot)
[11:45:10] PROBLEM - swift-account-reaper on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:45:11] PROBLEM - salt-minion processes on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:45:11] PROBLEM - swift-account-auditor on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:45:20] PROBLEM - swift-container-updater on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:46:00] PROBLEM - puppet last run on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:46:11] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:46:30] PROBLEM - swift-account-replicator on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:46:50] PROBLEM - swift-container-auditor on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:46:52] (03PS1) 10Ladsgroup: service: Use LVS endpoint for logstash [puppet] - 10https://gerrit.wikimedia.org/r/376500 (https://phabricator.wikimedia.org/T175242)
[11:47:10] PROBLEM - swift-object-updater on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:47:30] PROBLEM - Check size of conntrack table on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:47:40] PROBLEM - very high load average likely xfs on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:47:51] PROBLEM - SSH on ms-be2023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:48:10] PROBLEM - puppet last run on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:48:20] PROBLEM - configured eth on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:48:30] PROBLEM - DPKG on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:48:31] RECOVERY - swift-object-server on ms-be2023 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[11:48:31] RECOVERY - Check systemd state on ms-be2023 is OK: OK - running: The system is fully operational
[11:48:31] RECOVERY - swift-container-replicator on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[11:48:41] RECOVERY - Disk space on ms-be2023 is OK: DISK OK
[11:48:41] RECOVERY - dhclient process on ms-be2023 is OK: PROCS OK: 0 processes with command name dhclient
[11:48:41] RECOVERY - SSH on ms-be2023 is OK: SSH OK - OpenSSH_7.4p1 Debian-10 (protocol 2.0)
[11:48:41] RECOVERY - swift-container-auditor on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[11:49:00] RECOVERY - puppet last run on ms-be2023 is OK: OK: Puppet is currently enabled, last run 32 minutes ago with 0 failures
[11:49:01] RECOVERY - swift-object-auditor on ms-be2023 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[11:49:01] RECOVERY - swift-object-updater on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[11:49:10] RECOVERY - swift-object-replicator on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[11:49:11] RECOVERY - swift-account-reaper on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[11:49:11] RECOVERY - salt-minion processes on ms-be2023 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[11:49:11] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2023 is OK: OK ferm input default policy is set
[11:49:11] RECOVERY - configured eth on ms-be2023 is OK: OK - interfaces up
[11:49:11] RECOVERY - swift-account-auditor on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[11:49:20] RECOVERY - swift-container-updater on ms-be2023 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[11:49:20] RECOVERY - MD RAID on ms-be2023 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[11:49:21] RECOVERY - Check size of conntrack table on ms-be2023 is OK: OK: nf_conntrack is 1 % full
[11:49:21] RECOVERY - DPKG on ms-be2023 is OK: All packages OK
[11:49:30] RECOVERY - swift-account-replicator on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[11:50:31] RECOVERY - very high load average likely xfs on ms-be2023 is OK: OK - load average: 48.26, 73.52, 60.42
[11:54:50] RECOVERY - HP RAID on ms-be2023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK
[11:58:58] what's up with ms-be2023? It seems to be up and fine now
[11:59:05] godog: ^
[11:59:32] it had a disk replaced earlier, maybe
[11:59:52] the rebuilding just finished?
[12:00:44] https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=ms-be2023&refresh=1m&orgId=1
[12:00:54] load skyrocketed for a bit
[12:09:26] (03CR) 10Hashar: "And installation fails because php5.5-redis ships /usr/share/php/.registry/.channel.pecl.php.net/redis.reg which is already in php5-redis " [debs/pkg-php/php-redis] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/376483 (https://phabricator.wikimedia.org/T174972) (owner: 10Hashar)
[12:36:45] (03PS1) 10Alexandros Kosiaris: Remove redundant double-quotes from LVS config [puppet] - 10https://gerrit.wikimedia.org/r/376510
[12:37:21] (03PS4) 10Hashar: Build for php5.5 on jessie-wikimedia [debs/pkg-php/php-redis] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/376483 (https://phabricator.wikimedia.org/T174972)
[12:40:08] <_joe_> addshore: around?
[12:40:56] _joe_: yup
[12:41:27] 10Operations, 10Ops-Access-Requests: Requesting access to Production shell access for rho (Rita Ho) - https://phabricator.wikimedia.org/T175204#3587993 (10RHo) Hi @Dzahn - hope the following is what you're after, pasted in from the public key file generated: ``` ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIHuPUUK0SGl...
[12:42:08] (03CR) 10Alexandros Kosiaris: [C: 031] "LGTM. @mobrovac, any objections ?" [puppet] - 10https://gerrit.wikimedia.org/r/376500 (https://phabricator.wikimedia.org/T175242) (owner: 10Ladsgroup)
[12:45:42] <_joe_> I'd give a heads up to gehel, rather :)
[12:46:09] <_joe_> I'm not sure if this will have immediate effect everywhere, actually most services are not restarted automagically
[12:46:19] <_joe_> but do /not/ force-run puppet everywhere in sync
[12:49:19] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3588015 (10Joe) I did some more number crunching on the instances of runJob.php I'm running on terbium, I found what follows: **Wikibase `refreshlinks`...
[12:58:57] _joe_: heads up on what?
[12:59:08] * gehel is reading back but not finding the context...
[13:00:02] <_joe_> gehel: moving all services to the logstash lvs endpoint, basically
[13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170907T1300). Please do the needful.
[13:00:06] Amir1: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[13:00:11] o/
[13:00:43] Amir1: looks like your the only one!
[13:00:55] o/
[13:01:02] _joe_: I cannot login anymore due to login {result Failed reason {You have made too many recent login attempts. Please wait 5 minutes before trying again.}} by API login for user:Doc_Taxon for more than 24 hours now. Can anyone give free the login for this user, please? (But a login onwiki is possible (shrug))
[13:01:07] _joe_: sounds like a nice idea :)
[13:01:14] I can SWAT today, unless somebody else wants to?
[13:01:22] * addshore runs away
[13:01:36] * zeljkof looks at addshore o.O
[13:01:53] for the record: I can SWAT today! :)
[13:02:12] * gehel has found the context! Thx _joe_ !
[13:02:42] 10Operations, 10Pybal, 10Traffic, 10monitoring, 10Patch-For-Review: pybal: add prometheus metrics - https://phabricator.wikimedia.org/T171710#3588053 (10ema) >>! In T171710#3584226, @faidon wrote: > I know a bunch of work happened during the Wikimania hackathon, but what's the status of this? Most of th...
[13:03:14] can anyone help me anyhow? ^
[13:03:28] <_joe_> doctaxon: I don't think I can help you on that, sorry
[13:03:47] Amir1: looks like your 376495 is the only thing for swat
[13:03:53] * zeljkof is reviewing 376495
[13:04:19] yeah
[13:04:34] (03CR) 10Gehel: [C: 031] "Note that as far as I know, we only have a few services sending logs through LVS at the moment. I know the syslog (10514/UDP) is being use" [puppet] - 10https://gerrit.wikimedia.org/r/376500 (https://phabricator.wikimedia.org/T175242) (owner: 10Ladsgroup)
[13:05:16] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376495 (https://phabricator.wikimedia.org/T159316) (owner: 10Ladsgroup)
[13:05:24] (03CR) 10Gehel: [C: 031] "Btw, big thanks on doing this cleanup!" [puppet] - 10https://gerrit.wikimedia.org/r/376500 (https://phabricator.wikimedia.org/T175242) (owner: 10Ladsgroup)
[13:06:47] (03Merged) 10jenkins-bot: Add English Wiktionary as a client of Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376495 (https://phabricator.wikimedia.org/T159316) (owner: 10Ladsgroup)
[13:06:57] (03CR) 10jenkins-bot: Add English Wiktionary as a client of Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376495 (https://phabricator.wikimedia.org/T159316) (owner: 10Ladsgroup)
[13:08:58] Amir1: 376495 is at mwdebug1002
[13:09:06] okay, thanks
[13:09:07] please test and let me know if I can deploy
[13:09:48] Amir1: any order in which files should be deployed? or any order would do?
[13:10:34] zeljkof: works fine, 1- the dblist first (it's better to do it that way but I doubt any order would break things)
[13:10:57] Amir1: ok, deploying, dblist first
[13:12:13] !log zfilipin@tin Synchronized dblists/wikidataclient.dblist: SWAT: [[gerrit:376495|Add English Wiktionary as a client of Wikidata (T159316)]] (duration: 00m 49s)
[13:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:25] T159316: Enable arbitrary access on English Wiktionary - https://phabricator.wikimedia.org/T159316
[13:13:07] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:376495|Add English Wiktionary as a client of Wikidata (T159316)]] (duration: 00m 49s)
[13:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:21] Amir1: deployed! please test
[13:13:53] anything else for SWAT? or can I close it?
[13:14:10] zeljkof: works fine: https://en.wiktionary.org/w/index.php?title=Category:Persian_adjectives&diff=47458071&oldid=47080945
[13:14:11] Thanks
[13:14:27] Amir1: thanks for releasing with #releng! ;)
[13:14:33] :D
[13:14:50] !log EU SWAT finished
[13:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:43] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3588101 (10jcrespo) Could, at least, that part have something to do with T164173, as a problem from the same cause, or a consequence of the fix? I also r...
[13:21:40] (03PS1) 10Elukey: Revert "role::mariadb::analytics::custom_repl_slave: raise el_sync batch to 10k" [puppet] - 10https://gerrit.wikimedia.org/r/376513
[13:22:10] (03CR) 10jerkins-bot: [V: 04-1] Revert "role::mariadb::analytics::custom_repl_slave: raise el_sync batch to 10k" [puppet] - 10https://gerrit.wikimedia.org/r/376513 (owner: 10Elukey)
[13:22:18] too long?
[13:22:57] 10Operations, 10ops-eqiad, 10DBA: Decommission db1049 - https://phabricator.wikimedia.org/T175264#3588137 (10Marostegui)
[13:23:14] 10Operations, 10ops-eqiad, 10DBA: Decommission db1049 - https://phabricator.wikimedia.org/T175264#3588137 (10Marostegui) p:05Triage>03Normal
[13:24:02] (03PS2) 10Elukey: Revert "role::mariadb::analytics::custom_repl_slave: el_sync batch 10k" [puppet] - 10https://gerrit.wikimedia.org/r/376513
[13:24:20] (03PS3) 10Elukey: Revert "role::mariadb::analytics::custom_repl_slave: el_sync batch 10k" [puppet] - 10https://gerrit.wikimedia.org/r/376513
[13:24:24] (03CR) 10jerkins-bot: [V: 04-1] Revert "role::mariadb::analytics::custom_repl_slave: el_sync batch 10k" [puppet] - 10https://gerrit.wikimedia.org/r/376513 (owner: 10Elukey)
[13:24:42] (03CR) 10jerkins-bot: [V: 04-1] Revert "role::mariadb::analytics::custom_repl_slave: el_sync batch 10k" [puppet] - 10https://gerrit.wikimedia.org/r/376513 (owner: 10Elukey)
[13:26:04] (03PS4) 10Elukey: Revert eventlogging_sync batch size back to 1000 [puppet] - 10https://gerrit.wikimedia.org/r/376513
[13:26:56] (03CR) 10Elukey: [C: 032] Revert eventlogging_sync batch size back to 1000 [puppet] - 10https://gerrit.wikimedia.org/r/376513 (owner: 10Elukey)
[13:28:06] ema: odd (re: ms-be2023) it had a broken disk that got replaced
[13:30:10] PROBLEM - puppet last run on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:30:11] PROBLEM - salt-minion processes on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:30:11] PROBLEM - swift-account-auditor on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:30:20] PROBLEM - swift-container-updater on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:30:20] PROBLEM - MD RAID on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:30:31] ACKNOWLEDGEMENT - MD RAID on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T175267
[13:30:32] PROBLEM - very high load average likely xfs on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:30:32] PROBLEM - swift-account-replicator on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:30:34] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T175267#3588187 (10ops-monitoring-bot)
[13:30:51] PROBLEM - Check systemd state on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:30:51] PROBLEM - swift-container-replicator on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:31:02] PROBLEM - dhclient process on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:31:02] PROBLEM - swift-container-server on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:31:07] I'll silence it now
[13:31:21] PROBLEM - swift-object-updater on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:31:21] PROBLEM - swift-object-auditor on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:31:31] PROBLEM - swift-object-replicator on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:31:31] PROBLEM - configured eth on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:31:41] PROBLEM - Check size of conntrack table on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:32:01] PROBLEM - Disk space on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:32:02] PROBLEM - swift-account-server on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:33:13] (03PS1) 10Giuseppe Lavagetto: jobrunner: add discovery data [puppet] - 10https://gerrit.wikimedia.org/r/376516 (https://phabricator.wikimedia.org/T174599)
[13:33:50] ACKNOWLEDGEMENT - MD RAID on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T175271
[13:33:55] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T175271#3588240 (10ops-monitoring-bot)
[13:34:30] srsly... another way that check_nrpe can fail...
[13:34:54] <_joe_> ahah
[13:35:42] mmmh interestingly enough we already had that in the blacklist
[13:35:44] https://github.com/wikimedia/puppet/blob/production/modules/icinga/files/raid_handler.py#L23
[13:35:47] I'll check the logs
[13:36:14] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T175271#3588253 (10Volans) 05Open>03Invalid False positive, I'll check why was not blacklisted
[13:36:33] godog: is ms-be2023 overloaded to get back the missing data?
[13:39:30] ACKNOWLEDGEMENT - HP RAID on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T175275
[13:39:33] _joe_: this was pebcak on the case
[13:39:33] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T175275#3588300 (10ops-monitoring-bot)
[13:41:22] ACKNOWLEDGEMENT - MD RAID on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T175276
[13:41:24] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T175276#3588316 (10ops-monitoring-bot)
[13:41:34] volans: I think so but I'll check, I'm not sure yet
[13:42:48] (03PS1) 10Volans: Icinga: RAID handler ignore case for blacklist [puppet] - 10https://gerrit.wikimedia.org/r/376517
[13:42:53] this should fix it, do you mind a quick review? ^^
[13:44:34] <_joe_> volans: how would this not break other things?
[13:44:47] <_joe_> oh sorry, read it incorrectly
[13:44:55] (03PS2) 10Volans: Icinga: RAID handler ignore case for blacklist [puppet] - 10https://gerrit.wikimedia.org/r/376517
[13:44:56] nitpick coming
[13:45:06] <_joe_> I mistakenly read if instead of for
[13:45:07] <_joe_> :P
[13:45:10] lol
[13:45:34] ACKNOWLEDGEMENT - MD RAID on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T175277
[13:45:37] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T175277#3588332 (10ops-monitoring-bot)
[13:45:45] <_joe_> I usually use re.I, but LGTM :P
[13:46:11] (03CR) 10Giuseppe Lavagetto: [C: 031] Icinga: RAID handler ignore case for blacklist [puppet] - 10https://gerrit.wikimedia.org/r/376517 (owner: 10Volans)
[13:46:14] (03CR) 10Volans: [C: 032] Icinga: RAID handler ignore case for blacklist [puppet] - 10https://gerrit.wikimedia.org/r/376517 (owner: 10Volans)
[13:46:26] knfgfdckrgnrbjrceercbvnjufjjhtdnv
[13:46:53] nice :D
[13:48:13] (03PS1) 10Giuseppe Lavagetto: Add discovery entry for jobrunner, active/passive [dns] - 10https://gerrit.wikimedia.org/r/376518 (https://phabricator.wikimedia.org/T174599)
[13:48:55] ACKNOWLEDGEMENT - MD RAID on ms-be2023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T175278
[13:48:58] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T175278#3588341 (10ops-monitoring-bot)
[13:49:58] merged on icinga
[13:50:28] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T175278#3588346 (10Volans) 05Open>03Invalid
[13:50:35] sorry for the spam, closing the invalid tasks
[13:50:47] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T175275#3588347 (10Volans) 05Open>03Invalid
[13:51:01] (03PS2) 10Giuseppe Lavagetto: jobrunner: add discovery data [puppet] - 10https://gerrit.wikimedia.org/r/376516 (https://phabricator.wikimedia.org/T174599)
[13:51:10] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T175276#3588349 (10Volans) 05Open>03Invalid
[13:51:19] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T175277#3588352 (10Volans) 05Open>03Invalid
[13:51:36] RECOVERY - swift-object-updater on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[13:51:36] RECOVERY - swift-object-auditor on ms-be2023 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[13:51:37] RECOVERY - configured eth on ms-be2023 is OK: OK - interfaces up
[13:51:37] RECOVERY - swift-object-replicator on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[13:51:37] RECOVERY - salt-minion processes on ms-be2023 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[13:51:46] RECOVERY - swift-account-auditor on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[13:51:56] RECOVERY - Check size of conntrack table on ms-be2023 is OK: OK: nf_conntrack is 0 % full
[13:51:57] RECOVERY - swift-container-updater on ms-be2023 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[13:51:57] RECOVERY - very high load average likely xfs on ms-be2023 is OK: OK - load average: 74.32, 76.20, 62.49
[13:51:57] RECOVERY - swift-account-replicator on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[13:52:07] RECOVERY - swift-container-replicator on ms-be2023 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[13:52:16] RECOVERY - Disk space on ms-be2023 is OK: DISK OK
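(The ignore-case blacklist fix reviewed above, https://gerrit.wikimedia.org/r/376517, boils down to matching alert text against the RAID handler's blacklist without regard to case, so an output like "Could not complete SSL handshake." is skipped even if the blacklisted string was written in a different case. Below is a minimal Python sketch of that idea; the blacklist entries, function name and sample strings are illustrative assumptions, not the actual contents of modules/icinga/files/raid_handler.py.)

```python
import re

# Hypothetical blacklist of check outputs that should NOT open a Phabricator
# task; the real raid_handler.py keeps its own list, these values are made up.
OUTPUT_BLACKLIST = [
    r'could not complete ssl handshake',
    r'connection refused by host',
]

def is_blacklisted(check_output):
    """Return True if the alert output matches any blacklist pattern,
    ignoring case (re.IGNORECASE, aka re.I as mentioned in the review)."""
    return any(re.search(pattern, check_output, re.IGNORECASE)
               for pattern in OUTPUT_BLACKLIST)

# With case-insensitive matching, the mixed-case Icinga output is caught:
print(is_blacklisted('CHECK_NRPE: Error - Could not complete SSL handshake.'))  # True
print(is_blacklisted('CRITICAL: 1 failed LD(s) (Degraded)'))                    # False
```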
[13:52:17] RECOVERY - swift-account-server on ms-be2023 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [13:52:17] RECOVERY - dhclient process on ms-be2023 is OK: PROCS OK: 0 processes with command name dhclient [13:52:17] RECOVERY - swift-container-server on ms-be2023 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [13:52:27] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: add discovery data [puppet] - 10https://gerrit.wikimedia.org/r/376516 (https://phabricator.wikimedia.org/T174599) (owner: 10Giuseppe Lavagetto) [13:52:29] sigh, login as root doesn't work from console, I'll reboot [13:52:42] no wait [13:52:47] let me see if it fixes :D [13:52:54] * volans joking ofc [13:53:01] lolz [13:53:55] I got a timeout now, but not SSL handshake failures [13:54:01] :-P [13:54:31] !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=jobrunner,name=eqiad [13:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:57] !log powercycle ms-be2023 - load through the roof and no login possible [13:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:56] <_joe_> volans: oh the immediate output for a 1-node selection now works <3 [13:56:16] yep [13:56:22] one of the new features [13:56:51] <_joe_> yeah i remember :P [13:59:06] RECOVERY - MD RAID on ms-be2023 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [13:59:47] RECOVERY - puppet last run on ms-be2023 is OK: OK: Puppet is currently enabled, last run 42 minutes ago with 0 failures [14:00:11] (03CR) 10Giuseppe Lavagetto: [C: 032] Add discovery entry for jobrunner, active/passive [dns] - 10https://gerrit.wikimedia.org/r/376518 (https://phabricator.wikimedia.org/T174599) (owner: 10Giuseppe Lavagetto) [14:05:48] (03CR) 10Mobrovac: [C: 031] "LGTM too, but we need to sync on the roll-out, since this change will restart many a service, and some will need to be restarted manually." [puppet] - 10https://gerrit.wikimedia.org/r/376500 (https://phabricator.wikimedia.org/T175242) (owner: 10Ladsgroup) [14:07:21] (03PS1) 10Ema: varnish: convert role::cache::instances into a class [puppet] - 10https://gerrit.wikimedia.org/r/376521 [14:08:02] (03PS2) 10Alexandros Kosiaris: Remove redundant double-quotes from LVS config [puppet] - 10https://gerrit.wikimedia.org/r/376510 [14:10:11] (03CR) 10Alexandros Kosiaris: [C: 032] Remove redundant double-quotes from LVS config [puppet] - 10https://gerrit.wikimedia.org/r/376510 (owner: 10Alexandros Kosiaris) [14:12:28] (03PS1) 10Rush: openstack: clean up kilo files and templates [puppet] - 10https://gerrit.wikimedia.org/r/376522 (https://phabricator.wikimedia.org/T171494) [14:13:21] (03PS2) 10Rush: openstack: clean up kilo files and templates [puppet] - 10https://gerrit.wikimedia.org/r/376522 (https://phabricator.wikimedia.org/T171494) [14:14:58] (03CR) 10Paladox: [C: 04-1] "Testing this, seems to break it for users that use upper case username. -1 for now until i figure out why this is failing." 
[puppet] - 10https://gerrit.wikimedia.org/r/368196 (owner: 10Paladox) [14:15:08] (03CR) 10Ema: "pcc seems to like this: https://puppet-compiler.wmflabs.org/compiler02/7757/" [puppet] - 10https://gerrit.wikimedia.org/r/376521 (owner: 10Ema) [14:17:06] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T175267#3588426 (10Volans) 05Open>03Invalid [14:17:14] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T175253#3588427 (10Volans) 05Open>03Invalid [14:17:23] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T175252#3588430 (10Volans) 05Open>03Invalid [14:17:30] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3547586 (10mobrovac) >>! In T173710#3588015, @Joe wrote: > Wikibase `refreshlinks` jobs might benefit from being in smaller batches +1 on this. As we ha... [14:21:18] RECOVERY - Check systemd state on ms-be2023 is OK: OK - running: The system is fully operational [14:32:24] 10Operations, 10Continuous-Integration-Config, 10Release-Engineering-Team (Backlog): operations-puppet-tests-docker console output lacks color - https://phabricator.wikimedia.org/T175057#3588464 (10Dzahn) p:05Triage>03Low [14:33:11] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3588473 (10Anomie) >>! In T171392#3587254, @Johnuniq... [14:35:42] (03CR) 10Rush: [C: 032] openstack: clean up kilo files and templates [puppet] - 10https://gerrit.wikimedia.org/r/376522 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [14:36:07] 10Operations, 10Analytics-Kanban, 10hardware-requests: Decommission stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T173097#3588483 (10Dzahn) p:05Triage>03Normal [14:36:18] 10Operations, 10Analytics-Kanban, 10hardware-requests: Decommission stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T173097#3518505 (10Dzahn) 05Open>03stalled [14:36:38] 10Operations, 10Analytics-Kanban, 10hardware-requests: Decommission stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T173097#3518505 (10Dzahn) [14:36:40] 10Operations, 10ops-eqiad, 10Analytics: Remove stat1002 - https://phabricator.wikimedia.org/T173094#3588487 (10Dzahn) [14:37:27] 10Operations, 10ops-eqiad, 10Analytics-Kanban: Decommission stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T175150#3588490 (10Dzahn) p:05Triage>03Normal [14:37:57] 10Operations, 10Ops-Access-Requests: Change prod uid from diego to dsaez, so it can match with the ldap uid - https://phabricator.wikimedia.org/T175220#3588491 (10Dzahn) [14:40:31] !log rebooting and upgrading es1019 [14:40:34] 10Operations, 10Ops-Access-Requests: Change prod uid from diego to dsaez, so it can match with the ldap uid - https://phabricator.wikimedia.org/T175220#3588515 (10Dzahn) [14:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:54] 10Operations, 10Ops-Access-Requests: Change prod uid from diego to dsaez, so it can match with the ldap uid - https://phabricator.wikimedia.org/T175220#3587007 (10Dzahn) p:05Triage>03High [14:41:53] 10Operations, 10Ops-Access-Requests, 10Research, 10Patch-For-Review: Access for new Research Scientist: Diego Saez - https://phabricator.wikimedia.org/T172891#3512337 
(10Dzahn) 05Resolved>03Open please see T175220 [14:44:07] 10Operations, 10Kubernetes, 10Prod-Kubernetes (Experiment), 10User-Joe: Make security updates of docker images manageable - https://phabricator.wikimedia.org/T167269#3588529 (10akosiaris) [14:44:09] 10Operations, 10Operations-Software-Development: New tool to track package updates/status for hosts and images (debmonitor) - https://phabricator.wikimedia.org/T167504#3588531 (10akosiaris) [14:44:15] 10Operations, 10Goal, 10Kubernetes, 10Services (watching), 10User-Joe: Standardize on the "default" pod setup - https://phabricator.wikimedia.org/T170120#3588532 (10Joe) [14:44:32] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 3 others: Separate off ChangePropagation for JobQueue as a new deployment - https://phabricator.wikimedia.org/T175281#3588533 (10mobrovac) [14:44:54] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 3 others: Separate off ChangePropagation for JobQueue as a new deployment - https://phabricator.wikimedia.org/T175281#3588552 (10mobrovac) [14:44:59] <_joe_> !log manually running /usr/share/mdadm/checkarray --cron --all --idle --quiet on conf2001, trying to reproduce consensus issues in T162013 [14:45:03] 10Operations, 10Analytics, 10Traffic: Invalid "wikimedia" family in unique devices data due to misplaced WMF-Last-Access-Global cookie - https://phabricator.wikimedia.org/T174640#3568769 (10BBlack) The model's a bit different in the wikimedia.org case, I'm not even sure there's a rational answer here. Can w... [14:45:11] <_joe_> opsens: this *might* page [14:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:12] T162013: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013 [14:45:37] let's see! [14:46:56] <_joe_> volans: I suspect we need all three running at the same time [14:47:02] <_joe_> but let's start with one [14:47:10] <_joe_> also, I ahve a meeting in 15 minutes or so... [14:47:16] perfect timing! [14:49:49] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Separate off ChangePropagation for JobQueue as a new deployment - https://phabricator.wikimedia.org/T175281#3588564 (10mobrovac) [14:51:26] <_joe_> consensus lost! [14:51:39] <_joe_> volans: we might have a winner, great catch [14:51:43] \o/ [14:52:08] * volans wonders why this doesn't happen in eqiad, different hardware/raid setup? [14:52:27] <_joe_> still not enough to have it fail completely though [14:59:03] 10Operations, 10Analytics, 10Traffic: Invalid "wikimedia" family in unique devices data due to misplaced WMF-Last-Access-Global cookie - https://phabricator.wikimedia.org/T174640#3568769 (10Nuria) I think here we should not think of global unique devices for wikimedia.org domains and rather use just per-doma... [14:59:28] 10Operations, 10Analytics, 10Traffic: Invalid "wikimedia" family in unique devices data due to misplaced WMF-Last-Access-Global cookie - https://phabricator.wikimedia.org/T174640#3568769 (10JAllemandou) Thanks @BBlack for the detailed explanations :) As for using the full `host` header value for wikimedia.or... 
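On the checkarray experiment logged above: the idea is to generate sustained disk I/O on a conf host and watch whether the local etcd member loses raft consensus while the MD array check runs. A rough sketch of that kind of probe, assuming a local etcd answering plain HTTP on the default client port 2379, which may not match the production setup (client TLS, different port).

```python
import subprocess
import time
import urllib.request

# Assumption: a local etcd member answering plain HTTP on the default client port.
HEALTH_URL = 'http://localhost:2379/health'

def etcd_healthy():
    """Best-effort health probe against the local etcd member."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
            return b'true' in resp.read()
    except OSError:
        return False

if __name__ == '__main__':
    # Kick off the same array check quoted in the !log entry above (needs root).
    subprocess.Popen(['/usr/share/mdadm/checkarray', '--cron', '--all', '--idle', '--quiet'])
    # Watch whether the member stays healthy while the check generates disk I/O.
    for _ in range(360):  # roughly 30 minutes of polling
        print(time.strftime('%H:%M:%S'), 'healthy' if etcd_healthy() else 'UNHEALTHY')
        time.sleep(5)
```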
[15:04:24] (03PS1) 10Rush: openstack: cleanup keystone references in old module [puppet] - 10https://gerrit.wikimedia.org/r/376531 (https://phabricator.wikimedia.org/T171494) [15:04:55] 10Operations, 10Ops-Access-Requests, 10Research, 10Patch-For-Review: Access for new Research Scientist: Diego Saez - https://phabricator.wikimedia.org/T172891#3588593 (10RobH) a:05RobH>03None [15:05:07] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3588594 (10Papaul) @jcrespo technically we do not have any 300GB spare disks. I am trying to load my Google spreadsheet for server decommission to see if we we do have a server with 300GB but can't for the m... [15:09:51] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3588596 (10jcrespo) @Papaul no problem- do not work too hard, we may replace the full server soon. [15:16:30] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 5 others: Separate off ChangePropagation for JobQueue as a new deployment - https://phabricator.wikimedia.org/T175281#3588604 (10mobrovac) [15:17:36] 10Operations, 10MediaWiki-Vagrant, 10Release-Engineering-Team, 10Epic, 10Patch-For-Review: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#2334755 (10Dzahn) Meanwhile production is on their way to stretch and jessie is already oldstable. Would it make sense to consi... [15:17:57] 10Operations, 10Goal, 10Kubernetes, 10Services (watching), 10User-Joe: Standardize on the "default" pod setup - https://phabricator.wikimedia.org/T170120#3419870 (10GWicke) I don't have strong views on how to scale metrics and log collection. In any case, we have been doing this remotely for a while now... [15:19:13] 10Operations, 10Security-Team, 10monitoring: icinga notification if elevated writing to badpass.log - https://phabricator.wikimedia.org/T150300#3588623 (10Dzahn) [15:21:00] 10Operations, 10DBA, 10Patch-For-Review: dbtree: don't return 200 on error pages - https://phabricator.wikimedia.org/T163143#3588637 (10Dzahn) 05Open>03Resolved a:03Dzahn https://gerrit.wikimedia.org/r/#/c/353388/1/index.php [15:21:02] 10Operations, 10DBA, 10Traffic, 10Patch-For-Review: dbtree broken (for some users?) - https://phabricator.wikimedia.org/T162976#3588640 (10Dzahn) [15:22:26] 10Operations, 10MediaWiki-Vagrant, 10Release-Engineering-Team, 10Epic, 10Patch-For-Review: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#3588642 (10bd808) >>! In T136429#3588618, @Dzahn wrote: > Meanwhile production is on their way to stretch and jessie is already... [15:22:58] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3588643 (10Verdy_p) >>! In T171392#3588473, @Anomie... [15:23:09] 10Operations, 10MediaWiki-Vagrant, 10Release-Engineering-Team, 10Epic, and 2 others: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#3588644 (10bd808) 05Open>03Resolved a:03bd808 Closing the tracker task to reduce confusion about the state of the default VMs. 
[15:25:03] 10Operations: add support to offboard-user to support mailman list removal - https://phabricator.wikimedia.org/T161566#3588667 (10Dzahn) [15:25:05] 10Operations, 10Office-IT, 10LDAP: Remove disabled users from internal mailing lists - https://phabricator.wikimedia.org/T161004#3588665 (10Dzahn) [15:25:57] !log mobrovac@tin Started deploy [restbase/deploy@77961d0]: Revert using MCS for the summary end point - T168848 [15:26:01] 10Operations, 10monitoring: Icinga monitoring for Yubikey components - https://phabricator.wikimedia.org/T151048#3588673 (10Dzahn) [15:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:10] T168848: Bootstrap an initial version of the Page Summary API in MCS - https://phabricator.wikimedia.org/T168848 [15:28:10] 10Operations, 10MediaWiki-Vagrant, 10Release-Engineering-Team, 10Epic, and 2 others: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#3588681 (10Dzahn) Thanks! A task for tracking migration of prod MW servers to stretch has recently been opened at T174431. [15:29:31] 10Operations, 10monitoring, 10Patch-For-Review: Several hosts return "internal IPMI error" in the check_ipmi_temp check - https://phabricator.wikimedia.org/T167121#3588684 (10jcrespo) a:05faidon>03Cmjohnson Hi, @Cmjohnson We definitely a "drain flea power" on es1019, it does not reboot and is unrespons... [15:32:14] !log mobrovac@tin Finished deploy [restbase/deploy@77961d0]: Revert using MCS for the summary end point - T168848 (duration: 06m 18s) [15:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:27] T168848: Bootstrap an initial version of the Page Summary API in MCS - https://phabricator.wikimedia.org/T168848 [15:32:57] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - restbase_7231 - Could not depool server restbase2011.codfw.wmnet because of too many down! [15:33:57] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [15:35:11] 10Operations, 10Cassandra, 10Epic, 10Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3588712 (10Eevans) >>! In T169939#3587601, @fgiunchedi wrote: > The cassandra 3 in production is indeed up now, a couple of followups left to do in... [15:38:35] 10Operations, 10Beta-Cluster-Infrastructure, 10HHVM: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#3588721 (10Jdforrester-WMF) [15:56:37] (03PS7) 10ArielGlenn: convert "other" dump crons to use script to grab config settings [puppet] - 10https://gerrit.wikimedia.org/r/375791 (https://phabricator.wikimedia.org/T174929) [15:58:33] (03CR) 10ArielGlenn: [C: 032] convert "other" dump crons to use script to grab config settings [puppet] - 10https://gerrit.wikimedia.org/r/375791 (https://phabricator.wikimedia.org/T174929) (owner: 10ArielGlenn) [16:00:04] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170907T1600). [16:00:30] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3588777 (10Anomie) >>! In T171392#3588643, @Verdy_p... 
[16:07:44] moritzm: Hey, when you have a moment would you be able to do https://phabricator.wikimedia.org/T174477 (convert staging cluster videoscalers over to jessie, like you already did for prod)? [16:09:44] (03PS4) 10Chad: Remove Extension:RelatedSites from zhwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375971 (https://phabricator.wikimedia.org/T174979) (owner: 10Jayprakash12345) [16:09:50] (03CR) 10Chad: [C: 032] Remove Extension:RelatedSites from zhwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375971 (https://phabricator.wikimedia.org/T174979) (owner: 10Jayprakash12345) [16:11:27] (03Merged) 10jenkins-bot: Remove Extension:RelatedSites from zhwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375971 (https://phabricator.wikimedia.org/T174979) (owner: 10Jayprakash12345) [16:11:41] (03CR) 10jenkins-bot: Remove Extension:RelatedSites from zhwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375971 (https://phabricator.wikimedia.org/T174979) (owner: 10Jayprakash12345) [16:13:35] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: Remove Extension:RelatedSites from zhwikivoyage (duration: 00m 50s) [16:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:19] 10Operations, 10monitoring: Review check_ping settings - https://phabricator.wikimedia.org/T173315#3588873 (10herron) Check_fping looks to be faster at multiple pings than check_ping when testing in a vagrant box. Nice! ``` jessie:~/check_fping# time ./check_fping -n 5 -H 10.0.0.1 OK - 10.0.0.1: loss 0%, rta... [16:20:43] 10Operations, 10MediaWiki-Platform-Team, 10Performance-Team, 10Epic, 10Services (watching): 2017/18 Annual Plan Program 8: Multi-datacenter support - https://phabricator.wikimedia.org/T175206#3588876 (10GWicke) [16:20:59] bleh, new phab versioning makes the notification dropdown take up more room than before =P [16:21:39] robh: Did anything else change? [16:21:51] i havent noticed much yet [16:22:01] i meant our current deployed version [16:22:16] on tuesday the notifications in the drop down only were a single line, now they line wrap [16:22:25] robh they lowered the number shown in the drop down notification [16:22:27] Yeah. [16:22:32] fits less in the drop down... this is hopefully the bottom point of my day ;D [16:22:41] I just ignore that entirely so it's fine by me. [16:22:48] I came back to 156 notifications from a single day off =P [16:22:58] I use them extensively for dc ops, heh [16:23:08] but oh well, ill just pull up the actual notification page more. [16:39:18] PROBLEM - puppet last run on lvs4003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:39:44] 10Operations, 10Research, 10The-Wikipedia-Library, 10Traffic, and 6 others: Set an explicit "Origin When Cross-Origin" referer policy via the meta referrer tag - https://phabricator.wikimedia.org/T87276#3588918 (10Nemo_bis) >>! In T87276#3560795, @DarTar wrote: > has been in effect and confirmed by variou... [16:55:28] 10Operations, 10hardware-requests, 10Release-Engineering-Team (Watching / External): eqiad: replacement tin/deployment server - https://phabricator.wikimedia.org/T174452#3588936 (10mark) Approved. [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170907T1700). Please do the needful. 
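On the check_fping comparison quoted from T173315 above: the gain comes from fping pacing its probes much more tightly than classic ping, which waits about a second between echo requests by default. A rough way to reproduce that timing comparison, assuming fping and ping are installed; the target address is the same illustrative one as in the quoted vagrant test.

```python
import subprocess
import time

def timed(cmd):
    """Run a command, discard its output, and return wall-clock seconds elapsed."""
    start = time.perf_counter()
    subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=False)
    return time.perf_counter() - start

if __name__ == '__main__':
    target = '10.0.0.1'  # illustrative, as in the quoted test
    # fping lets the spacing between probes to the same target be tuned (-p, in ms),
    # while classic ping waits roughly one second between echo requests by default,
    # so a 5-probe check completes noticeably faster with fping.
    print('fping: %.2fs' % timed(['fping', '-c', '5', '-p', '200', target]))
    print('ping:  %.2fs' % timed(['ping', '-c', '5', target]))
```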
[17:01:33] 10Operations, 10Release-Engineering-Team (Watching / External): setup/install/deploy boron as deployment server - https://phabricator.wikimedia.org/T175288#3588945 (10RobH) [17:01:50] 10Operations, 10hardware-requests, 10Release-Engineering-Team (Watching / External): eqiad: replacement tin/deployment server - https://phabricator.wikimedia.org/T174452#3562461 (10RobH) 05Open>03Resolved Setup of this system will be on T175288. [17:05:56] 10Operations, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#3588971 (10RobH) [17:07:48] RECOVERY - puppet last run on lvs4003 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [17:09:14] (03PS1) 10Andrew Bogott: Remove presumed typo from hiera def of profile::openstack::main::monitor::spread_check_password [labs/private] - 10https://gerrit.wikimedia.org/r/376547 [17:10:06] (03CR) 10Andrew Bogott: [V: 032 C: 032] Remove presumed typo from hiera def of profile::openstack::main::monitor::spread_check_password [labs/private] - 10https://gerrit.wikimedia.org/r/376547 (owner: 10Andrew Bogott) [17:15:14] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3589016 (10Papaul) @jcrespo db200[1-9} all have 12x300Gb disks we can pull one out and use it for db2010 for now. [17:19:26] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3589027 (10jcrespo) Yes, those are unused, you can use one of those with no problem. Please do if it doesn't take much of your time, thank you. [17:20:28] !log otto@tin Started deploy [analytics/refinery@bdf6754]: (no justification provided) [17:20:37] no justifcation provided! OOPS [17:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:53] !log otto@tin Finished deploy [analytics/refinery@bdf6754]: (no justification provided) (duration: 03m 25s) [17:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:27] (03PS1) 10Andrew Bogott: openstack: add a few more stub passwords to labs-private [labs/private] - 10https://gerrit.wikimedia.org/r/376548 [17:24:39] (03CR) 10Andrew Bogott: [V: 032 C: 032] openstack: add a few more stub passwords to labs-private [labs/private] - 10https://gerrit.wikimedia.org/r/376548 (owner: 10Andrew Bogott) [17:24:44] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3589049 (10Ladsgroup) I made the batch smaller from 100 to 50 and I can do it to 20. Let me make a patch. [17:31:28] !log otto@tin Started deploy [analytics/refinery@bdf6754]: (no justification provided) [17:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:50] !log otto@tin Finished deploy [analytics/refinery@bdf6754]: (no justification provided) (duration: 00m 23s) [17:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:08] !log otto@tin Started deploy [analytics/refinery@6b60f2c]: (no justification provided) [17:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:43] (03CR) 10Andrew Bogott: [C: 031] "Puppet compiler shows some file moves but no other differences." 
[puppet] - 10https://gerrit.wikimedia.org/r/376531 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [17:34:28] 10Operations, 10Cloud-Services, 10Mail: Create a labs SMTP smarthost - https://phabricator.wikimedia.org/T41785#3589088 (10bd808) [17:35:31] !log otto@tin Finished deploy [analytics/refinery@6b60f2c]: (no justification provided) (duration: 03m 23s) [17:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:20] (03PS1) 10RobH: deploy1001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/376556 (https://phabricator.wikimedia.org/T175288) [17:42:20] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#3589125 (10RobH) [17:43:02] (03CR) 10RobH: [C: 032] deploy1001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/376556 (https://phabricator.wikimedia.org/T175288) (owner: 10RobH) [17:57:16] (03PS2) 10Rush: openstack: cleanup keystone references in old module [puppet] - 10https://gerrit.wikimedia.org/r/376531 (https://phabricator.wikimedia.org/T171494) [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170907T1800). [18:01:00] bd808: https://gerrit.wikimedia.org/r/#/c/375046/? :P [18:01:59] Niharika: I'll +2 if you roll it out and make sure it works :) [18:02:29] bd808: Sure, what's the rollout process? [18:02:51] Niharika: https://wikitech.wikimedia.org/wiki/Tool:Jouncebot [18:03:15] bd808: Okay, will do that. [18:03:19] basically, update the git clone on toolforge and restart the bot [18:03:25] (03CR) 10BryanDavis: [C: 032] Avoid pinging deployers unless there are patches to be deployed [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/375046 (owner: 10Niharika29) [18:03:27] Got it. [18:04:05] (03CR) 10Volans: "Minor comment inline" (031 comment) [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/375046 (owner: 10Niharika29) [18:04:09] (03Abandoned) 10Jdlrobson: Disable RelatedSites on English, French and Italian Wikivoyages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335830 (https://phabricator.wikimedia.org/T128326) (owner: 10Jdlrobson) [18:04:20] I'd do something like ./jouncebot.sh update && ./jouncebot.sh restart && ./jouncebot.sh tail [18:04:59] then probably make a window to test it and force a refresh with a "jouncebot: refresh" command here [18:05:04] (03Merged) 10jenkins-bot: Avoid pinging deployers unless there are patches to be deployed [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/375046 (owner: 10Niharika29) [18:05:18] then wait for it to figure out and shout at you [18:06:36] * Niharika nods [18:19:50] 10Operations, 10ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3589302 (10Papaul) @elukey got a call from the Dell manager support team here is what going to happen for the next step. They will send out : 1- Another main board 2 - 2 CPU's 3 - controller panel 4 - controller pa... [18:19:57] PROBLEM - EventBus HTTP Error Rate -4xx + 5xx- on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [10.0] [18:20:08] hmmm [18:20:10] ! [18:21:07] Pchelolo: ^^ those look like invalid mediawiki jobs [18:21:26] hm.... 
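The invalid jobs mentioned here turn out to be events whose request_id fails a JSON-schema pattern check; the UUID regex is quoted in the error message just below. A rough sketch of that style of validation in Python; the event layout and function name are illustrative, not EventBus's actual code.

```python
import re

# UUID pattern as quoted in the EventBus validation error below.
REQUEST_ID_RE = re.compile(r'^[a-fA-F0-9]{8}(-[a-fA-F0-9]{4}){3}-[a-fA-F0-9]{12}$')

def validate_request_id(event):
    """Return None if the event carries a UUID-shaped request_id, else an error string."""
    request_id = (event.get('meta') or {}).get('request_id')  # illustrative field location
    if not isinstance(request_id, str) or not REQUEST_ID_RE.match(request_id):
        return '%r does not match %s' % (request_id, REQUEST_ID_RE.pattern)
    return None

if __name__ == '__main__':
    # A literal string "null", as ended up in these RecordLintJob events, fails the check
    # and shows up as 4xx responses on the EventBus error-rate graph.
    print(validate_request_id({'meta': {'request_id': 'null'}}))
    print(validate_request_id({'meta': {'request_id': '3b8c9f2a-1d2e-4f5a-9b7c-0a1b2c3d4e5f'}}))
```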
[18:21:26] u'null' does not match '^[a-fA-F0-9]{8}(-[a-fA-F0-9]{4}){3}-[a-fA-F0-9]{12}$ [18:21:45] \"request_id\": \"null\ [18:22:16] ottomata: it just started like 10 minutes ago and we didn't see that ever before [18:23:38] request_id is supposed to be filled in by MW, right? but, it shoudl come from a http header set by varnish? [18:23:42] maybe its not being set somewhere? [18:24:34] So the job is `RecordLintJob` [18:25:35] isn't it requestId? I added that param some time ago. it's set unconditionally by the Job constructor [18:26:16] (03PS1) 10Ladsgroup: Reduce wikiPageUpdaterDbBatchSize to 20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376562 (https://phabricator.wikimedia.org/T173710) [18:29:39] that's really weird, eventbus sets the request id unconditionally and generates a new one if it's not available.. how can it be null? [18:35:48] Pchelolo: what's wrong with RecordLintJob? [18:35:57] I just started a script that will queue more of those [18:36:28] legoktm: so it seems your script is setting `x-request-id` to a literal string "null" [18:36:36] uh [18:36:38] or it's getting set somewhere along the way [18:36:56] well my script isn't inserting the job directly [18:37:07] it gets inserted via API requests [18:37:16] but those API requests are only being called by Parsoid, so inside the cluster [18:37:57] RECOVERY - MariaDB Slave Lag: s1 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89987.80 seconds [18:38:06] and we are setting passing that id to the EventBus and it fails validation. It's obviously an mistake to pass it along, I'll fix that, but right now it's causing an alert [18:38:08] PROBLEM - EventBus HTTP Error Rate -4xx + 5xx- on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [10.0] [18:38:19] this alert ^^ [18:38:24] do you want me to stop? [18:38:51] (03PS2) 10BBlack: browsersec: bump to 14% 2017-09-07 [puppet] - 10https://gerrit.wikimedia.org/r/376310 (https://phabricator.wikimedia.org/T163251) [18:39:19] !log restarted script to reparse all pages in parsoid for Linter (python3 parsoid_reparse.py http://parsoid.discovery.wmnet:8000 --sitematrix --linter-only --skip-closed https://aa.wikipedia.org/w/api.php) - T161556 [18:39:20] it doesn't really matter, I think I will just remove that validation from the schemas for now, and then work on a proper solution [18:39:24] ok [18:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:32] T161556: Implement a way to have linter reprocess all pages - https://phabricator.wikimedia.org/T161556 [18:39:34] (03CR) 10BBlack: [C: 032] browsersec: bump to 14% 2017-09-07 [puppet] - 10https://gerrit.wikimedia.org/r/376310 (https://phabricator.wikimedia.org/T163251) (owner: 10BBlack) [18:40:08] PROBLEM - EventBus HTTP Error Rate -4xx + 5xx- on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [10.0] [18:41:07] Pchelolo: fwiw my code is https://git.legoktm.com/legoktm/parsoid-reparse/src/master/parsoid-reparse/__init__.py#L96 [18:41:39] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3589409 (10Ladsgroup) This patch can go in when commons is on wmf.17. Sooner, it's useless. (See {T174422}) [18:42:03] (03CR) 10Chad: [C: 032] group1 to wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376305 (owner: 10Chad) [18:43:26] ottomata: let's just disable the request-id validation while I'm working on a proper solution? 
https://gerrit.wikimedia.org/r/376563 [18:43:37] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table bawiktionary.l10n_cache doesnt exist on query. Default database: bawiktionary. [Query snipped] [18:43:44] (03Merged) 10jenkins-bot: group1 to wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376305 (owner: 10Chad) [18:46:01] (03CR) 10jenkins-bot: group1 to wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376305 (owner: 10Chad) [18:47:55] !log demon@tin Synchronized php: symlink update -> wmf.17 (duration: 00m 48s) [18:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:28] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 to wmf.17 [18:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:37] merged Pchelolo [18:52:46] running puppet [18:54:02] (03PS1) 10Herron: WIP: icinga: add check_sysctl.sh script [puppet] - 10https://gerrit.wikimedia.org/r/376566 (https://phabricator.wikimedia.org/T160060) [18:57:48] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [18:58:12] (03PS1) 10Gehel: adding Priority: optional to metadata [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/376568 [19:00:04] RainbowSprinkles: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170907T1900). [19:01:30] I'm entirely unsure where to ask but I was asked if travel funds for one to attend an wmf conference was possible to attain from wmf [19:02:37] RECOVERY - EventBus HTTP Error Rate -4xx + 5xx- on graphite1001 is OK: OK: Less than 50.00% above the threshold [1.0] [19:03:46] Zppix: every conference I can think of that the Foundation sponsors has some sort of scholarship application process [19:03:59] the process varies by event [19:04:17] bd808: but say if they are not able to afford travel costs can they be gaven money for that? [19:04:44] yes. the scholarships are typically travel + full board [19:04:57] although some events also have partial scholarships [19:05:32] i see [19:05:36] But yeah, all depends on the event. Some have larger budgets than others (meaning scholarship is harder to get). And yeah, process for selection depends by event as well [19:05:51] !log demon@tin Synchronized php-1.30.0-wmf.17/extensions/MobileFrontend/includes/specials/SpecialMobileHistory.php: T175161 (duration: 00m 49s) [19:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:05] T175161: Special:MobileHistory warning: Using deprecated fallback handling for comment rev_comment [Called from CommentStore::getCommentInternal in /Users/jrobson/git/core/includes/CommentStore.php at line 200] - https://phabricator.wikimedia.org/T175161 [19:06:18] (03PS1) 10Thcipriani: Scap: Allow phabricator as a source [puppet] - 10https://gerrit.wikimedia.org/r/376571 [19:06:29] Zppix: as an example, here are docs on the process for the most recent Wikimania -- https://wikimania2017.wikimedia.org/wiki/Scholarships [19:07:19] That "apply in January, find out in May" timeline has been consistent for at least the last 4 Wikimanias [19:07:41] any insight in a diversity event? [19:08:24] the upcoming one in Europe? 
I haven't heard, but would expect it to be linked from the wiki page if there is a process [19:08:51] (03CR) 10Dzahn: [C: 032] "Ok, i guess there is one way to find out if there are others to exclude, enable it on all and see who alerts :)" [puppet] - 10https://gerrit.wikimedia.org/r/374050 (https://phabricator.wikimedia.org/T165348) (owner: 10Dzahn) [19:08:56] * bd808 hasn't heard a lot of details on that event [19:09:14] (03PS2) 10Dzahn: icinga/base: screen monitoring by default. whitelist copper, terbium [puppet] - 10https://gerrit.wikimedia.org/r/374050 (https://phabricator.wikimedia.org/T165348) [19:09:14] me either :/ [19:14:37] (03CR) 10EBernhardson: [C: 031] "based on docs looks reasonable" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/376568 (owner: 10Gehel) [19:16:03] (03PS2) 10Jforrester: Enable responsive reference columns on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371630 (https://phabricator.wikimedia.org/T173176) [19:16:04] (03PS1) 10Jforrester: Enable responsive reference columns on Wikitionaries and Wikivoyages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376573 [19:17:55] i'm adding a a new monitoring check on everything, that will detect long-running screen/tmux processes [19:18:17] there will be some false positives, but it will be easy to disable them with Hiera [19:18:26] just adding it on all first to see the special cases [19:18:32] will not page or anything [19:19:36] (T165348) [19:19:36] T165348: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348 [19:20:05] (03PS1) 10RobH: set deploy1001 isntall params [puppet] - 10https://gerrit.wikimedia.org/r/376576 (https://phabricator.wikimedia.org/T175288) [19:20:32] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#3589584 (10RobH) [19:22:49] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#3589585 (10RobH) I'm going to install with jessie, like tin and naos both presently have. I am guessing that an attempt to stre... [19:23:52] (03PS2) 10RobH: set deploy1001 isntall params [puppet] - 10https://gerrit.wikimedia.org/r/376576 (https://phabricator.wikimedia.org/T175288) [19:25:42] (03CR) 10RobH: [C: 032] set deploy1001 isntall params [puppet] - 10https://gerrit.wikimedia.org/r/376576 (https://phabricator.wikimedia.org/T175288) (owner: 10RobH) [19:27:12] PROBLEM - Long running screen/tmux on db1098 is CRITICAL: CRIT: Long running SCREEN process. (PID: 3880, 2687938s 10800s). [19:29:41] nice ^:) that's the new check [19:29:51] it found one, so it works [19:30:22] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table bawiktionary.l10n_cache doesnt exist on query. Default database: bawiktionary. [Query snipped] [19:30:42] PROBLEM - Long running screen/tmux on cp2004 is CRITICAL: CRIT: Long running SCREEN process. (PID: 42194, 4323807s 10800s). [19:31:41] PROBLEM - Long running screen/tmux on elastic1020 is CRITICAL: CRIT: Long running tmux process. (PID: 2779, 2063916s 10800s). [19:32:32] PROBLEM - Long running screen/tmux on einsteinium is CRITICAL: CRIT: Long running SCREEN process. (PID: 1514, 6672512s 10800s). [19:32:36] interesting.. let's see how many [19:33:27] oh nice check! [19:33:29] i like it! 
[19:35:52] 10Operations, 10monitoring: Review check_ping settings - https://phabricator.wikimedia.org/T173315#3589598 (10Dzahn) I think just a single ping is not enough and it would just be about reducing it to 2 or 3 packets instead of 5. [19:38:33] (03PS1) 10Dzahn: Revert "icinga/base: screen monitoring by default. whitelist copper, terbium" [puppet] - 10https://gerrit.wikimedia.org/r/376577 [19:38:58] (03CR) 10Dzahn: [C: 032] "requested by jynus" [puppet] - 10https://gerrit.wikimedia.org/r/376577 (owner: 10Dzahn) [19:40:06] (03CR) 10Dzahn: [V: 032 C: 032] Revert "icinga/base: screen monitoring by default. whitelist copper, terbium" [puppet] - 10https://gerrit.wikimedia.org/r/376577 (owner: 10Dzahn) [19:40:15] (03PS2) 10Dzahn: Revert "icinga/base: screen monitoring by default. whitelist copper, terbium" [puppet] - 10https://gerrit.wikimedia.org/r/376577 [19:43:59] RECOVERY - Long running screen/tmux on elastic1020 is OK: OK: Tmux detected but not long running. [19:44:29] PROBLEM - Long running screen/tmux on elastic1017 is CRITICAL: CRIT: Long running tmux process. (PID: 632, 2064684s 10800s). [19:44:29] PROBLEM - Long running screen/tmux on elastic1024 is CRITICAL: CRIT: Long running tmux process. (PID: 565, 2064684s 10800s). [19:44:29] PROBLEM - Long running screen/tmux on labstore2003 is CRITICAL: CRIT: Long running SCREEN process. (PID: 3732, 25697215s 10800s). [19:44:49] interesting new alert [19:45:19] PROBLEM - Long running screen/tmux on db1079 is CRITICAL: CRIT: Long running SCREEN process. (PID: 1409, 1331978s 10800s). [19:45:19] PROBLEM - Long running screen/tmux on db1038 is CRITICAL: CRIT: Long running SCREEN process. (PID: 31745, 21381994s 10800s). [19:45:19] PROBLEM - Long running screen/tmux on db2017 is CRITICAL: CRIT: Long running SCREEN process. (PID: 10297, 10411440s 10800s). [19:45:19] PROBLEM - Long running screen/tmux on install2002 is CRITICAL: CRIT: Long running SCREEN process. (PID: 31371, 1160920s 10800s). [19:45:20] PROBLEM - Long running screen/tmux on mwlog1001 is CRITICAL: CRIT: Long running SCREEN process. (PID: 118104, 12200342s 10800s). [19:45:29] PROBLEM - Long running screen/tmux on sarin is CRITICAL: CRIT: Long running SCREEN process. (PID: 25505, 705730s 10800s). [19:46:00] PROBLEM - Long running screen/tmux on db1102 is CRITICAL: CRIT: Long running SCREEN process. (PID: 22820, 3314008s 10800s). [19:46:00] PROBLEM - Long running screen/tmux on db1075 is CRITICAL: CRIT: Long running SCREEN process. (PID: 21735, 11711567s 10800s). [19:46:00] PROBLEM - Long running screen/tmux on elastic1025 is CRITICAL: CRIT: Long running tmux process. (PID: 17811, 2064777s 10800s). [19:46:00] PROBLEM - Long running screen/tmux on elastic1030 is CRITICAL: CRIT: Long running tmux process. (PID: 2806, 2064777s 10800s). [19:46:00] PROBLEM - Long running screen/tmux on db2072 is CRITICAL: CRIT: Long running SCREEN process. (PID: 11000, 4157216s 10800s). [19:46:00] PROBLEM - Long running screen/tmux on db2041 is CRITICAL: CRIT: Long running SCREEN process. (PID: 9458, 8396885s 10800s). [19:46:09] PROBLEM - Long running screen/tmux on netmon1002 is CRITICAL: CRIT: Long running SCREEN process. (PID: 10755, 4296614s 10800s). [19:46:09] PROBLEM - Long running screen/tmux on wdqs2001 is CRITICAL: CRIT: Long running tmux process. (PID: 9045, 6304394s 10800s). 
[19:46:43] i suppose that might catch things that are either long running and should have been puppetized, or those of us that might just forget [19:46:53] PROBLEM - Long running screen/tmux on db1086 is CRITICAL: CRIT: Long running SCREEN process. (PID: 6073, 38834396s 10800s). [19:46:53] PROBLEM - Long running screen/tmux on db1090 is CRITICAL: CRIT: Long running SCREEN process. (PID: 20482, 38922337s 10800s). [19:46:59] PROBLEM - Long running screen/tmux on labstore2004 is CRITICAL: CRIT: Long running SCREEN process. (PID: 11854, 26652024s 10800s). [19:47:00] PROBLEM - Long running screen/tmux on stat1004 is CRITICAL: CRIT: Long running tmux process. (PID: 6662, 3818003s 10800s). [19:47:00] PROBLEM - Long running screen/tmux on elastic1020 is CRITICAL: CRIT: Long running tmux process. (PID: 2779, 2064834s 10800s). [19:47:00] PROBLEM - Long running screen/tmux on neodymium is CRITICAL: CRIT: Long running SCREEN process. (PID: 29891, 132176s 10800s). [19:47:00] PROBLEM - Long running screen/tmux on wdqs2002 is CRITICAL: CRIT: Long running tmux process. (PID: 15508, 6642878s 10800s). [19:47:00] PROBLEM - Long running screen/tmux on thumbor2001 is CRITICAL: CRIT: Long running SCREEN process. (PID: 32988, 4937360s 10800s). [19:47:49] PROBLEM - Long running screen/tmux on db2018 is CRITICAL: CRIT: Long running SCREEN process. (PID: 26368, 9896116s 10800s). [19:47:50] PROBLEM - Long running screen/tmux on labsdb1001 is CRITICAL: CRIT: Long running SCREEN process. (PID: 32291, 39864216s 10800s). [19:47:50] PROBLEM - Long running screen/tmux on phab2001 is CRITICAL: CRIT: Long running SCREEN process. (PID: 14647, 2921175s 10800s). [19:47:50] PROBLEM - Long running screen/tmux on tin is CRITICAL: CRIT: Long running SCREEN process. (PID: 41626, 856878s 10800s). [19:47:50] PROBLEM - Long running screen/tmux on wdqs1001 is CRITICAL: CRIT: Long running tmux process. (PID: 3267, 5104729s 10800s). [19:47:50] PROBLEM - Long running screen/tmux on wdqs1003 is CRITICAL: CRIT: Long running tmux process. (PID: 12591, 5006292s 10800s). [19:48:31] ebernhardson: exactly, that was the purpose [19:48:35] it's being reverted though [19:48:43] killed the bot to stop spam [19:56:39] PROBLEM - Long running screen/tmux on elastic1029 is CRITICAL: CRIT: Long running tmux process. (PID: 33911, 2065397s 10800s). [19:57:07] heh [19:57:10] PROBLEM - Long running screen/tmux on db1085 is CRITICAL: CRIT: Long running SCREEN process. (PID: 17361, 38191834s 10800s). [19:57:20] can it call out the user who owns the login session too? :) [19:57:51] but probably the threshold should be a bit higher than 10800 [19:58:06] maybe something more like 6h? [19:58:29] oh, yea, easy to adjust , totally [19:58:39] hell even 24h :) [19:58:49] (there's a lot of legit cases for keeping a screen session up for most of a working day while waiting on system reimages or taking long log samples for investigation or whatever) [19:58:49] i wasnt even sure if we are talking minutes, hours or days [19:58:54] when we say "long" [19:59:25] but you've got some example alerts up there that are months [19:59:30] so there's definitely a need heh [20:00:04] Niharika: Dear anthropoid, the time has come. Please deploy Scholarships deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170907T2000). [20:00:04] 15 months for that one on db1090 heh [20:01:09] o/ jouncebot [20:01:25] ** PROBLEM alert - wdqs1002/Long running screen/tmux is CRITICAL ** <-- anybody knows what is this? 
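The two numbers in each of these alerts are the session's elapsed time and the threshold it was compared against (10800 s, i.e. 3 hours, in this first iteration, hence the discussion above about raising it). A rough sketch of what such a check can look like, in Python on top of ps(1); the output format mimics the alerts and the exit codes follow Nagios conventions, but the real plugin may differ.

```python
import subprocess
import sys

THRESHOLD = 10800  # seconds; the 3-hour value visible in the alerts above, later raised

def long_running_sessions(threshold=THRESHOLD):
    """Return (pid, command, elapsed_seconds) for screen/tmux processes older than threshold."""
    out = subprocess.check_output(['ps', '-eo', 'pid=,etimes=,comm='], text=True)
    found = []
    for line in out.splitlines():
        pid, etimes, comm = line.split(None, 2)
        if comm.lower().startswith(('screen', 'tmux')) and int(etimes) > threshold:
            found.append((int(pid), comm, int(etimes)))
    return found

if __name__ == '__main__':
    sessions = long_running_sessions()
    if sessions:
        pid, comm, elapsed = sessions[0]
        print('CRIT: Long running %s process. (PID: %d, %ds %ds).'
              % (comm, pid, elapsed, THRESHOLD))
        sys.exit(2)  # Nagios CRITICAL
    print('OK: No long running screen/tmux processes.')
    sys.exit(0)
```

Raising the threshold, as discussed above (6h, 24h, or more), only changes the THRESHOLD constant; per-host exceptions would still have to be handled outside the check itself, e.g. via the Hiera whitelist mentioned earlier.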
[20:02:52] SMalyshev: it's detecting tmux and screen processes that are running for a while [20:03:18] why do we have so many running screens? [20:03:30] SMalyshev: it's being reverted for now, but the purpose is https://phabricator.wikimedia.org/T165348 [20:03:35] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#3589657 (10RobH) [20:03:46] Wrong window, MaxSem. Fixed now. [20:04:39] why do we setup the new deployment server with a function1001 name, not a element like before? [20:04:54] Sagan: partially it's just a matter of scale.. we have over 1000 servers, so even a small percentage looks many.. and then it's also some stuff that could be puppetized probably [20:05:08] Sagan: because element names are not useful/easily known and we're reusing them already and that causes confusion [20:05:28] greg-g: sad, I like that kind of naming more :o [20:05:54] it doesn't scale, it's great for your home network, but not a huge operation like WMF :) [20:05:54] greg-g: and I adapted it for my own and my labs servers as well [20:06:00] :o [20:06:02] greg-g: Can we have a more fun name for mwdebug1002? :P [20:06:08] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#3589662 (10RobH) a:05RobH>03Cmjohnson Ok, I've pinged Chris about this in IRC, and assigning this to him for some followup:... [20:06:17] Niharika: whattheheckiswrong1001? [20:06:35] doesthiswork1001? [20:06:41] :D That works. [20:06:51] failure1002 [20:06:52] 911greg1001 [20:07:04] yayerrors1001 [20:07:12] goatland? [20:07:21] (03PS3) 10Rush: openstack: cleanup keystone references in old module [puppet] - 10https://gerrit.wikimedia.org/r/376531 (https://phabricator.wikimedia.org/T171494) [20:07:36] (03CR) 10Ottomata: [WIP] Initial commit of certpy (0310 comments) [software/certpy] - 10https://gerrit.wikimedia.org/r/359960 (https://phabricator.wikimedia.org/T166167) (owner: 10Ottomata) [20:07:45] Hi there. Your changes are live in goatland. Please test. [20:08:00] oh, right. staging = Goat Simulator [20:08:13] (03CR) 10Rush: [C: 032] openstack: cleanup keystone references in old module [puppet] - 10https://gerrit.wikimedia.org/r/376531 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [20:08:16] and now we have a new thing for the bash tool :o [20:08:24] Oh no, you just killed all the goats. [20:08:57] (03PS8) 10Ottomata: Initial commit of cergen [software/certpy] - 10https://gerrit.wikimedia.org/r/359960 (https://phabricator.wikimedia.org/T166167) [20:09:54] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3589673 (10jcrespo) I think we should not waste icinga cycles on checking long running screens every some time. This would be ok as a report/email. We ignore emails? Is it worth... 
[20:14:14] 10Operations, 10OCG-General, 10Reading-Community-Engagement, 10Epic, and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#3589716 (10ovasileva) [20:14:58] 10Operations, 10Traffic: Removing support for AES128-SHA TLS cipher - https://phabricator.wikimedia.org/T147202#3589720 (10BBlack) I've re-done some of the sampled/informal `AES128-SHA` analysis from before, since it hasn't been done in about a year, and the past results were never recorded in detail. This is... [20:25:52] (03PS1) 10MaxSem: Enable ACW on test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376583 (https://phabricator.wikimedia.org/T175302) [20:32:54] (03PS2) 10MaxSem: Enable ACW on test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376583 (https://phabricator.wikimedia.org/T175302) [20:33:07] (03PS2) 10Thcipriani: Deploy iegreview with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/375112 (https://phabricator.wikimedia.org/T129154) (owner: 10BryanDavis) [20:33:49] (03CR) 10jerkins-bot: [V: 04-1] Deploy iegreview with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/375112 (https://phabricator.wikimedia.org/T129154) (owner: 10BryanDavis) [20:35:52] (03PS3) 10Thcipriani: Deploy iegreview with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/375112 (https://phabricator.wikimedia.org/T129154) (owner: 10BryanDavis) [20:37:22] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3589768 (10Dzahn) >>! In T165348#3589673, @jcrespo wrote: > I think we should not waste icinga cycles on checking long running screens every some time. I think Icinga is the t... [20:40:05] (03CR) 10Thcipriani: [C: 031] "Realized this repo is now on phabricator so we'll need to ensure origin is set to phabricator to make sure git repo origin is set correctl" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/375112 (https://phabricator.wikimedia.org/T129154) (owner: 10BryanDavis) [20:42:18] (03CR) 10Dzahn: "oops, thanks for testing!" 
[puppet] - 10https://gerrit.wikimedia.org/r/368196 (owner: 10Paladox) [20:44:46] (03PS9) 10Ottomata: Initial commit of cergen [software/certpy] - 10https://gerrit.wikimedia.org/r/359960 (https://phabricator.wikimedia.org/T166167) [20:47:32] (03PS4) 10BryanDavis: Deploy iegreview with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/375112 (https://phabricator.wikimedia.org/T129154) [20:48:52] (03CR) 10BryanDavis: Deploy iegreview with scap3 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/375112 (https://phabricator.wikimedia.org/T129154) (owner: 10BryanDavis) [20:53:34] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3589791 (10Pchelolo) [20:54:44] (03PS8) 10Paladox: Phabricator: Redirect all http traffic to https [puppet] - 10https://gerrit.wikimedia.org/r/354247 (https://phabricator.wikimedia.org/T165643) [20:54:50] (03PS10) 10Paladox: Gerrit: Enable logstash by default for prod gerrit [puppet] - 10https://gerrit.wikimedia.org/r/332531 (https://phabricator.wikimedia.org/T141324) [20:54:56] (03PS9) 10Paladox: Gerrit: Remove ldap user and password from secure.config [puppet] - 10https://gerrit.wikimedia.org/r/366910 [20:55:23] (03CR) 10Thcipriani: [C: 04-1] "one last thing: phab repo url vs gerrit repo url" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/375112 (https://phabricator.wikimedia.org/T129154) (owner: 10BryanDavis) [20:55:26] (03PS3) 10Paladox: Gerrit: Remove velocity templates but keep the ones for its-base [puppet] - 10https://gerrit.wikimedia.org/r/353693 (https://phabricator.wikimedia.org/T158008) [20:55:49] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Remove velocity templates but keep the ones for its-base [puppet] - 10https://gerrit.wikimedia.org/r/353693 (https://phabricator.wikimedia.org/T158008) (owner: 10Paladox) [20:56:24] (03PS4) 10Paladox: Gerrit: Remove velocity templates but keep the ones for its-base [puppet] - 10https://gerrit.wikimedia.org/r/353693 (https://phabricator.wikimedia.org/T158008) [20:56:52] (03PS12) 10Paladox: Gerrit: Set gitBasicAuthPolicy to HTTP [puppet] - 10https://gerrit.wikimedia.org/r/350484 [20:57:20] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Set gitBasicAuthPolicy to HTTP [puppet] - 10https://gerrit.wikimedia.org/r/350484 (owner: 10Paladox) [20:57:54] (03PS13) 10Paladox: Gerrit: Set gitBasicAuthPolicy to HTTP [puppet] - 10https://gerrit.wikimedia.org/r/350484 [20:59:32] (03PS5) 10BryanDavis: Deploy iegreview with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/375112 (https://phabricator.wikimedia.org/T129154) [21:00:04] Niharika: Respected human, time to deploy Scholarships deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170907T2100). Please do the needful. [21:00:20] !log T169940: Disabling changeprop in RESTBase dev environment [21:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:35] T169940: End of September milestone: Start migration of production use cases. 
- https://phabricator.wikimedia.org/T169940 [21:01:26] (03PS10) 10Ottomata: Initial commit of cergen [software/certpy] - 10https://gerrit.wikimedia.org/r/359960 (https://phabricator.wikimedia.org/T166167) [21:01:27] (03CR) 10Thcipriani: [C: 031] Deploy iegreview with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/375112 (https://phabricator.wikimedia.org/T129154) (owner: 10BryanDavis) [21:01:32] !log T169940: Disabling puppet and restbase service in RESTBase dev [21:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:08] PROBLEM - Check systemd state on restbase-dev1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:05:08] PROBLEM - Restbase root url on restbase-dev1006 is CRITICAL: connect to address 10.64.48.10 and port 7231: Connection refused [21:05:17] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:05:20] sorry, got this ^^^ [21:05:28] PROBLEM - Restbase root url on restbase-dev1005 is CRITICAL: connect to address 10.64.16.96 and port 7231: Connection refused [21:05:47] PROBLEM - Restbase root url on restbase-dev1004 is CRITICAL: connect to address 10.64.0.89 and port 7231: Connection refused [21:05:47] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:07:37] ACKNOWLEDGEMENT - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. eevans Disabled while reinitializing (T169940) [21:07:38] ACKNOWLEDGEMENT - Restbase root url on restbase-dev1004 is CRITICAL: connect to address 10.64.0.89 and port 7231: Connection refused eevans Disabled while reinitializing (T169940) [21:07:38] ACKNOWLEDGEMENT - Check systemd state on restbase-dev1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. eevans Disabled while reinitializing (T169940) [21:07:38] ACKNOWLEDGEMENT - Restbase root url on restbase-dev1005 is CRITICAL: connect to address 10.64.16.96 and port 7231: Connection refused eevans Disabled while reinitializing (T169940) [21:07:38] ACKNOWLEDGEMENT - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. eevans Disabled while reinitializing (T169940) [21:07:38] ACKNOWLEDGEMENT - Restbase root url on restbase-dev1006 is CRITICAL: connect to address 10.64.48.10 and port 7231: Connection refused eevans Disabled while reinitializing (T169940) [21:09:11] (03PS1) 10Dzahn: base/icinga: increase warn/crit thresholds for screen monitoring [puppet] - 10https://gerrit.wikimedia.org/r/376590 (https://phabricator.wikimedia.org/T165348) [21:10:34] (03CR) 10jerkins-bot: [V: 04-1] base/icinga: increase warn/crit thresholds for screen monitoring [puppet] - 10https://gerrit.wikimedia.org/r/376590 (https://phabricator.wikimedia.org/T165348) (owner: 10Dzahn) [21:12:04] how's that even possible when changing a number ... 
eh [21:12:44] oh, the commit message, duh [21:13:43] (03PS2) 10Dzahn: base/icinga: increase warn/crit thresholds for screen monitoring [puppet] - 10https://gerrit.wikimedia.org/r/376590 (https://phabricator.wikimedia.org/T165348) [21:14:03] (03CR) 10Niharika29: [C: 032] Enable ACW on test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376583 (https://phabricator.wikimedia.org/T175302) (owner: 10MaxSem) [21:15:23] (03CR) 10Ottomata: "Ok! Volans, i've responded to all of your initial comments in the newer patchset." [software/certpy] - 10https://gerrit.wikimedia.org/r/359960 (https://phabricator.wikimedia.org/T166167) (owner: 10Ottomata) [21:15:48] (03Abandoned) 10Ottomata: Initial commit of cergen [software/certpy] - 10https://gerrit.wikimedia.org/r/359960 (https://phabricator.wikimedia.org/T166167) (owner: 10Ottomata) [21:16:52] (03Merged) 10jenkins-bot: Enable ACW on test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376583 (https://phabricator.wikimedia.org/T175302) (owner: 10MaxSem) [21:17:02] (03CR) 10jenkins-bot: Enable ACW on test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376583 (https://phabricator.wikimedia.org/T175302) (owner: 10MaxSem) [21:17:10] (03CR) 10Dzahn: [C: 032] "this is currently not applied (except on a single test host that opts-in), i just wanted to let you reviewers know i am raising the limit " [puppet] - 10https://gerrit.wikimedia.org/r/376590 (https://phabricator.wikimedia.org/T165348) (owner: 10Dzahn) [21:19:25] (03CR) 10Dzahn: "and it was partially to point out where the setting is, we can also set it even higher" [puppet] - 10https://gerrit.wikimedia.org/r/376590 (https://phabricator.wikimedia.org/T165348) (owner: 10Dzahn) [21:22:59] (03PS1) 10Chad: group2 to wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376606 [21:23:52] !log demon@tin Synchronized php-1.30.0-wmf.17/includes/api/ApiQueryRecentChanges.php: T175307 (duration: 00m 49s) [21:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:05] T175307: "$row does not contain fields needed for comment rc_comment" from list=recentchanges - https://phabricator.wikimedia.org/T175307 [21:26:44] 10Operations, 10Ops-Access-Requests: Requesting access to Production shell access for rho (Rita Ho) - https://phabricator.wikimedia.org/T175204#3589877 (10Dzahn) a:05ema>03Dzahn [21:27:40] 10Operations, 10Ops-Access-Requests: Requesting access to Production shell access for rho (Rita Ho) - https://phabricator.wikimedia.org/T175204#3586102 (10Dzahn) @RHo Looks good! thank you. I'll prepare a code change to add your account and upload it to code review. [21:38:12] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3589884 (10Pchelolo) [21:40:02] !log niharika29@tin Started scap: Deploying ArticleCreation Workflow on test2 wiki TT175302 [21:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:17] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2028959 [21:51:38] (03CR) 10Volans: "I've done a quick pass, see some comments inline." 
(0312 comments) [puppet] - 10https://gerrit.wikimedia.org/r/376566 (https://phabricator.wikimedia.org/T160060) (owner: 10Herron) [21:53:58] (03CR) 10Chad: [C: 032] group2 to wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376606 (owner: 10Chad) [21:55:38] (03Merged) 10jenkins-bot: group2 to wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376606 (owner: 10Chad) [21:56:32] jouncebot: Now [21:56:32] For the next 0 hour(s) and 3 minute(s): Scholarships deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170907T2100) [21:56:53] (03CR) 10jenkins-bot: group2 to wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376606 (owner: 10Chad) [21:57:00] Niharika: Um..... [21:57:18] @jouncebot: next [21:57:24] no_justification: Hmm? Your window was up 57 minutes ago, sire. :P [21:57:26] jouncebot: next [21:57:26] In 0 hour(s) and 2 minute(s): ArticleCreationWorkflow (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170907T2200) [21:57:27] jouncebot: next [21:57:27] In 0 hour(s) and 2 minute(s): ArticleCreationWorkflow (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170907T2200) [21:57:31] Yes. Your window did not include scap [21:57:46] sorry, we just started the next window a bit earlier [21:57:53] no_justification: I did MaxSem's work as part of my window too. So almost over. [21:58:00] Ok. [21:58:38] (03PS1) 10Dzahn: icinga/base: turn screen monitoring into a WARN-only check [puppet] - 10https://gerrit.wikimedia.org/r/376636 (https://phabricator.wikimedia.org/T165348) [22:00:05] No patches in the queue for this window. Wheeee! [22:00:25] Hah, today I learned that the deploy calendar highlights both rows when it's *right* on the verge of the window [22:00:42] Meh. Not what you were supposed to say, jouncebot. [22:01:53] (03CR) 10jerkins-bot: [V: 04-1] icinga/base: turn screen monitoring into a WARN-only check [puppet] - 10https://gerrit.wikimedia.org/r/376636 (https://phabricator.wikimedia.org/T165348) (owner: 10Dzahn) [22:03:02] !log niharika29@tin Finished scap: Deploying ArticleCreation Workflow on test2 wiki TT175302 (duration: 23m 00s) [22:03:06] no_justification: I'm done. I have a patch that's merged and needs to go out on test2 wiki. I suppose the train will take care of that? If not I can do it after. [22:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:39] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3589982 (10Pchelolo) [22:05:35] jenkins-bot, what do you mean not FAIL or OK but ABORTED [22:07:05] Niharika: No, train will not [22:07:11] All I'm doing is wikiversions.json updates [22:07:19] Go ahead and get your patch out..... [22:07:26] (03CR) 10Dzahn: "what's up with CI saying "22:01:51 Finished: ABORTED" here instead of a FAIL or OK? 
it's also grey color instead of red or green, i don't " [puppet] - 10https://gerrit.wikimedia.org/r/376636 (https://phabricator.wikimedia.org/T165348) (owner: 10Dzahn) [22:07:51] mutante: run recheck [22:08:10] if it doesn't work i'll see if i can't figure it out mutante [22:08:17] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/376636 (https://phabricator.wikimedia.org/T165348) (owner: 10Dzahn) [22:08:23] Zppix: 'k:) [22:08:28] thx [22:08:30] np [22:08:31] \ [22:10:44] mutante: it worked now [22:11:14] my guess the test timed out mutante [22:12:06] indeed it did, glitch :) [22:12:26] Zppix: sounds likely, exactly after 3m, yea [22:14:32] mutante: if it does it again i'd either let releng know or open a task just to let them know docker is doing this (docker is something we're experimenting with atm) [22:15:46] Zppix: yes, sounds right, ack [22:23:07] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [22:24:25] (03CR) 10Dzahn: "to review this you can check that the files i am adding are identical and mere copies of squid.php and squid-labs.php. that's all basicall" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376337 (https://phabricator.wikimedia.org/T104148) (owner: 10Dzahn) [22:24:48] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [22:25:05] MaxSem: Hmm....."Current branch wmf/1.30.0-wmf.17 is up to date." [22:26:47] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [22:28:17] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [22:28:53] uhm, i'll assume that is deployment related and just a spike [22:29:00] for now [22:36:57] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:37:07] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:37:25] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3590042 (10Dzahn) As suggested by @jcrespo i'll run a pre-report of hosts that will alert and send it to list, so we can sort out what should be whitelisted before enabling this...
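For context on the screen/tmux monitoring change discussed above (gerrit 376590, T165348): the check alerts on the age of long-running detached screen/tmux sessions, and the patches adjust its warn/crit thresholds. The following is a rough, hypothetical sketch of how such a Nagios-style check typically works; the threshold values, process matching, and output format are assumptions, not the actual Puppet-managed script.

```python
#!/usr/bin/env python
# Hypothetical sketch of a Nagios-style check for long-running screen/tmux
# sessions. Thresholds, matching and output are assumptions for illustration.
import subprocess
import sys

WARN_SECONDS = 2 * 86400   # assumed: warn after 2 days
CRIT_SECONDS = 5 * 86400   # assumed: crit after 5 days

def oldest_session_age():
    """Return the elapsed seconds of the oldest screen/tmux process, or 0."""
    out = subprocess.check_output(
        ["ps", "-eo", "etimes,comm"], universal_newlines=True)
    ages = []
    for line in out.splitlines()[1:]:          # skip the header row
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        etimes, comm = parts
        if comm.strip().lower() in ("screen", "tmux", "tmux: server"):
            ages.append(int(etimes))
    return max(ages) if ages else 0

def main():
    age = oldest_session_age()
    if age >= CRIT_SECONDS:
        print("CRITICAL: oldest screen/tmux session is %ds old" % age)
        sys.exit(2)
    if age >= WARN_SECONDS:
        print("WARNING: oldest screen/tmux session is %ds old" % age)
        sys.exit(1)
    print("OK: no screen/tmux session older than %ds" % WARN_SECONDS)
    sys.exit(0)

if __name__ == "__main__":
    main()
```

Raising the thresholds, as in the change above, simply means long-lived but legitimate sessions stop tripping WARN/CRIT; hosts that still alert can then be whitelisted, as described in the pre-report comment.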
[22:37:27] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:38:00] yep, good [22:38:27] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:40:58] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [22:41:23] !log niharika29@tin Synchronized php-1.30.0-wmf.17/extensions/ArticleCreationWorkflow/: Bump extension version (duration: 00m 49s) [22:41:27] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [22:41:27] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [22:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:07] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [22:42:14] (03PS1) 10Nuria: Add cron to purge old mediawiki data snapshots [puppet] - 10https://gerrit.wikimedia.org/r/376640 (https://phabricator.wikimedia.org/T162034) [22:42:40] (03CR) 10jerkins-bot: [V: 04-1] Add cron to purge old mediawiki data snapshots [puppet] - 10https://gerrit.wikimedia.org/r/376640 (https://phabricator.wikimedia.org/T162034) (owner: 10Nuria) [22:46:28] (03PS2) 10Nuria: Add cron to purge old mediawiki data snapshots [puppet] - 10https://gerrit.wikimedia.org/r/376640 (https://phabricator.wikimedia.org/T162034) [22:46:54] (03CR) 10jerkins-bot: [V: 04-1] Add cron to purge old mediawiki data snapshots [puppet] - 10https://gerrit.wikimedia.org/r/376640 (https://phabricator.wikimedia.org/T162034) (owner: 10Nuria) [22:47:07] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:47:57] (03PS1) 10Dzahn: admins: create user account for Rita Ho (rho) [puppet] - 10https://gerrit.wikimedia.org/r/376641 (https://phabricator.wikimedia.org/T175204) [22:50:07] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:50:27] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:50:37] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:00:04] No patches in the queue for this window. Wheeee! [23:06:30] i love it [23:06:33] :P [23:07:17] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0 [23:15:27] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [23:16:57] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [23:17:37] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [23:18:48] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] [23:19:11] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 to wmf.17 [23:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:29] hi [23:23:02] bblack: so i called right before that next log line .. 
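The "HTTP 5xx reqs/min" PROBLEM/RECOVERY notifications above come from a Graphite-backed Icinga check that trips when a given percentage of recent datapoints exceeds a threshold, e.g. "10.00% of data above the critical threshold [1000.0]". Below is a hedged sketch of that kind of logic; the endpoint, metric name, window and exact percentages are illustrative assumptions, not the production check_graphite configuration.

```python
#!/usr/bin/env python
# Sketch of a "percent of datapoints over threshold" Graphite check.
# Endpoint, metric and parameters are made up for illustration.
import json
import sys
import urllib.request

GRAPHITE = "https://graphite.example.org"   # assumed endpoint
METRIC = "reqstats.5xx"                     # assumed metric name
WINDOW = "-10min"                           # look at the last 10 minutes
CRIT_VALUE, CRIT_PERCENT = 1000.0, 10.0     # CRIT if >=10% of points > 1000
WARN_VALUE, WARN_PERCENT = 250.0, 10.0      # WARN if >=10% of points > 250

def percent_over(points, value):
    """Percentage of non-null datapoints whose value exceeds `value`."""
    vals = [v for v, _ts in points if v is not None]
    if not vals:
        return 0.0
    return 100.0 * sum(1 for v in vals if v > value) / len(vals)

def main():
    url = "%s/render?target=%s&from=%s&format=json" % (GRAPHITE, METRIC, WINDOW)
    series = json.loads(urllib.request.urlopen(url).read().decode("utf-8"))
    points = series[0]["datapoints"] if series else []
    crit = percent_over(points, CRIT_VALUE)
    warn = percent_over(points, WARN_VALUE)
    if crit >= CRIT_PERCENT:
        print("CRITICAL: %.2f%% of data above the critical threshold [%s]"
              % (crit, CRIT_VALUE))
        sys.exit(2)
    if warn >= WARN_PERCENT:
        print("WARNING: %.2f%% of data above the warning threshold [%s]"
              % (warn, WARN_VALUE))
        sys.exit(1)
    print("OK: Less than %.2f%% above the threshold [%s]"
          % (WARN_PERCENT, WARN_VALUE))
    sys.exit(0)

if __name__ == "__main__":
    main()
```

Because the check looks at a window of datapoints, short intermittent spikes can flap between PROBLEM and RECOVERY, which is exactly the pattern seen in the notifications that follow.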
so there is a running upgrade that i didn't consider [23:23:28] yeah, so [23:23:33] it's text cluster 5xx spikes [23:23:48] they don't look very normal, is there something to revert, or? [23:24:06] they are 503s as opposed to 500s [23:24:08] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds [23:24:08] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds [23:24:12] earlier there were the spikes i considered normal. that went up to just 10% and recovered [23:24:15] after the deploy [23:24:23] but then there were the additional ones with 50% [23:24:26] but I didn't think our push process was that noisy on causing problems [23:24:37] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds [23:24:37] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds [23:24:42] i'm sorry if it was still just normal spikes, yea [23:24:43] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=2&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5&from=1504821102469&to=1504826599346 [23:24:47] PROBLEM - salt-minion processes on stat1005 is CRITICAL: Return code of 255 is out of bounds [23:24:47] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds [23:24:54] it seems problematic to me [23:24:57] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds [23:24:58] ok [23:25:07] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds [23:25:10] started picking up around 22:20, it's been building for a while? [23:25:53] there's been scaps going on since then? [23:26:22] no_justification: are you scapping right now [23:26:35] All I did was a sync-wikiversions [23:27:07] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:27:21] saw .. hm.. "Synchronized php-1.30.0-wmf.17/extensions/ArticleCreationWorkflow/: Bump extension version" [23:27:26] That was not me [23:27:30] Niharika: ^^ [23:27:44] but it just recovered too.. [23:27:47] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:27:49] no_justification: Yeah. I did that. [23:27:58] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:28:02] Something blew up? Do I get a tee now? [23:28:24] yes, something is gently blowing up since a little over an hour ago [23:28:38] the first indicators of the issue are around 22:20, but it's intermittent [23:28:42] bblack: Um, we added an extension and enabled it on test2 wiki. [23:28:44] i was simply looking at the last syncs [23:28:50] from /last logmsgbot [23:28:58] but yea, it also said "test" [23:29:37] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:29:43] yeah to recap those: [23:29:46] 21:23 < logmsgbot> !log demon@tin Synchronized php-1.30.0-wmf.17/includes/api/ApiQueryRecentChanges.php: T175307 (duration: 00m 49s) [23:29:47] T175307: "$row does not contain fields needed for comment rc_comment" from list=recentchanges - https://phabricator.wikimedia.org/T175307 [23:29:49] 21:40 < logmsgbot> !log niharika29@tin Started scap: Deploying ArticleCreation Workflow on test2 wiki TT175302 [23:29:51] ApiQueryRecentChanges.php before that?
22:03 < logmsgbot> !log niharika29@tin Finished scap: Deploying ArticleCreation Workflow on test2 wiki TT175302 (duration: 23m 00s) [23:29:55] 22:41 < logmsgbot> !log niharika29@tin Synchronized php-1.30.0-wmf.17/extensions/ArticleCreationWorkflow/: Bump extension version (duration: 00m 49s) [23:29:58] 23:19 < logmsgbot> !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 to wmf.17 [23:30:10] the apiqueryrecentchanges thing is about 1h before the issue picks up, but sure it could be [23:30:27] bblack, doesn't look like it's coming from MW: https://logstash.wikimedia.org/app/kibana#/dashboard/Fatal-Monitor?_g=h@44136fa&_a=h@15c1f8f [23:30:41] well the spikes are 503s, not 500s [23:30:59] so if it's MW, it's not necessarily something MW's logging, it could be the MW processes are stalling out or failing [23:32:02] hhvm logs are also aggregated there, so we'd see if there were events that hhvm notices [23:32:29] e.g. we saw all those LightProcess crashes [23:32:48] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient [23:32:57] RECOVERY - salt-minion processes on stat1005 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:32:57] RECOVERY - DPKG on stat1005 is OK: All packages OK [23:33:08] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 28 minutes ago with 0 failures [23:33:17] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [23:33:27] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [23:33:27] RECOVERY - Disk space on stat1005 is OK: DISK OK [23:33:47] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up [23:34:56] 19:14 < slakr> mine was > Request from 174.133.54.58 via cp1066 cp1066, Varnish XID 747700319 Error: 503, Backend fetch failed at Thu, 07 Sep 2017 23:13:23 GMT [23:35:01] from #wikimedia-tech earlier [23:36:08] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [23:36:16] half the 503s are on /w/api.php reqs [23:36:47] does that make ApiQueryRecentChanges.php more likely? [23:37:16] I don't know [23:37:47] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [23:37:48] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [23:38:07] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [23:38:27] it seems the only sync during that timeframe though as well.. would sync 1.30.0-wmf.17 influence prod like that [23:38:40] no_justification: ^ [23:38:48] could still be other things, too, still digging around [23:39:00] when I hear recentchanges I think DB timeouts, but doesn't look like so [23:39:56] We can roll ApiQueryRecentChanges back if need be. cc anomie
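Observations like "half the 503s are on /w/api.php reqs" (and, just below, a single backend cache node being implicated) typically come from tallying sampled request logs by URL path and by the cache host that served the error. The sketch below is hypothetical: the JSON-lines input format and field names are assumptions for illustration, not the actual varnishlog/webrequest tooling used in this incident.

```python
#!/usr/bin/env python
# Hypothetical tally of sampled 503 responses by cache host and URL path,
# to spot a single misbehaving backend or a single hot endpoint.
# Input: JSON lines with assumed fields 'http_status', 'hostname', 'uri_path'.
import json
import sys
from collections import Counter

def main(path):
    by_host, by_path = Counter(), Counter()
    with open(path) as f:
        for line in f:
            try:
                rec = json.loads(line)
            except ValueError:
                continue
            if str(rec.get("http_status")) != "503":
                continue
            by_host[rec.get("hostname", "unknown")] += 1
            by_path[rec.get("uri_path", "unknown")] += 1
    print("503s by cache host:")
    for host, n in by_host.most_common(10):
        print("  %6d  %s" % (n, host))
    print("503s by URL path:")
    for p, n in by_path.most_common(10):
        print("  %6d  %s" % (n, p))

if __name__ == "__main__":
    main(sys.argv[1])
```

If one host dominates the first tally while the second tally is diverse, the host is suspect; if one path dominates, the request itself is.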
[23:40:56] no, I found a real lead, this could be a failure in a cache node [23:41:26] almost all the 503s in the batches I've looked at so far, there's mostly a singular eqiad backend cache node implicated, which is trying to communicate to MW [23:41:31] so probably, that node [23:42:00] both user reports on the other channel mentioned cp1066 [23:42:12] yeah that one [23:42:35] !log cp1066 - depooled all traffic services, seems to have some bug causing 503 spikes [23:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:17] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:44:08] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:45:47] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:45:53] weee? [23:45:57] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:45:59] :) [23:47:18] * mutante thinks 'wait a minute, cp1066, isn't that cp1066 the one that always broke.. searches phabricator.. finds "cp1066.mgmt.eqiad.wmnet is unreachable" "cp1066 troubleshoot IPMI issue"' [23:47:58] yeah, I'm not sure what's going on there [23:48:05] there's no hard crashes in history that I've found [23:48:19] there are some weird udev task hangs near the last bootup ~49 days ago [23:49:22] it sounds like "just reboot it" and that would actually fix it but also be sad [23:49:42] yeah I doubt it, it seems like something's hardware-wrong, but perhaps subtly? [23:50:17] in the detailed varnish stats, you can see anomalous patterns start developing as far back as ~16:00 [23:50:18] well yea, i mean that history of other issues, did it have mainboard replaced before [23:50:23] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp1066&var-datasource=eqiad%20prometheus%2Fops [23:50:36] (in the memory, the storage, etc) [23:50:58] network i/o too [23:51:48] hmmm maybe 16:00 was just a process restart [23:54:45] when looking at those tickets in more detail.. they were both resolved after reseating cables.. not actually one of those board replacements or any of that [23:59:24] sometimes, it's not the host but the traffic chashing, of course [23:59:33] in which case this may recur and start implicating another specific backend [23:59:52] which really means it's fetches of a particular URL that are causing havoc for whichever varnishd they're chashed to
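On that closing point about "traffic chashing": frontend caches pick a backend per request by consistent-hashing the URL, so a pathological URL keeps landing on the same backend varnish, and depooling that node just moves the problem to whichever node the URL hashes to next. Below is a toy illustration of that mapping, not Varnish's actual vslp/shard director; the backend names other than cp1066 are examples.

```python
#!/usr/bin/env python
# Toy consistent-hash ring: the same URL always maps to the same backend,
# which is why 503s concentrated on one node can mean either a sick host
# or a poisonous URL that happens to hash there.
import bisect
import hashlib

BACKENDS = ["cp1052", "cp1053", "cp1065", "cp1066", "cp1067", "cp1068"]  # example names
POINTS_PER_BACKEND = 128  # virtual nodes to smooth the distribution

def _hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

# Build the hash ring once: many virtual points per backend, sorted by hash.
ring = sorted((_hash("%s-%d" % (b, i)), b)
              for b in BACKENDS for i in range(POINTS_PER_BACKEND))
ring_keys = [h for h, _ in ring]

def backend_for(url):
    """Return the backend a URL consistently maps to."""
    idx = bisect.bisect(ring_keys, _hash(url)) % len(ring)
    return ring[idx][1]

if __name__ == "__main__":
    for u in ["/w/api.php?action=query&list=recentchanges",
              "/wiki/Main_Page",
              "/w/load.php?modules=startup"]:
        print(u, "->", backend_for(u))
```

The mapping only changes when the backend set changes, so if the root cause is a particular fetch rather than cp1066's hardware, the same symptom can reappear on another backend after the depool, exactly as the last lines of the log anticipate.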