[00:00:04] twentyafterfour: #bothumor My software never has bugs. It just develops random features. Rise for Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180503T0000). [00:02:49] !log no phabricator upgrade tonight. [00:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:25] (03CR) 10Krinkle: Forward response codes >= 400 on search.wikimedia.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430502 (https://phabricator.wikimedia.org/T179266) (owner: 10EBernhardson) [00:14:04] RoanKattouw: sanity check to make sure scap is free? [00:14:31] Yup go for it [00:15:48] RoanKattouw: btw, when you have a minute - https://gerrit.wikimedia.org/r/429124 and/or https://gerrit.wikimedia.org/r/428406 is ready for review. The other commits after this we can probably peer-review within the team. [00:20:36] !log mw2180,mw2181,mw2182 - reinstalling with stretch (in case there are alerts that's why) [00:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:22:59] Staging on mwdebug1002 [00:33:49] !log krinkle@tin Synchronized php-1.32.0-wmf.1/extensions/NavigationTiming/modules/: Ie77e77de3b8 (duration: 01m 18s) [00:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:34:04] (03PS1) 10Ayounsi: Add loopback IPs for cr3/4-ulsfo [dns] - 10https://gerrit.wikimedia.org/r/430517 (https://phabricator.wikimedia.org/T189552) [00:35:10] (03CR) 10Ayounsi: [C: 032] Add loopback IPs for cr3/4-ulsfo [dns] - 10https://gerrit.wikimedia.org/r/430517 (https://phabricator.wikimedia.org/T189552) (owner: 10Ayounsi) [00:39:40] (03PS1) 10Dzahn: rename wmf6936 from mw1297 to mwmaint1001 [dns] - 10https://gerrit.wikimedia.org/r/430518 (https://phabricator.wikimedia.org/T192457) [00:40:43] (03PS2) 10Dzahn: rename wmf6936 from mw1297 to mwmaint1001 [dns] - 10https://gerrit.wikimedia.org/r/430518 (https://phabricator.wikimedia.org/T192457) [00:48:25] 
10Operations, 10Patch-For-Review: setup replacement for terbium (maintenance_server) on stretch - https://phabricator.wikimedia.org/T192092#4177075 (10Dzahn) 05stalled>03Open [00:54:27] (03PS1) 10Dzahn: rename mw1297 to mwmaint1001, assign mw-maint role [puppet] - 10https://gerrit.wikimedia.org/r/430519 (https://phabricator.wikimedia.org/T192185) [00:55:08] (03CR) 10jerkins-bot: [V: 04-1] rename mw1297 to mwmaint1001, assign mw-maint role [puppet] - 10https://gerrit.wikimedia.org/r/430519 (https://phabricator.wikimedia.org/T192185) (owner: 10Dzahn) [01:00:45] 10Operations, 10Patch-For-Review: setup replacement for terbium (maintenance_server) on stretch - https://phabricator.wikimedia.org/T192092#4177088 (10Dzahn) [01:01:31] (03PS1) 10Dzahn: add mwmaint1001 to scap hosts [puppet] - 10https://gerrit.wikimedia.org/r/430521 (https://phabricator.wikimedia.org/T192092) [01:05:32] (03PS1) 10Dzahn: network: add mwmaint1001 to network constants [puppet] - 10https://gerrit.wikimedia.org/r/430522 (https://phabricator.wikimedia.org/T192092) [01:07:44] (03PS1) 10Dzahn: admin: update comments about terbium to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/430523 (https://phabricator.wikimedia.org/T192092) [01:12:00] (03PS1) 10Dzahn: mariadb: add mwmaint1001 to grants for production-m5 [puppet] - 10https://gerrit.wikimedia.org/r/430524 (https://phabricator.wikimedia.org/T192092) [01:13:25] (03PS2) 10Dzahn: mariadb: add mwmaint1001 to grants for production-m5 [puppet] - 10https://gerrit.wikimedia.org/r/430524 (https://phabricator.wikimedia.org/T192092) [01:15:31] (03PS1) 10Dzahn: relforge: adjust terbium comments, rename ferm role to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/430526 (https://phabricator.wikimedia.org/T192092) [01:20:01] (03PS2) 10Dzahn: rename mw1297 to mwmaint1001, assign mw-maint role [puppet] - 10https://gerrit.wikimedia.org/r/430519 (https://phabricator.wikimedia.org/T192185) [01:23:10] (03PS1) 10Dzahn: cache::misc: switch 
noc.wm,dbtree.wm backends to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/430527 (https://phabricator.wikimedia.org/T192092) [01:26:50] (03PS2) 10Dzahn: relforge/mariadb-labtest: adjust terbium comments, rename ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/430526 (https://phabricator.wikimedia.org/T192092) [01:33:37] (03PS1) 10Dzahn: tcpircbot: add mwmaint1001 to ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/430529 (https://phabricator.wikimedia.org/T192092) [01:33:40] (03PS1) 10Dzahn: tcpircbot: remove terbium from ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/430530 (https://phabricator.wikimedia.org/T192092) [02:24:30] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, 10Elasticsearch: Alert when elasticsearch writes are frozen for too long - https://phabricator.wikimedia.org/T193605#4177176 (10EBernhardson) After https://gerrit.wikimedia.org/r/430441 it will work fairly simply. Each cluster can have the fo... [02:43:24] !log l10nupdate@tin scap sync-l10n completed (1.32.0-wmf.1) (duration: 07m 20s) [02:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:26:17] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2180.codfw.wmnet [03:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:27:16] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 906.35 seconds [03:28:04] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2181.codfw.wmnet [03:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:32:21] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2182.codfw.wmnet [03:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:14:48] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 275.36 seconds [04:37:25] (03PS1) 10Zhuyifei1999: 
maintain_kubeusers.pp: use require_package and add python3-yaml [puppet] - 10https://gerrit.wikimedia.org/r/430539 (https://phabricator.wikimedia.org/T190893) [05:22:53] !log Deploy schema change on db1060 with replication (this will generate lag on labs - s2) - T191519 T188299 T190148 [05:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:22:59] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [05:23:00] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [05:23:00] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [05:23:49] (03PS1) 10Marostegui: db-eqiad.php: Clarify that db1060 is running an later [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430540 [05:24:21] (03CR) 10Marostegui: [C: 032] s3.hosts: Add db1116:3313 [software] - 10https://gerrit.wikimedia.org/r/430410 (owner: 10Marostegui) [05:25:10] (03Merged) 10jenkins-bot: s3.hosts: Add db1116:3313 [software] - 10https://gerrit.wikimedia.org/r/430410 (owner: 10Marostegui) [05:25:33] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Clarify that db1060 is running an later [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430540 (owner: 10Marostegui) [05:27:01] (03Merged) 10jenkins-bot: db-eqiad.php: Clarify that db1060 is running an later [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430540 (owner: 10Marostegui) [05:28:41] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Clarify that db1060 is running an alter table (duration: 01m 15s) [05:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:30:09] (03CR) 10jenkins-bot: db-eqiad.php: Clarify that db1060 is running an later [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430540 (owner: 10Marostegui) [05:31:00] !log Drop empty flagged* tables from eswiki (s7) - T193678 [05:31:04] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:31:05] T193678: Drop flaggedrevs tables at eswiki - https://phabricator.wikimedia.org/T193678 [05:39:49] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4177496 (10Marostegui) db1116 is now replicating a multi-instance sanitized copy (also checked with check_private_data) of the... [05:40:29] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4177497 (10Marostegui) [05:57:01] !log reimage analytics10[39,40] to Debian Stretch [05:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:08] !log Drop mostly empty flagged* tables from metawiki (s7) - T193678 [05:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:13] T193678: Drop flaggedrevs tables at eswiki - https://phabricator.wikimedia.org/T193678 [05:59:32] !log Drop mostly empty flagged* tables from metawiki (s7) - T193390 [05:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:36] T193390: Drop flaggedrevs tables at metawiki - https://phabricator.wikimedia.org/T193390 [06:24:50] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 20 probes of 300 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [06:28:12] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/RapidSSL_SHA256_CA_-_G3.crt] [06:28:51] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/check_ssl_certfile] [06:29:52] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 6 probes of 300 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [06:44:57] (03CR) 10Urbanecm: [C: 031] "Thank you." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429385 (https://phabricator.wikimedia.org/T192726) (owner: 10MarcoAurelio) [06:51:26] (03PS2) 10Vgutierrez: lvs10[13-16] production DNS entries, all vlans [dns] - 10https://gerrit.wikimedia.org/r/430402 (https://phabricator.wikimedia.org/T184293) [06:51:58] (03CR) 10Vgutierrez: [C: 032] lvs10[13-16] production DNS entries, all vlans [dns] - 10https://gerrit.wikimedia.org/r/430402 (https://phabricator.wikimedia.org/T184293) (owner: 10Vgutierrez) [06:53:18] (03PS2) 10Muehlenhoff: Remove mwdebug1002 from debug proxies for stretch reimage [puppet] - 10https://gerrit.wikimedia.org/r/429409 [06:55:58] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4177572 (10Vgutierrez) [06:56:27] (03CR) 10Muehlenhoff: [C: 032] Remove mwdebug1002 from debug proxies for stretch reimage [puppet] - 10https://gerrit.wikimedia.org/r/429409 (owner: 10Muehlenhoff) [06:58:21] RECOVERY - puppet last run on labvirt1014 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [06:59:02] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:11:10] !log reimaging mwdebug1002 to stretch [07:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:47] (03CR) 10Vgutierrez: [C: 031] "Nice catch :D" [dns] - 10https://gerrit.wikimedia.org/r/429874 (https://phabricator.wikimedia.org/T193408) (owner: 10Dzahn) [07:26:22] !log Drop table flaggedrevs from eswikibooks - T193676 [07:26:26] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:26] T193676: Drop flaggedrevs tables at eswikibooks - https://phabricator.wikimedia.org/T193676 [07:26:52] (03PS3) 10Gilles: Reafactor varnishlog consumers [puppet] - 10https://gerrit.wikimedia.org/r/429843 (https://phabricator.wikimedia.org/T193489) [07:34:03] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2081 crashed/rebooted, probably due to hardware failure - https://phabricator.wikimedia.org/T193325#4177621 (10jcrespo) The check detected some difference, but they could be false positives, checking again. [07:34:48] (03CR) 10Gilles: "https://puppet-compiler.wmflabs.org/compiler02/11106/cp1008.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/429843 (https://phabricator.wikimedia.org/T193489) (owner: 10Gilles) [07:35:15] (03PS4) 10Gilles: Refactor varnishlog consumers [puppet] - 10https://gerrit.wikimedia.org/r/429843 (https://phabricator.wikimedia.org/T193489) [07:35:25] (03PS5) 10Gilles: Refactor varnishlog consumers [puppet] - 10https://gerrit.wikimedia.org/r/429843 (https://phabricator.wikimedia.org/T193489) [07:36:46] (03CR) 10Gilles: Refactor varnishlog consumers (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/429843 (https://phabricator.wikimedia.org/T193489) (owner: 10Gilles) [07:39:03] !log reimaging mw1256, mw1257, mw1258 (app servers) to stretch [07:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:36] (03PS1) 10Jcrespo: mariadb: Add main_tables.txt, collection of main mediawiki tables [software] - 10https://gerrit.wikimedia.org/r/430557 (https://phabricator.wikimedia.org/T104459) [07:43:00] (03CR) 10Marostegui: [C: 031] "I will add some more later - specially the ones that historically had issues" [software] - 10https://gerrit.wikimedia.org/r/430557 (https://phabricator.wikimedia.org/T104459) (owner: 10Jcrespo) [07:43:13] (03CR) 10Jcrespo: [C: 032] mariadb: Add main_tables.txt, collection of main mediawiki tables 
[software] - 10https://gerrit.wikimedia.org/r/430557 (https://phabricator.wikimedia.org/T104459) (owner: 10Jcrespo) [07:54:54] !log reimaging mw1284, mw1289, mw1290 (API servers) to stretch [07:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:28] (03PS1) 10Marostegui: main_tables: Add 3 more tables [software] - 10https://gerrit.wikimedia.org/r/430559 (https://phabricator.wikimedia.org/T104459) [08:00:59] (03CR) 10Filippo Giunchedi: role::prometheus::analytics: rename cassandra metrics/labels (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/430399 (https://phabricator.wikimedia.org/T193017) (owner: 10Elukey) [08:06:16] (03CR) 10Elukey: role::prometheus::analytics: rename cassandra metrics/labels (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/430399 (https://phabricator.wikimedia.org/T193017) (owner: 10Elukey) [08:06:25] (03CR) 10Volans: "Thanks for the fixes! I've left a comments and a couple of nitpicks inline, the refactor looks much nicer now." 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/429843 (https://phabricator.wikimedia.org/T193489) (owner: 10Gilles) [08:08:02] !log eqiad-prod: more weight to ms-be104[0-3] - T190081 [08:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:06] T190081: rack/setup/install ms-be104[0-3].eqiad.wmnet - https://phabricator.wikimedia.org/T190081 [08:08:24] (03CR) 10Elukey: role::prometheus::analytics: rename cassandra metrics/labels (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/430399 (https://phabricator.wikimedia.org/T193017) (owner: 10Elukey) [08:12:58] (03PS3) 10Elukey: role::prometheus::analytics: rename cassandra metrics/labels [puppet] - 10https://gerrit.wikimedia.org/r/430399 (https://phabricator.wikimedia.org/T193017) [08:13:11] !log cp-misc: upgrade varnish to 5.1.3-1wm8 T192368 [08:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:15] T192368: Unconditional return(deliver) in vcl_hit - https://phabricator.wikimedia.org/T192368 [08:13:32] (03CR) 10Filippo Giunchedi: role::prometheus::analytics: rename cassandra metrics/labels (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/430399 (https://phabricator.wikimedia.org/T193017) (owner: 10Elukey) [08:17:47] !log mobrovac@tin Started deploy [cpjobqueue/deploy@5c1dcb9]: Bug fix: Resubscribe to the proper list of topics on metadata change [08:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:41] !log mobrovac@tin Finished deploy [cpjobqueue/deploy@5c1dcb9]: Bug fix: Resubscribe to the proper list of topics on metadata change (duration: 00m 54s) [08:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:46] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/430399 (https://phabricator.wikimedia.org/T193017) (owner: 10Elukey) [08:20:42] * addshore goes to deploy 2 small things to WikimediaEvents [08:22:11] PROBLEM - Ubuntu mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/ubuntu is over 14 hours old. [08:26:21] !log mobrovac@tin Started deploy [changeprop/deploy@7e86531]: Bug fix: Resubscribe to the proper list of topics on metadata change [08:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:55] (03CR) 10Jcrespo: "Ok with the addition, but maybe main_tables.txt should be renamed to something else, change_tag and tag_summary are not part of core, I th" [software] - 10https://gerrit.wikimedia.org/r/430559 (https://phabricator.wikimedia.org/T104459) (owner: 10Marostegui) [08:27:33] !log mobrovac@tin Finished deploy [changeprop/deploy@7e86531]: Bug fix: Resubscribe to the proper list of topics on metadata change (duration: 01m 12s) [08:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:41] (03CR) 10Elukey: [C: 032] role::prometheus::analytics: rename cassandra metrics/labels [puppet] - 10https://gerrit.wikimedia.org/r/430399 (https://phabricator.wikimedia.org/T193017) (owner: 10Elukey) [08:29:19] !log addshore@tin Synchronized php-1.32.0-wmf.1/extensions/WikimediaEvents/WikimediaEventsHooks.php: T191500 [[gerrit:430379|Update campaign prefix for onBeforeInitializeWMDECampaign hook]] (duration: 01m 17s) [08:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:24] T191500: deploy patch & logging for tracking user registrations - https://phabricator.wikimedia.org/T191500 [08:29:44] (03CR) 10Marostegui: "> Ok with the addition, but maybe main_tables.txt should be renamed" [software] - 10https://gerrit.wikimedia.org/r/430559 (https://phabricator.wikimedia.org/T104459) (owner: 10Marostegui) [08:30:45] (03CR) 10Filippo Giunchedi: "Puppet is failing on graphite 
machines trying to remove coal user which is in use by coal itself, I'm assuming it is safe to restart coal " [puppet] - 10https://gerrit.wikimedia.org/r/429252 (https://phabricator.wikimedia.org/T186774) (owner: 10Imarlier) [08:30:59] !log addshore@tin Synchronized php-1.32.0-wmf.2/extensions/WikimediaEvents/WikimediaEventsHooks.php: T191500 [[gerrit:430380|Update campaign prefix for onBeforeInitializeWMDECampaign hook]] (duration: 01m 16s) [08:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:11] PROBLEM - MD RAID on mw1256 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:32:11] PROBLEM - Check size of conntrack table on mw1256 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:32:11] PROBLEM - Check size of conntrack table on mw1257 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:32:11] PROBLEM - MD RAID on mw1257 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:32:12] PROBLEM - Check size of conntrack table on mw1258 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:32:12] PROBLEM - MD RAID on mw1258 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[08:32:54] ^ silencing [08:33:49] !log installing Java security updates on wdqs* [08:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:08] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for jenkins on release servers [puppet] - 10https://gerrit.wikimedia.org/r/430562 (https://phabricator.wikimedia.org/T135991) [08:43:46] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for jenkins on release servers [puppet] - 10https://gerrit.wikimedia.org/r/430562 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:45:03] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for jenkins on release servers [puppet] - 10https://gerrit.wikimedia.org/r/430562 (https://phabricator.wikimedia.org/T135991) [08:47:03] (03CR) 10Gilles: Refactor varnishlog consumers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/429843 (https://phabricator.wikimedia.org/T193489) (owner: 10Gilles) [08:48:37] (03PS6) 10Gilles: Refactor varnishlog consumers [puppet] - 10https://gerrit.wikimedia.org/r/429843 (https://phabricator.wikimedia.org/T193489) [08:48:43] (03CR) 10Gilles: Refactor varnishlog consumers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/429843 (https://phabricator.wikimedia.org/T193489) (owner: 10Gilles) [08:54:07] gilles: sorry I'm not sure I get your answer in the above CR. If it's ok to use the class without overriding any method, then it's not a base/abstract one but a concrete one. What am I missing? 
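The question above, whether a base class that works without any overrides should be declared abstract, comes down to what Python's `abc` module actually enforces. Below is a minimal sketch with hypothetical class names (`PlainBase`, `MetaOnlyBase`, `StrictBase`, `Printer` are illustrations, not the code under review). It suggests that `metaclass=ABCMeta` on its own changes nothing about instantiation. Only members decorated with `@abstractmethod` make the class refuse to instantiate.

```python
from abc import ABCMeta, abstractmethod


class PlainBase:
    """Ordinary base class: usable as-is, its hook defaults to a no-op."""

    def handle_line(self, line):
        # Subclasses may override this; the default discards the input.
        return None


class MetaOnlyBase(metaclass=ABCMeta):
    """Uses ABCMeta but decorates nothing, so it can still be instantiated."""

    def handle_line(self, line):
        return None


class StrictBase(metaclass=ABCMeta):
    """Has one abstract method, so direct instantiation raises TypeError."""

    @abstractmethod
    def handle_line(self, line):
        ...


class Printer(StrictBase):
    """Concrete subclass: overriding the abstract method makes it usable."""

    def handle_line(self, line):
        return line.upper()


def can_instantiate(cls):
    """Return True if cls() succeeds, False if it raises TypeError."""
    try:
        cls()
        return True
    except TypeError:
        return False
```

Under these assumptions, `can_instantiate(MetaOnlyBase)` behaves the same as `can_instantiate(PlainBase)`, which is why adding the metaclass without any decorated member would be purely cosmetic.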
[08:55:09] you will have to override at least one method for it to do something useful, but none of the methods on their own has a mandatory override [08:55:41] so NotImplementedError doesn't make sense for any of them [08:56:17] if you run the class as-is (which still requires a different script to "drive" it) it's going to consume the log output but do nothing with it [08:56:43] ok, then maybe metaclass=ABCMeta is the way to go in this case [08:57:34] from abc import ABCMeta; class BaseVarnishLogConsumer(object, metaclass=ABCMeta):... [08:57:41] ok [08:57:48] but I need to check one thing [08:58:03] don't remember if one of the methods must be decorated [09:00:01] isn't the point of setting ABCMeta to leverage it on some methods? [09:00:14] just adding it for the sake of it doesn't seem like it would do anything functional [09:00:24] without any decorated method or property [09:01:11] that's what I was worried about [09:01:17] give me a minute :) [09:03:08] (03PS1) 10Elukey: role::prometheus::analytics: fix cassandra relabel config [puppet] - 10https://gerrit.wikimedia.org/r/430563 (https://phabricator.wikimedia.org/T193017) [09:04:21] (03PS2) 10Elukey: role::prometheus::analytics: fix cassandra relabel config [puppet] - 10https://gerrit.wikimedia.org/r/430563 (https://phabricator.wikimedia.org/T193017) [09:05:06] (03CR) 10Filippo Giunchedi: [C: 031] role::prometheus::analytics: fix cassandra relabel config [puppet] - 10https://gerrit.wikimedia.org/r/430563 (https://phabricator.wikimedia.org/T193017) (owner: 10Elukey) [09:05:47] gilles: in short, it's ok as it is, we'll rely on the name BaseFoo to indicate that it's 'abstract' and that it doesn't do anything if you don't override something [09:06:03] cool, that's what I assumed [09:06:04] (03CR) 10Elukey: [C: 032] role::prometheus::analytics: fix cassandra relabel config [puppet] - 10https://gerrit.wikimedia.org/r/430563 (https://phabricator.wikimedia.org/T193017) (owner: 10Elukey) [09:06:10] as the only thing that makes 
sense to force overriding is the 'description' class property, but there is no easy way to do it [09:06:23] without additional boilerplate that looks an overkill to me [09:06:53] !log reimaging mw1300 (job runner) to stretch [09:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:04] gilles: so ok for me :) [09:08:31] (03CR) 10Jcrespo: "> > Ok with the addition, but maybe main_tables.txt should be renamed" [software] - 10https://gerrit.wikimedia.org/r/430559 (https://phabricator.wikimedia.org/T104459) (owner: 10Marostegui) [09:09:42] !log rolling restart of elasticsearch completed - T191543 / T191236 [09:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:47] T191543: Deploy updated search/extra plugin and search/extra-analysis-slovak plugin with Slovak Stemmer - https://phabricator.wikimedia.org/T191543 [09:09:47] T191236: Resolve elasticsearch latency alerts - https://phabricator.wikimedia.org/T191236 [09:11:07] (03CR) 10Amire80: [C: 031] Enable ULS webfonts by default at Bengali Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430360 (https://phabricator.wikimedia.org/T193367) (owner: 10Bodhisattwa) [09:13:12] (03PS2) 10Marostegui: tables_to_check: Add more tables,rename the file [software] - 10https://gerrit.wikimedia.org/r/430559 (https://phabricator.wikimedia.org/T104459) [09:18:42] (03CR) 10Jcrespo: [C: 031] tables_to_check: Add more tables,rename the file [software] - 10https://gerrit.wikimedia.org/r/430559 (https://phabricator.wikimedia.org/T104459) (owner: 10Marostegui) [09:19:15] (03CR) 10Marostegui: [C: 032] tables_to_check: Add more tables,rename the file [software] - 10https://gerrit.wikimedia.org/r/430559 (https://phabricator.wikimedia.org/T104459) (owner: 10Marostegui) [09:20:04] (03Merged) 10jenkins-bot: tables_to_check: Add more tables,rename the file [software] - 10https://gerrit.wikimedia.org/r/430559 (https://phabricator.wikimedia.org/T104459) (owner: 
10Marostegui) [09:29:56] !log reimaging mw1227, mw1231, mw1232 (API servers) to stretch [09:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:08] (03PS1) 10Volans: wmf-auto-reimage: log unit masking [puppet] - 10https://gerrit.wikimedia.org/r/430567 [09:36:26] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/430567 (owner: 10Volans) [09:39:18] (03PS2) 10Volans: wmf-auto-reimage: log unit masking [puppet] - 10https://gerrit.wikimedia.org/r/430567 [09:41:41] (03CR) 10Volans: [C: 032] wmf-auto-reimage: log unit masking [puppet] - 10https://gerrit.wikimedia.org/r/430567 (owner: 10Volans) [09:42:00] (03PS1) 10Muehlenhoff: Revert "Remove mwdebug1002 from debug proxies for stretch reimage" [puppet] - 10https://gerrit.wikimedia.org/r/430568 [09:43:09] (03PS2) 10Muehlenhoff: Revert "Remove mwdebug1002 from debug proxies for stretch reimage" [puppet] - 10https://gerrit.wikimedia.org/r/430568 [09:43:48] (03CR) 10Muehlenhoff: [C: 032] Revert "Remove mwdebug1002 from debug proxies for stretch reimage" [puppet] - 10https://gerrit.wikimedia.org/r/430568 (owner: 10Muehlenhoff) [10:13:40] marostegui: jynus: i see quite a number of "Wikimedia\Rdbms\DBReplicationWaitError: Could not wait for replica DBs to catch up to db1052" in the last 30 minutes [10:13:43] id db1052 ok? 
[10:13:45] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:13:45] s/id/is/ [10:13:57] mobrovac: checking [10:15:17] mobrovac: that is a software bug [10:15:26] PROBLEM - Nginx local proxy to apache on mw1231 is CRITICAL: connect to address 10.64.48.66 and port 443: Connection refused [10:15:26] PROBLEM - MD RAID on mw1227 is CRITICAL: Return code of 255 is out of bounds [10:15:26] PROBLEM - Check systemd state on mw1227 is CRITICAL: Return code of 255 is out of bounds [10:15:26] PROBLEM - Check whether ferm is active by checking the default input chain on mw1231 is CRITICAL: Return code of 255 is out of bounds [10:15:26] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1232 is CRITICAL: Return code of 255 is out of bounds [10:15:28] there is no lag [10:15:42] Wikimedia\Rdbms\LoadBalancer::doWait: Timed out waiting on db1067 pos 0-171970637-5478094441,180359172-180359172-49702203,171970637-171970637-1557309208 [10:15:53] yep, I checked the graphs and there is no lag [10:15:54] that is something we reported [10:16:10] and suggested a fix, apparently it didn't get fixed [10:17:05] PROBLEM - Nginx local proxy to apache on mw1232 is CRITICAL: connect to address 10.64.48.67 and port 443: Connection refused [10:17:05] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1227 is CRITICAL: Return code of 255 is out of bounds [10:17:06] PROBLEM - DPKG on mw1231 is CRITICAL: Return code of 255 is out of bounds [10:17:06] PROBLEM - Check whether ferm is active by checking the default input chain on mw1232 is CRITICAL: Return code of 255 is out of bounds [10:17:46] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:18:41] oh right, right jynus, there are some patches up for that, iirc they haven't been merged yet [10:18:45] PROBLEM - Nginx local proxy to apache on mw1227 is CRITICAL: connect to address 10.64.48.62 and port 443: Connection refused [10:18:45] PROBLEM - Check whether ferm is active by checking the default input chain on mw1227 is CRITICAL: Return code of 255 is out of bounds [10:18:45] PROBLEM - DPKG on mw1232 is CRITICAL: Return code of 255 is out of bounds [10:18:45] PROBLEM - configured eth on mw1231 is CRITICAL: Return code of 255 is out of bounds [10:33:06] RECOVERY - Check size of conntrack table on mw1257 is OK: OK: nf_conntrack is 0 % full [10:33:06] RECOVERY - MD RAID on mw1257 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [10:33:06] RECOVERY - Check size of conntrack table on mw1256 is OK: OK: nf_conntrack is 0 % full [10:33:06] RECOVERY - MD RAID on mw1256 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [10:34:15] RECOVERY - Check size of conntrack table on mw1258 is OK: OK: nf_conntrack is 0 % full [10:34:15] RECOVERY - MD RAID on mw1258 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [10:38:49] https://meta.wikimedia.org/wiki/Special:CentralAuth/Pixelight hmmmmmmmm weird, was this in the last stuck list? [10:41:13] 10Operations, 10monitoring, 10User-fgiunchedi: Better organization for ops grafana dashboards - https://phabricator.wikimedia.org/T178690#4178042 (10fgiunchedi) Thanks for the feedback! >>! In T178690#4171433, @akosiaris wrote: >>>! In T178690#4168673, @Volans wrote: >> As discussed in the monitoring meetin... 
[10:51:24] !log reimaging mw1319, mw1325, mw1326 (app servers) to stretch [10:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:41] 10Operations, 10HHVM, 10Patch-For-Review, 10User-Elukey: Upgrade mw* servers to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T174431#4178052 (10MoritzMuehlenhoff) [10:58:04] 10Operations, 10HHVM, 10Patch-For-Review, 10User-Elukey: Upgrade mw* servers to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T174431#3561778 (10MoritzMuehlenhoff) All mwdebug servers are now running stretch. [11:05:58] 10Operations, 10Citoid, 10VisualEditor, 10Services (watching), 10User-mobrovac: Some requests for DOIs are failing or very slow; if we have a DOI and the request is taking too long, just use CrossRef data instead. - https://phabricator.wikimedia.org/T165105#4178062 (10Mvolz) [11:57:44] !log reimaging mw1301, mw1302 (job runners) to stretch [11:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:48] (03PS1) 10Arturo Borrero Gonzalez: openstack: nova.conf: rearange config file [puppet] - 10https://gerrit.wikimedia.org/r/430581 (https://phabricator.wikimedia.org/T193657) [12:18:57] it took `git review` 3m to run :-/ [12:19:06] for a simple patch upload [12:19:59] PROBLEM - Apache HTTP on mw1231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:20:09] PROBLEM - HHVM rendering on mw1231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:20:15] jouncebot: next [12:20:15] In 0 hour(s) and 39 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180503T1300) [12:20:19] PROBLEM - Apache HTTP on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:20:20] PROBLEM - Apache HTTP on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:20:49] PROBLEM - HHVM rendering on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:20:49] 
PROBLEM - HHVM rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:22:19] PROBLEM - mediawiki-installation DSH group on mw1231 is CRITICAL: Host mw1231 is not in mediawiki-installation dsh group [12:23:39] RECOVERY - Apache HTTP on mw1232 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 7.081 second response time [12:23:39] PROBLEM - HHVM processes on mw1232 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:23:39] PROBLEM - MD RAID on mw1232 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:23:39] PROBLEM - Check systemd state on mw1232 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:23:40] RECOVERY - DPKG on mw1232 is OK: All packages OK [12:23:49] PROBLEM - nutcracker process on mw1232 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nutcracker), command name nutcracker [12:23:49] PROBLEM - configured eth on mw1231 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:23:49] PROBLEM - DPKG on mw1231 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:23:59] PROBLEM - Check systemd state on mw1231 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:23:59] PROBLEM - mediawiki-installation DSH group on mw1232 is CRITICAL: Host mw1232 is not in mediawiki-installation dsh group [12:24:09] PROBLEM - Check size of conntrack table on mw1227 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:24:10] PROBLEM - HHVM processes on mw1227 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:24:10] PROBLEM - MD RAID on mw1227 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[12:24:10] PROBLEM - nutcracker port on mw1232 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [12:24:10] PROBLEM - nutcracker process on mw1231 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nutcracker), command name nutcracker [12:24:17] ^ silenced [12:24:29] RECOVERY - Check whether ferm is active by checking the default input chain on mw1232 is OK: OK ferm input default policy is set [12:24:30] RECOVERY - HHVM processes on mw1232 is OK: PROCS OK: 6 processes with command name hhvm [12:24:30] RECOVERY - MD RAID on mw1232 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [12:24:40] RECOVERY - Check whether ferm is active by checking the default input chain on mw1231 is OK: OK ferm input default policy is set [12:24:49] RECOVERY - configured eth on mw1231 is OK: OK - interfaces up [12:24:49] RECOVERY - Check whether ferm is active by checking the default input chain on mw1227 is OK: OK ferm input default policy is set [12:24:50] RECOVERY - DPKG on mw1231 is OK: All packages OK [12:25:10] RECOVERY - Check size of conntrack table on mw1227 is OK: OK: nf_conntrack is 0 % full [12:25:19] RECOVERY - HHVM processes on mw1227 is OK: PROCS OK: 6 processes with command name hhvm [12:25:20] RECOVERY - MD RAID on mw1227 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [12:26:46] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1962 bytes in 0.109 second response time [12:26:52] !log installing openjdk-8 security updates on stretch-based Hadoop workers [12:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:16] RECOVERY - nutcracker process on mw1232 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [12:29:27] RECOVERY - Nginx local proxy to apache on mw1232 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 5.410 second response time [12:29:36] RECOVERY - 
nutcracker port on mw1232 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [12:29:57] RECOVERY - Check systemd state on mw1232 is OK: OK - running: The system is fully operational [12:30:17] RECOVERY - Check systemd state on mw1231 is OK: OK - running: The system is fully operational [12:30:17] RECOVERY - Nginx local proxy to apache on mw1231 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.498 second response time [12:30:36] RECOVERY - nutcracker process on mw1231 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [12:30:37] RECOVERY - Apache HTTP on mw1231 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.076 second response time [12:30:47] RECOVERY - HHVM rendering on mw1231 is OK: HTTP OK: HTTP/1.1 200 OK - 75393 bytes in 8.255 second response time [12:31:06] RECOVERY - Check systemd state on mw1227 is OK: OK - running: The system is fully operational [12:31:06] RECOVERY - Nginx local proxy to apache on mw1227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.413 second response time [12:31:24] (03PS1) 10Marostegui: Revert "db-eqiad.php: Clarify that db1060 is running an later" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430584 [12:31:26] RECOVERY - HHVM rendering on mw1227 is OK: HTTP OK: HTTP/1.1 200 OK - 75393 bytes in 5.595 second response time [12:31:36] RECOVERY - Apache HTTP on mw1227 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.057 second response time [12:32:16] RECOVERY - HHVM rendering on mw1232 is OK: HTTP OK: HTTP/1.1 200 OK - 75391 bytes in 0.146 second response time [12:32:17] PROBLEM - Nginx local proxy to apache on mw1301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:32:53] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Clarify that db1060 is running an later" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430584 (owner: 10Marostegui) [12:33:47] PROBLEM - mediawiki-installation DSH group on mw1302 is 
CRITICAL: Host mw1302 is not in mediawiki-installation dsh group [12:34:18] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Clarify that db1060 is running an later" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430584 (owner: 10Marostegui) [12:35:49] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Revert: Clarify that db1060 is running an alter table (duration: 01m 17s) [12:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:56] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1951 bytes in 0.104 second response time [12:37:57] PROBLEM - HHVM jobrunner on mw1302 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:38:36] (03PS2) 10Arturo Borrero Gonzalez: openstack: nova.conf: rearange config file [puppet] - 10https://gerrit.wikimedia.org/r/430581 (https://phabricator.wikimedia.org/T193657) [12:38:47] PROBLEM - mediawiki-installation DSH group on mw1301 is CRITICAL: Host mw1301 is not in mediawiki-installation dsh group [12:39:42] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Clarify that db1060 is running an later" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430584 (owner: 10Marostegui) [12:42:57] PROBLEM - HHVM jobrunner on mw1301 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:44:07] RECOVERY - Ubuntu mirror in sync with upstream on sodium is OK: /srv/mirrors/ubuntu is over 0 hours old. [12:45:36] PROBLEM - Nginx local proxy to apache on mw1302 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:45:46] RECOVERY - Check the NTP synchronisation status of timesyncd on mw1232 is OK: OK: synced at Thu 2018-05-03 12:45:38 UTC. [12:47:26] RECOVERY - Check the NTP synchronisation status of timesyncd on mw1227 is OK: OK: synced at Thu 2018-05-03 12:47:17 UTC. 
[12:47:50] (03PS2) 10Hoo man: Wikidata entity dumps: Move generic parts into functions [puppet] - 10https://gerrit.wikimedia.org/r/430395 (https://phabricator.wikimedia.org/T190513) [12:47:52] (03PS1) 10Hoo man: Create RDF dumps in batches, not all at once [puppet] - 10https://gerrit.wikimedia.org/r/430585 (https://phabricator.wikimedia.org/T190513) [12:49:16] (03PS2) 10Hoo man: Create RDF dumps in batches, not all at once [puppet] - 10https://gerrit.wikimedia.org/r/430585 (https://phabricator.wikimedia.org/T190513) [12:51:39] (03PS1) 10Muehlenhoff: Update Cumin aliases for Kafka [puppet] - 10https://gerrit.wikimedia.org/r/430586 [12:53:46] (03CR) 10Muehlenhoff: [C: 032] Update Cumin aliases for Kafka [puppet] - 10https://gerrit.wikimedia.org/r/430586 (owner: 10Muehlenhoff) [12:55:04] (03PS3) 10ArielGlenn: Wikidata entity dumps: Move generic parts into functions [puppet] - 10https://gerrit.wikimedia.org/r/430395 (https://phabricator.wikimedia.org/T190513) (owner: 10Hoo man) [12:55:42] (03CR) 10Hoo man: "Tested with testwikidata" [puppet] - 10https://gerrit.wikimedia.org/r/430585 (https://phabricator.wikimedia.org/T190513) (owner: 10Hoo man) [12:55:52] (03CR) 10ArielGlenn: [C: 032] Wikidata entity dumps: Move generic parts into functions [puppet] - 10https://gerrit.wikimedia.org/r/430395 (https://phabricator.wikimedia.org/T190513) (owner: 10Hoo man) [12:59:09] RECOVERY - HHVM jobrunner on mw1301 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.006 second response time [12:59:19] RECOVERY - HHVM jobrunner on mw1302 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180503T1300). [13:00:04] No GERRIT patches in the queue for this window AFAICS. 
[13:00:12] =o [13:00:31] If there are no changes scheduled, maybe I'll schedule some :D [13:01:30] RECOVERY - Nginx local proxy to apache on mw1302 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.023 second response time [13:01:50] RECOVERY - Nginx local proxy to apache on mw1301 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.009 second response time [13:02:00] (03PS4) 10Addshore: Switch to extension.json for PropertySuggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395486 [13:02:58] !log Deploy schema change on s4 codfw master db2051 with replication (this will generate lag on codfw) - T191519 T188299 T190148 [13:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:06] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [13:03:06] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [13:03:06] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [13:03:30] (03PS4) 10Addshore: Switch to extension.json for WikibaseQuality extensions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395487 [13:03:34] (03CR) 10jerkins-bot: [V: 04-1] Switch to extension.json for WikibaseQuality extensions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395487 (owner: 10Addshore) [13:03:42] (03PS4) 10Addshore: Switch to extension.json for Wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395488 [13:03:55] (03CR) 10jerkins-bot: [V: 04-1] Switch to extension.json for Wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395488 (owner: 10Addshore) [13:04:01] (03PS5) 10Addshore: Switch to extension.json for WikibaseQuality extensions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395487 [13:04:06] (03PS5) 10Addshore: Switch to extension.json for Wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395488 [13:04:40] PROBLEM - 
MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [13:04:48] !log rolling restart of wdqs for jvm upgrade [13:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:03] ^ that's harmless and fallout of the job runner reimage, will recover in a bit [13:05:07] (03Abandoned) 10Addshore: Switch to extension.json for WikibaseQuality extensions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395487 (owner: 10Addshore) [13:05:17] (03PS6) 10Addshore: Switch to extension.json for Wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395488 [13:05:23] !log stop and mask coal service on graphite hosts - T186774 [13:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:27] T186774: Migrate webperf from hafnium to webperf1001 - https://phabricator.wikimedia.org/T186774 [13:05:43] marlier: ^ [13:06:06] moritzm: ^^^... wdqs restart in progress [13:06:40] gehel: ack, thanks [13:06:55] (03CR) 10Addshore: [C: 032] Switch to extension.json for PropertySuggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395486 (owner: 10Addshore) [13:08:19] RECOVERY - puppet last run on graphite2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:08:22] (03Merged) 10jenkins-bot: Switch to extension.json for PropertySuggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395486 (owner: 10Addshore) [13:08:49] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:09:47] (03CR) 10jenkins-bot: Switch to extension.json for PropertySuggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395486 (owner: 10Addshore) [13:09:50] godog: thank you! [13:10:27] marlier: you are welcome, thanks for taking care of it! 
[13:10:50] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [13:11:35] (03PS3) 10ArielGlenn: Create RDF dumps in batches, not all at once [puppet] - 10https://gerrit.wikimedia.org/r/430585 (https://phabricator.wikimedia.org/T190513) (owner: 10Hoo man) [13:12:20] (03CR) 10ArielGlenn: [C: 032] Create RDF dumps in batches, not all at once [puppet] - 10https://gerrit.wikimedia.org/r/430585 (https://phabricator.wikimedia.org/T190513) (owner: 10Hoo man) [13:13:58] !log addshore@tin Synchronized wmf-config/: [[gerrit:395486|Switch to extension.json for PropertySuggester]] (duration: 01m 35s) [13:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:54] (03CR) 10Addshore: [C: 032] Switch to extension.json for Wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395488 (owner: 10Addshore) [13:16:53] (03Merged) 10jenkins-bot: Switch to extension.json for Wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395488 (owner: 10Addshore) [13:19:59] !log addshore@tin Synchronized wmf-config/: [[gerrit:395488|Switch to extension.json for Wikidata.org]] (duration: 01m 19s) [13:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:13] Amir1: woo, I finally got those 2 patches in :P [13:20:31] \o/ [13:20:36] That's fantastic! 
[13:20:52] * addshore is going to do the lock manager stuff in the run up to the lexeme release [13:21:05] (03PS1) 10Vgutierrez: varnishtlsinspector: send TLS connection details to logstash [puppet] - 10https://gerrit.wikimedia.org/r/430593 (https://phabricator.wikimedia.org/T193376) [13:21:14] !log swat done [13:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:47] (03Abandoned) 10Vgutierrez: varnishtlsinspector: send TLS connection details to logstash [puppet] - 10https://gerrit.wikimedia.org/r/429810 (https://phabricator.wikimedia.org/T193376) (owner: 10Vgutierrez) [13:21:50] (03CR) 10jerkins-bot: [V: 04-1] varnishtlsinspector: send TLS connection details to logstash [puppet] - 10https://gerrit.wikimedia.org/r/430593 (https://phabricator.wikimedia.org/T193376) (owner: 10Vgutierrez) [13:22:50] 10Operations, 10Wikidata: Move dispatching of wikidata to a dedicated node - https://phabricator.wikimedia.org/T193733#4178226 (10Ladsgroup) [13:23:16] (03PS2) 10Vgutierrez: varnishtlsinspector: send TLS connection details to logstash [puppet] - 10https://gerrit.wikimedia.org/r/430593 (https://phabricator.wikimedia.org/T193376) [13:27:40] 10Operations, 10Wikidata: Move dispatching of wikidata to a dedicated node - https://phabricator.wikimedia.org/T193733#4178239 (10hoo) What makes you think that Terbium is to unstable for this? Terbium seems to always have more than enough spare resources, and the dispatch problems we saw recently seem to corr... 
[13:28:37] (03CR) 10Volans: "Missed comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/429843 (https://phabricator.wikimedia.org/T193489) (owner: 10Gilles) [13:29:19] (03CR) 10Filippo Giunchedi: [C: 031] Add .gitreview file [debs/python-logstash] - 10https://gerrit.wikimedia.org/r/430306 (owner: 10Gilles) [13:33:41] (03CR) 10Volans: "A couple of minor comments inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/430593 (https://phabricator.wikimedia.org/T193376) (owner: 10Vgutierrez) [13:35:54] (03PS3) 10Vgutierrez: varnishtlsinspector: send TLS connection details to logstash [puppet] - 10https://gerrit.wikimedia.org/r/430593 (https://phabricator.wikimedia.org/T193376) [13:36:45] (03CR) 10Vgutierrez: varnishtlsinspector: send TLS connection details to logstash (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/430593 (https://phabricator.wikimedia.org/T193376) (owner: 10Vgutierrez) [13:37:24] volans: thanks! [13:38:21] vgutierrez: don't pick anything I write as it is... I was too quick to write, it was an if ...: return ofc, not pass :D [13:38:25] sorry about that [13:38:31] volans: hahaha right [13:39:09] (03PS4) 10Vgutierrez: varnishtlsinspector: send TLS connection details to logstash [puppet] - 10https://gerrit.wikimedia.org/r/430593 (https://phabricator.wikimedia.org/T193376) [13:39:19] fixed :) [13:39:28] thanks, and sorry :) [13:39:40] it's my fault as well [13:50:35] (03CR) 10jenkins-bot: Switch to extension.json for Wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/395488 (owner: 10Addshore) [13:59:33] (03CR) 10Volans: [C: 031] "LGTM for the python/puppet part. I have no context to judge the varnish log part." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/430593 (https://phabricator.wikimedia.org/T193376) (owner: 10Vgutierrez) [14:00:32] ACKNOWLEDGEMENT - Device not healthy -SMART- on db1064 is CRITICAL: cluster=mysql device={megaraid,2,megaraid,6} instance=db1064:9100 job=node site=eqiad Marostegui we will let them fail and once failed, replace them. These hosts are slaves https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1064&var-datasource=eqiad%2520prometheus%252Fops [14:00:32] ACKNOWLEDGEMENT - Device not healthy -SMART- on db1066 is CRITICAL: cluster=mysql device=megaraid,6 instance=db1066:9100 job=node site=eqiad Marostegui we will let them fail and once failed, replace them. These hosts are slaves https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1066&var-datasource=eqiad%2520prometheus%252Fops [14:00:47] !log cp3030 (text): upgrade varnish to 5.1.3-1wm8 T192368 [14:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:52] T192368: Unconditional return(deliver) in vcl_hit - https://phabricator.wikimedia.org/T192368 [14:01:16] (03PS7) 10Gilles: Refactor varnishlog consumers [puppet] - 10https://gerrit.wikimedia.org/r/429843 (https://phabricator.wikimedia.org/T193489) [14:01:32] (03CR) 10Gilles: Refactor varnishlog consumers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/429843 (https://phabricator.wikimedia.org/T193489) (owner: 10Gilles) [14:01:48] (03CR) 10jerkins-bot: [V: 04-1] Refactor varnishlog consumers [puppet] - 10https://gerrit.wikimedia.org/r/429843 (https://phabricator.wikimedia.org/T193489) (owner: 10Gilles) [14:02:11] gilles: trailing spaces on line 37 [14:02:38] I need to reinstall a plugin that takes care of that automatically [14:02:44] eheheh :) [14:03:49] (03PS8) 10Gilles: Refactor varnishlog consumers [puppet] - 10https://gerrit.wikimedia.org/r/429843 (https://phabricator.wikimedia.org/T193489) [14:04:08] 10Operations, 10Analytics: dbstore1002 disk 
5 not healthy - https://phabricator.wikimedia.org/T193738#4178362 (10Marostegui) [14:05:01] ACKNOWLEDGEMENT - Device not healthy -SMART- on dbstore1002 is CRITICAL: cluster=mysql device=megaraid,5 instance=dbstore1002:9100 job=node site=eqiad Marostegui https://phabricator.wikimedia.org/T193738 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dbstore1002&var-datasource=eqiad%2520prometheus%252Fops [14:05:21] (03CR) 10Volans: [C: 031] "LGTM for the generic python part and puppet. I'll leave it to the other reviewers for the varnish logs / logstash details." [puppet] - 10https://gerrit.wikimedia.org/r/429843 (https://phabricator.wikimedia.org/T193489) (owner: 10Gilles) [14:05:46] (03PS1) 10Jcrespo: maridb: Depool db1056 for decommissioning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430596 (https://phabricator.wikimedia.org/T193736) [14:08:57] (03CR) 10Jcrespo: [C: 032] maridb: Depool db1056 for decommissioning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430596 (https://phabricator.wikimedia.org/T193736) (owner: 10Jcrespo) [14:10:24] (03Merged) 10jenkins-bot: maridb: Depool db1056 for decommissioning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430596 (https://phabricator.wikimedia.org/T193736) (owner: 10Jcrespo) [14:13:06] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1056 (duration: 01m 17s) [14:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:51] (03PS1) 10Marostegui: db1060: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/430598 (https://phabricator.wikimedia.org/T193732) [14:27:30] ping marlier: https://gerrit.wikimedia.org/r/#/c/429252/ see godog's comment [14:28:27] (03CR) 10Marostegui: [C: 032] db1060: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/430598 (https://phabricator.wikimedia.org/T193732) (owner: 10Marostegui) [14:29:41] RECOVERY - mediawiki-installation DSH group on mw1302 is OK: OK [14:29:48] ottomata: yup 
we've worked on it earlier, all good on graphite hosts now, thanks! [14:31:14] (03CR) 10Ema: [C: 031] "Looks great!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/430593 (https://phabricator.wikimedia.org/T193376) (owner: 10Vgutierrez) [14:33:04] oh ok great [14:33:08] (03PS5) 10Vgutierrez: varnishtlsinspector: send TLS connection details to logstash [puppet] - 10https://gerrit.wikimedia.org/r/430593 (https://phabricator.wikimedia.org/T193376) [14:33:23] (03PS9) 10Ema: Refactor varnishlog consumers [puppet] - 10https://gerrit.wikimedia.org/r/429843 (https://phabricator.wikimedia.org/T193489) (owner: 10Gilles) [14:34:04] (03CR) 10Ema: [C: 032] Refactor varnishlog consumers [puppet] - 10https://gerrit.wikimedia.org/r/429843 (https://phabricator.wikimedia.org/T193489) (owner: 10Gilles) [14:34:55] (03CR) 10Arturo Borrero Gonzalez: [C: 032] ruby: install libmysqlclient-dev package in the base image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/429758 (https://phabricator.wikimedia.org/T192566) (owner: 10Arturo Borrero Gonzalez) [14:35:38] (03PS6) 10Vgutierrez: varnishtlsinspector: send TLS connection details to logstash [puppet] - 10https://gerrit.wikimedia.org/r/430593 (https://phabricator.wikimedia.org/T193376) [14:36:16] (03CR) 10jenkins-bot: maridb: Depool db1056 for decommissioning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430596 (https://phabricator.wikimedia.org/T193736) (owner: 10Jcrespo) [14:37:01] PROBLEM - Check systemd state on mw1275 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[14:37:12] (03CR) 10Muehlenhoff: [C: 04-1] "There's no partman recipe configured for mwmaint1001, I think it should simply use the mw-raid1-lvm.cfg recipe as the actual application s" [dns] - 10https://gerrit.wikimedia.org/r/430518 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn) [14:37:21] PROBLEM - nutcracker port on mw1275 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [14:37:51] PROBLEM - nutcracker process on mw1275 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nutcracker), command name nutcracker [14:38:31] (03PS7) 10Vgutierrez: varnishtlsinspector: send TLS connection details to logstash [puppet] - 10https://gerrit.wikimedia.org/r/430593 (https://phabricator.wikimedia.org/T193376) [14:38:52] RECOVERY - mediawiki-installation DSH group on mw1301 is OK: OK [14:40:35] ottomata: we took care of it a few hours ago. [14:40:38] Thanks, though :-) [14:40:55] (03CR) 10Vgutierrez: varnishtlsinspector: send TLS connection details to logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/430593 (https://phabricator.wikimedia.org/T193376) (owner: 10Vgutierrez) [14:41:07] (03CR) 10Elukey: [C: 031] memcached: deprecate Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/429221 (https://phabricator.wikimedia.org/T183454) (owner: 10Filippo Giunchedi) [14:41:25] (03CR) 10Muehlenhoff: [C: 031] admin: update comments about terbium to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/430523 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [14:41:48] (03CR) 10Andrew Bogott: "Do we have prometheus monitoring to replace these?" 
[puppet] - 10https://gerrit.wikimedia.org/r/429224 (https://phabricator.wikimedia.org/T183454) (owner: 10Filippo Giunchedi) [14:42:35] (03CR) 10Vgutierrez: [C: 032] varnishtlsinspector: send TLS connection details to logstash [puppet] - 10https://gerrit.wikimedia.org/r/430593 (https://phabricator.wikimedia.org/T193376) (owner: 10Vgutierrez) [14:43:06] (03PS1) 10Jcrespo: mariadb: Remove mediawiki references to db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430599 (https://phabricator.wikimedia.org/T193736) [14:43:26] (03CR) 10Muehlenhoff: [C: 031] "Ack, looks good. I couldn't test this class in PCC, so didn't spot this. Sorry." [puppet] - 10https://gerrit.wikimedia.org/r/430425 (owner: 10Thcipriani) [14:43:32] (03PS2) 10Muehlenhoff: Remove duplication ghostscript package declaration [puppet] - 10https://gerrit.wikimedia.org/r/430425 (owner: 10Thcipriani) [14:44:21] !log imarlier@tin Started deploy [performance/coal@bd7568a]: verify coal is deploying properly after shutdown on graphite hosts [14:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:40] (03CR) 10Filippo Giunchedi: [C: 04-1] "Actually realized memcached collector is more widely used than I thought, this will have to wait" [puppet] - 10https://gerrit.wikimedia.org/r/429221 (https://phabricator.wikimedia.org/T183454) (owner: 10Filippo Giunchedi) [14:46:38] (03CR) 10Muehlenhoff: [C: 032] Remove duplication ghostscript package declaration [puppet] - 10https://gerrit.wikimedia.org/r/430425 (owner: 10Thcipriani) [14:47:02] !log Manually set offline disk #1 on db1063 so it can be replaced [14:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:08] (03PS1) 10Vgutierrez: varnishtlsinspector: fix class name (typo) [puppet] - 10https://gerrit.wikimedia.org/r/430600 [14:51:37] ouch... 
missed that, bad eye day [14:51:47] (03PS2) 10Vgutierrez: varnishtlsinspector: fix class name (typo) [puppet] - 10https://gerrit.wikimedia.org/r/430600 (https://phabricator.wikimedia.org/T193376) [14:52:04] volans: 3 of us missed that actually [14:52:25] (03CR) 10Anomie: [C: 031] "Yaml looks good. So does the same output query." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/430024 (https://phabricator.wikimedia.org/T174047) (owner: 10Bstorm) [14:52:26] but hey, I was consistent with the typo, so the code worked [14:53:04] (03CR) 10Vgutierrez: [C: 032] varnishtlsinspector: fix class name (typo) [puppet] - 10https://gerrit.wikimedia.org/r/430600 (https://phabricator.wikimedia.org/T193376) (owner: 10Vgutierrez) [14:54:21] PROBLEM - MegaRAID on db1063 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [14:54:22] ACKNOWLEDGEMENT - MegaRAID on db1063 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T193747 [14:54:22] (03CR) 10Dzahn: "did that here https://gerrit.wikimedia.org/r/#/c/430519/" [dns] - 10https://gerrit.wikimedia.org/r/430518 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn) [14:54:27] 10Operations, 10ops-eqiad: Degraded RAID on db1063 - https://phabricator.wikimedia.org/T193747#4178568 (10ops-monitoring-bot) [14:54:30] volans: confirmed, automation keeps working ^ [14:54:43] marostegui: lol [14:54:43] (03PS1) 10Hoo man: Wikidata entity dumps: Allow continuing [puppet] - 10https://gerrit.wikimedia.org/r/430604 (https://phabricator.wikimedia.org/T193688) [14:55:45] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1063 - https://phabricator.wikimedia.org/T193747#4178575 (10Marostegui) p:05Triage>03Normal a:03Cmjohnson This is m1 master - we failed the disk manually as it has errors [14:56:18] volans: you are welcome [14:56:19] XDD [14:56:52] you now need to perform a weekly test of it [14:57:00] from now on... 
:D [14:57:16] hahah [14:58:41] ACKNOWLEDGEMENT - Device not healthy -SMART- on db1063 is CRITICAL: cluster=mysql device=megaraid,1 instance=db1063:9100 job=node site=eqiad Marostegui https://phabricator.wikimedia.org/T193747 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1063&var-datasource=eqiad%2520prometheus%252Fops [15:01:10] PROBLEM - mediawiki-installation DSH group on mw1275 is CRITICAL: Host mw1275 is not in mediawiki-installation dsh group [15:02:13] I imagine y'all are in the middle of something right now, but if someone is able to look into the flood of attempted logins enwiki has gotten today that would be appreciated. [15:04:35] !log imarlier@tin Started deploy [performance/coal@762d160]: verify coal is deploying properly after shutdown on graphite hosts [15:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:45] heya i don't have any scrollback but i feel like someone in here should be aware that there have been more than 20 reports of several failed login attempts across en-wp accounts since 7 am EST ranging from established users to new. Is this a bug (would yall even know this?) or some brute force attempt at getting into accounts? [15:04:49] !log imarlier@tin Finished deploy [performance/coal@762d160]: verify coal is deploying properly after shutdown on graphite hosts (duration: 00m 14s) [15:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:13] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1063 - https://phabricator.wikimedia.org/T193747#4178568 (10RobH) The warranty on db1063 has expired, and is no longer under warranty support. Any failed disks will need to be replaced from shelf spares. 
[15:06:05] TheDragonFire: Chrissymad we have been looking into it, thanks for the reports [15:06:12] cool, thanks jynus [15:07:54] 10Operations, 10hardware-requests, 10Patch-For-Review: request to assign spare systems as terbium equivalent - https://phabricator.wikimedia.org/T192185#4178627 (10RobH) [15:08:44] (03CR) 10Hoo man: "Fully tested for both manual abort (Ctrl+c) and 5 script failures in a row. Continued dump runs produce equal results to non-interrupted d" [puppet] - 10https://gerrit.wikimedia.org/r/430604 (https://phabricator.wikimedia.org/T193688) (owner: 10Hoo man) [15:11:52] I have also received an email telling me someone has tried to log in with my account [15:13:21] ACKNOWLEDGEMENT - Check systemd state on mw1275 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Muehlenhoff hardware issue under investigation, see T192902 [15:13:21] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1275 is CRITICAL: Host mw1275 is not in mediawiki-installation dsh group Muehlenhoff hardware issue under investigation, see T192902 [15:13:21] ACKNOWLEDGEMENT - nutcracker port on mw1275 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused Muehlenhoff hardware issue under investigation, see T192902 [15:13:21] ACKNOWLEDGEMENT - nutcracker process on mw1275 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nutcracker), command name nutcracker Muehlenhoff hardware issue under investigation, see T192902 [15:22:30] RECOVERY - mediawiki-installation DSH group on mw1231 is OK: OK [15:24:10] RECOVERY - mediawiki-installation DSH group on mw1232 is OK: OK [15:28:33] (03CR) 10ArielGlenn: [C: 032] Wikidata entity dumps: Allow continuing [puppet] - 10https://gerrit.wikimedia.org/r/430604 (https://phabricator.wikimedia.org/T193688) (owner: 10Hoo man) [15:28:42] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1063 - https://phabricator.wikimedia.org/T193747#4178684 (10Marostegui) Disk replaced by 
@Cmjohnson ``` root@db1063:~# megacli -PDRbld -ShowProg -PhysDrv [32:1] -aALL Rebuild Progress on Device at Enclosure 32, Slot 1 Completed 2% in 1 Minutes. ``` [15:31:56] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: labstore1003 SMART failure - https://phabricator.wikimedia.org/T193651#4178701 (10Cmjohnson) There were 2 disks in labstore1003 that were bad or going bad. One on labstore1003 and one on labstore1003 array 2. Replaced them both. [15:37:24] (03CR) 10Chad: [C: 031] Enable base::service_auto_restart for jenkins on release servers [puppet] - 10https://gerrit.wikimedia.org/r/430562 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:38:37] Hi guys, I just got an e-mail someone attempted to log-in into my account [15:38:46] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db2011 - https://phabricator.wikimedia.org/T187886#4178726 (10Papaul) [15:39:11] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db2011 - https://phabricator.wikimedia.org/T187886#3989185 (10Papaul) a:05Papaul>03RobH @RobH done on my end [15:39:26] odder: i think most everyone did at this point :( [15:41:03] Oh, something is going on? 
[15:41:46] Yeah, something [15:42:29] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decommission db2011 - https://phabricator.wikimedia.org/T187886#4178762 (10RobH) [15:42:34] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decommission db2011 - https://phabricator.wikimedia.org/T187886#3989185 (10RobH) 05Open>03Resolved [15:42:37] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 codfw machines - https://phabricator.wikimedia.org/T183470#4178768 (10RobH) [15:42:56] I have 2FA as per standard procedure so happy the account is secure, was just wondering if there might be a wider attempt to breach functionaries' accounts [15:43:23] odder: I think in this case you're not special [15:43:25] Sorry to say :P [15:43:47] (03PS1) 10Vgutierrez: varnishlog: Fix encoding issues on Popen [puppet] - 10https://gerrit.wikimedia.org/r/430610 [15:44:29] 10Operations, 10Analytics: dbstore1002 disk 5 not healthy - https://phabricator.wikimedia.org/T193738#4178362 (10RobH) This has fallen out of warranty as of 2017-02-25, any failed disks (like the one for this task) will need to be replaced from shelf spares. 
[15:45:50] (03CR) 10Ema: [C: 031] varnishlog: Fix encoding issues on Popen [puppet] - 10https://gerrit.wikimedia.org/r/430610 (owner: 10Vgutierrez) [15:45:55] (03CR) 10Vgutierrez: [C: 032] varnishlog: Fix encoding issues on Popen [puppet] - 10https://gerrit.wikimedia.org/r/430610 (owner: 10Vgutierrez) [15:46:46] Reedy: Well, glad to know you guys are aware of this [15:50:23] (03PS1) 10AndyRussG: CentralNotice EventLogging banner impression data test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430613 (https://phabricator.wikimedia.org/T183978) [15:53:42] (03PS1) 10Arturo Borrero Gonzalez: openstack: api-paste.ini: rearange config file [puppet] - 10https://gerrit.wikimedia.org/r/430614 (https://phabricator.wikimedia.org/T193657) [15:57:42] PROBLEM - MegaRAID on labstore1003 is CRITICAL: CRITICAL: 2 failed LD(s) (Degraded, Partially Degraded) [15:57:46] ACKNOWLEDGEMENT - MegaRAID on labstore1003 is CRITICAL: CRITICAL: 2 failed LD(s) (Degraded, Partially Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T193757 [15:57:52] 10Operations, 10ops-eqiad: Degraded RAID on labstore1003 - https://phabricator.wikimedia.org/T193757#4178821 (10ops-monitoring-bot) [15:59:07] 10Operations, 10Graphite, 10Services (watching), 10User-fgiunchedi: Cassandra Graphite metrics space usage audit and cleanup - https://phabricator.wikimedia.org/T191315#4178829 (10RobH) p:05Triage>03Normal [16:00:04] godog, moritzm, and _joe_: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180503T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. 
[16:01:31] 10Operations, 10ops-eqiad: Degraded RAID on labstore1003 - https://phabricator.wikimedia.org/T193757#4178834 (10RobH) p:05Triage>03High [16:01:34] 10Operations, 10Analytics: dbstore1002 disk 5 not healthy - https://phabricator.wikimedia.org/T193738#4178835 (10RobH) p:05Triage>03High [16:02:54] (03PS3) 10RobH: coal: require python-tz [puppet] - 10https://gerrit.wikimedia.org/r/430421 (https://phabricator.wikimedia.org/T193660) (owner: 10Imarlier) [16:03:37] (03CR) 10RobH: [C: 032] coal: require python-tz [puppet] - 10https://gerrit.wikimedia.org/r/430421 (https://phabricator.wikimedia.org/T193660) (owner: 10Imarlier) [16:05:08] 10Operations, 10Patch-For-Review: Merge one-line puppet fix - https://phabricator.wikimedia.org/T193660#4178841 (10RobH) 05Open>03Resolved a:03RobH discussed with @imarlier via irc and merged live. [16:05:43] 10Operations, 10Analytics: dbstore1002 disk 5 not healthy - https://phabricator.wikimedia.org/T193738#4178844 (10Marostegui) Not really sure if this is high priority. The disk hasn't failed yet even [16:05:45] 10Operations, 10ops-eqiad: tungsten disk 1 and 8 SMART failure - https://phabricator.wikimedia.org/T193628#4178845 (10RobH) p:05Triage>03Low [16:06:51] 10Operations, 10Puppet, 10Patch-For-Review: deprecate and remove --autoload in uwsgi puppet class - https://phabricator.wikimedia.org/T192102#4178858 (10RobH) p:05Triage>03Normal As part of SRE clinic duty, I'm reviewing all unassigned, needs triage tasks in #operations and attempting to review if any ar... [16:07:09] 10Operations, 10Discovery-Search: migrate elasticsearch to stretch - https://phabricator.wikimedia.org/T193649#4178863 (10RobH) p:05Triage>03Normal As part of SRE clinic duty, I'm reviewing all unassigned, needs triage tasks in #operations and attempting to review if any are critical, or if they are normal... 
[16:07:22] RECOVERY - Device not healthy -SMART- on db1063 is OK: (No output returned from plugin) https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1063&var-datasource=eqiad%2520prometheus%252Fops [16:07:57] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, 10Elasticsearch: Alert when elasticsearch writes are frozen for too long - https://phabricator.wikimedia.org/T193605#4178866 (10RobH) p:05Triage>03Normal As part of SRE clinic duty, I'm reviewing all unassigned, needs triage tasks in #ope... [16:08:10] weee clinic duty. [16:08:15] setting priorities for all the things [16:08:34] so the strange ones that i have no idea how to prioritize stand out. [16:17:54] hey reddy will they ever explain what the cause of this was? [16:28:09] (03PS1) 10Jgreen: add frbast.wm.o cross-dc service alias [dns] - 10https://gerrit.wikimedia.org/r/430622 [16:29:55] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4178915 (10Vgutierrez) @Cmjohnson, I've been trying to boot lvs1016 with PXE with no luck, after some debugging with @ayounsi we've seen traffic incoming traffic on eth2 (asw... 
[16:31:22] RECOVERY - Device not healthy -SMART- on labstore1003 is OK: (No output returned from plugin) https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labstore1003&var-datasource=eqiad%2520prometheus%252Fops [16:31:27] (03CR) 10Jgreen: [C: 032] add frbast.wm.o cross-dc service alias [dns] - 10https://gerrit.wikimedia.org/r/430622 (owner: 10Jgreen) [16:34:04] !log authdns-update to add frbast.wikimedia.org service alias [16:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:52] RECOVERY - MegaRAID on db1063 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [16:38:14] 10Operations, 10Wikidata: Move dispatching of wikidata to a dedicated node - https://phabricator.wikimedia.org/T193733#4178954 (10RobH) p:05Triage>03Normal As part of SRE clinic duty, I'm reviewing all unassigned, needs triage tasks in #operations and attempting to review if any are critical, or if they ar... [16:38:35] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: Upgrade deployment-prep deployment servers to stretch - https://phabricator.wikimedia.org/T192561#4178957 (10RobH) p:05Triage>03Normal As part of SRE clinic duty, I'm reviewing all unassigned, needs triage tasks in #operations and a... [16:38:57] 10Operations, 10Puppet, 10Release-Engineering-Team, 10puppet-compiler, 10Continuous-Integration-Config: Figure out a way to enable volunteers to use the puppet compiler - https://phabricator.wikimedia.org/T192532#4178959 (10RobH) p:05Triage>03Normal [16:40:39] 10Operations, 10MediaWiki-Debug-Logger, 10Wikimedia-Logstash, 10Release-Engineering-Team (Watching / External): Logstash no longer captures DB queries in debug mode - https://phabricator.wikimedia.org/T190455#4178963 (10RobH) p:05Triage>03Normal As part of SRE clinic duty, I'm reviewing all unassigned,... 
[16:40:54] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4178965 (10Cmjohnson) @Vgutierrez I flipped the cables. I did put the cables into what is on the card labeled port 1 and port 2 but I think the card is inserted upside down o... [16:42:51] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: Add HTTPS support to wdqs-internal service - https://phabricator.wikimedia.org/T193473#4178977 (10RobH) p:05Triage>03Normal As part of SRE clinic duty, I'm reviewing all unassigned, needs triag... [16:44:10] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4178982 (10BBlack) I don't think it was a flip of the two ports on the same card that was needed, but instead switching all the cables between the two cards (order of cards,... [16:46:04] 10Operations, 10Puppet, 10Patch-For-Review: uwsgi::app sorts config keys, but the .ini file behavior depends on order - https://phabricator.wikimedia.org/T191648#4178994 (10RobH) p:05Triage>03Normal [16:46:54] 10Operations, 10Release-Engineering-Team, 10Scap, 10Scoring-platform-team: Deployment git server can't supply ORES hosts in parallel - https://phabricator.wikimedia.org/T191842#4178996 (10RobH) p:05Triage>03Normal As part of SRE clinic duty, I'm reviewing all unassigned, needs triage tasks in #operatio... [16:47:42] 10Operations, 10cloud-services-team, 10monitoring: Prometheus vs. CPU usage vs. hyperthreading - https://phabricator.wikimedia.org/T193272#4178999 (10RobH) p:05Triage>03Normal As part of SRE clinic duty, I'm reviewing all unassigned, needs triage tasks in #operations and attempting to review if any are c... 
[16:48:57] 10Operations, 10Puppet, 10Analytics, 10Cassandra, and 4 others: Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948#4179003 (10RobH) p:05Triage>03Normal I'm not quite sure if this is a normal or a high priority task. Seems normal, since we aren't requir... [16:51:26] 10Operations, 10Wikimedia-General-or-Unknown: Figure out why HHVM isn't using error_document404 setting - https://phabricator.wikimedia.org/T187754#4179027 (10RobH) p:05Triage>03Normal As part of SRE clinic duty, I'm reviewing all unassigned, needs triage tasks in #operations and attempting to review if an... [16:52:10] 10Operations, 10DBA: Multiple pages with no revisions - https://phabricator.wikimedia.org/T112282#4179032 (10RobH) p:05High>03Low [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / Parsoid / Citoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180503T1700). [17:01:24] no parsoid deploy today [17:01:28] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4179058 (10Vgutierrez) Ok... this is the current picture from what I see: eth0 is still connected to asw2-b:xe-4/0/34 instead of asw2-d:xe-7/0/15 asw2-c:xe-4/0/5 is showing n... 
[17:05:28] Nothing for ORES [17:07:10] (03PS1) 10Huji: Add several rights to eliminators in fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430627 (https://phabricator.wikimedia.org/T76553) [17:07:52] (03PS2) 10Huji: Add several rights to eliminators in fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430627 (https://phabricator.wikimedia.org/T176553) [17:08:56] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4179083 (10BBlack) [I still bet if you undo the already-done cable swap, and then switch the two cards' cables (leaving port1/2 ordering the same), this will all magically co... [17:09:48] (03CR) 10Huji: "To the reviewer: the rights this adds are those held by the patrollers, autopatrollers, and rollbackers. The previous patch did not take i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430627 (https://phabricator.wikimedia.org/T176553) (owner: 10Huji) [17:14:15] 10Operations: Ship host syslogs to ELK - https://phabricator.wikimedia.org/T193766#4179109 (10herron) p:05Triage>03Normal [17:14:47] 10Operations: Ship host syslogs to ELK - https://phabricator.wikimedia.org/T193766#4179122 (10herron) [17:14:49] 10Operations, 10Puppet: Knock down puppet 4 deprecation warnings - https://phabricator.wikimedia.org/T193664#4179121 (10herron) [17:17:49] (03CR) 10BryanDavis: [C: 031] maintain_kubeusers.pp: use require_package and add python3-yaml [puppet] - 10https://gerrit.wikimedia.org/r/430539 (https://phabricator.wikimedia.org/T190893) (owner: 10Zhuyifei1999) [17:25:56] 10Operations, 10Discovery-Search: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649#4179195 (10debt) [17:27:30] 10Operations, 10Discovery-Search: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649#4179215 (10debt) moving from debian jessie to debian stretch [17:35:41] 10Operations, 10Wikidata: Move 
dispatching of wikidata to a dedicated node - https://phabricator.wikimedia.org/T193733#4179258 (10Ladsgroup) >>! In T193733#4178239, @hoo wrote: > What makes you think that Terbium is to unstable for this? Terbium seems to always have more than enough spare resources, and the di... [17:36:26] 10Operations, 10SRE-Access-Requests: Requesting access to Logstash for jbennett - https://phabricator.wikimedia.org/T193771#4179259 (10JBennett) [17:36:36] 10Operations, 10Analytics: dbstore1002 disk 5 not healthy - https://phabricator.wikimedia.org/T193738#4179270 (10RobH) p:05High>03Normal [17:38:09] 10Operations, 10SRE-Access-Requests: Requesting access to Logstash for jbennett - https://phabricator.wikimedia.org/T193771#4179279 (10RobH) [17:38:58] 10Operations, 10SRE-Access-Requests: Requesting access to Logstash for jbennett - https://phabricator.wikimedia.org/T193771#4179259 (10RobH) [17:44:06] 10Operations, 10SRE-Access-Requests: Requesting access to Logstash for jbennett - https://phabricator.wikimedia.org/T193771#4179259 (10RobH) @JBennett: Can you clarify exactly what you need to access on logstash? Do you just need to read files, or administrate the node? I'm not really sure what you want to d... [17:44:08] 10Operations, 10SRE-Access-Requests: Requesting access to Logstash for jbennett - https://phabricator.wikimedia.org/T193771#4179308 (10RobH) @JBennett: Can you clarify exactly what you need to access on logstash? Do you just need to read files, or administrate the node? I'm not really sure what you want to d... [17:44:21] 10Operations, 10SRE-Access-Requests: Requesting access to Logstash for jbennett - https://phabricator.wikimedia.org/T193771#4179309 (10JBennett) Your Full Name: John Bennett developer access userid: jbennett ssh key: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDIOdNYh9J4uSm7uuVZG7zttu/9Xtk5IaCPSokdOyhNnAoMBE51mnTZTr... 
[17:45:47] 10Operations, 10SRE-Access-Requests: Requesting access to Logstash for jbennett - https://phabricator.wikimedia.org/T193771#4179323 (10JBennett) I need to be able to access logstash to investigate security incidents. So, I'll need similar access to Brian Wolff or Sam Reed. [17:49:36] 10Operations, 10Traffic, 10Patch-For-Review: Gather 24h data cluster wide of AES128-SHA usage - https://phabricator.wikimedia.org/T193376#4179327 (10Vgutierrez) Data is currently being gathered, it can be seen here: https://logstash.wikimedia.org/app/kibana#/discover/958769b0-4eef-11e8-8e04-89a38b6a810e?_g=() [17:49:58] 10Operations, 10SRE-Access-Requests: Requesting access to Logstash for jbennett - https://phabricator.wikimedia.org/T193771#4179329 (10RobH) [17:50:41] 10Operations, 10Wikidata: Move dispatching of wikidata to a dedicated node - https://phabricator.wikimedia.org/T193733#4179341 (10hoo) Ok, in that case this sounds like a valid request to get an own VM (or even bare metal server). Depending on how fast this can be done, this is a good short term solution for... [17:51:51] 10Operations, 10LDAP-Access-Requests: Requesting access to Logstash for jbennett - https://phabricator.wikimedia.org/T193771#4179259 (10RobH) p:05Triage>03Normal a:03RobH IRC Update: John actually needs access to the https frontend, not shell access. So moving from #sre-access-requests to #ldap-access-... 
[17:55:53] (03PS1) 10RobH: adding john bennett to ldap users section [puppet] - 10https://gerrit.wikimedia.org/r/430635 (https://phabricator.wikimedia.org/T193771) [17:56:09] (03CR) 10RobH: [C: 032] adding john bennett to ldap users section [puppet] - 10https://gerrit.wikimedia.org/r/430635 (https://phabricator.wikimedia.org/T193771) (owner: 10RobH) [17:59:08] !log mw2254,mw2183,mw2184 - wmf-auto-reimage with stretch and raid/lvm [17:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:56] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Requesting access to Logstash for jbennett - https://phabricator.wikimedia.org/T193771#4179403 (10RobH) 05Open>03Resolved Confirmed wikitech account, added to ldap users section of admin module, and added to wmf ldap group. [18:00:05] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Morning SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180503T1800). [18:00:05] AndyRussG: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:01:06] I'm here :) [18:01:56] Heads up, I'd like to do a security related deploy. Please don't do SWAT for the moment [18:02:17] o/ [18:02:34] Aww [18:02:56] bawolff: I think I'm the only one with a SWAT patch, just a teensy config change [18:03:19] AndyRussG: I'll be done shortly. Its for an urgent thing that's happening right now [18:03:28] (sorry) [18:03:30] Ah no sorry Krinkle I see u added somthing [18:03:38] bawolff: no prob! :) good luck [18:06:53] Krinkle: bawolff: so actually in that case I'm gonna relocate quickly, back in about 20 min.... thx! [18:08:20] Umm, did the hash for mwdebug1002 change at some point? 
[18:08:46] 10Operations, 10HHVM, 10Patch-For-Review, 10User-Elukey: Upgrade mw* servers to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T174431#3561778 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on neodymium.eqiad.wmnet for hosts: ``` ['mw2229.codfw.wmnet', 'mw2231.codfw.w... [18:08:51] e.g. is the current hash I2tXXakPgkCcdEbBPZl80k6oZ8hmLNFL8ULa5rkpJZw because my ssh is warning about it having changed since last time I viewed it [18:09:03] 05-01 [18:09:04] 07:11 moritzm: reimaging mwdebug1002 to stretch [18:09:13] 2 days ago [18:09:19] Cool, thanks [18:09:32] PROBLEM - Host mw2240 is DOWN: PING CRITICAL - Packet loss = 100% [18:09:35] bawolff: It just got upgraded to … yes. [18:09:42] PROBLEM - Host mw2229 is DOWN: PING CRITICAL - Packet loss = 100% [18:09:47] !log mw2229,mw2231,mw2240 - wmf-auto-reimage with --new switch because their puppet cert wasn't found on puppetmaster, treated as new hosts that didnt exist before [18:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:12] RECOVERY - Host mw2240 is UP: PING WARNING - Packet loss = 80%, RTA = 37.41 ms [18:10:13] RECOVERY - Host mw2229 is UP: PING OK - Packet loss = 0%, RTA = 36.92 ms [18:10:32] PROBLEM - Nginx local proxy to apache on mw2240 is CRITICAL: connect to address 10.192.0.66 and port 443: Connection refused [18:10:42] PROBLEM - HHVM rendering on mw2231 is CRITICAL: connect to address 10.192.0.57 and port 80: Connection refused [18:10:42] PROBLEM - Apache HTTP on mw2229 is CRITICAL: connect to address 10.192.0.54 and port 80: Connection refused [18:11:27] mutante: if the host had already the cert removed the flag to use is --no-verify, not --new [18:11:35] just FYI, they are pretty similar [18:11:36] dowtntimed [18:11:51] volans: oh, ok thanks [18:12:12] mutante: also we weren't able to repro your issue with mori.tz [18:12:17] volans: the regular installs worked fine since the timeout was raised [18:12:20] all 
reimages were successfull, also in parallel [18:12:26] now they are, since that merged [18:12:41] ok, good to know :D we weren't able to repro even before the merge though :D [18:12:48] the ones you see above had a separate issue where the puppet cert is already gone [18:13:02] so you had a higher level than luckyness that we had :-P [18:13:02] PROBLEM - HHVM processes on mw2231 is CRITICAL: Return code of 255 is out of bounds [18:13:02] PROBLEM - nutcracker port on mw2231 is CRITICAL: Return code of 255 is out of bounds [18:13:02] PROBLEM - HHVM processes on mw2229 is CRITICAL: Return code of 255 is out of bounds [18:13:03] PROBLEM - Disk space on mw2240 is CRITICAL: Return code of 255 is out of bounds [18:13:03] PROBLEM - MD RAID on mw2231 is CRITICAL: Return code of 255 is out of bounds [18:13:03] PROBLEM - MD RAID on mw2229 is CRITICAL: Return code of 255 is out of bounds [18:13:03] PROBLEM - dhclient process on mw2231 is CRITICAL: Return code of 255 is out of bounds [18:13:04] PROBLEM - Disk space on mw2231 is CRITICAL: Return code of 255 is out of bounds [18:13:04] PROBLEM - nutcracker port on mw2229 is CRITICAL: Return code of 255 is out of bounds [18:13:05] PROBLEM - dhclient process on mw2229 is CRITICAL: Return code of 255 is out of bounds [18:13:12] PROBLEM - HHVM processes on mw2240 is CRITICAL: Return code of 255 is out of bounds [18:13:12] PROBLEM - DPKG on mw2229 is CRITICAL: Return code of 255 is out of bounds [18:13:12] PROBLEM - configured eth on mw2229 is CRITICAL: Return code of 255 is out of bounds [18:13:13] grr [18:13:13] PROBLEM - nutcracker process on mw2229 is CRITICAL: Return code of 255 is out of bounds [18:13:13] PROBLEM - Check whether ferm is active by checking the default input chain on mw2231 is CRITICAL: Return code of 255 is out of bounds [18:13:21] i was able to repro it with like 8 hosts in a row, but not anymore :p [18:13:22] PROBLEM - configured eth on mw2231 is CRITICAL: Return code of 255 is out of bounds [18:13:33] i just 
downtimed all that [18:14:14] well we can ignore it easily enough [18:16:52] PROBLEM - puppet last run on mw2231 is CRITICAL: Return code of 255 is out of bounds [18:18:43] ACKNOWLEDGEMENT - puppet last run on mw2231 is CRITICAL: Return code of 255 is out of bounds daniel_zahn known [18:23:15] disabled notifications for those until it's all clear [18:25:02] PROBLEM - Host mw2229 is DOWN: PING CRITICAL - Packet loss = 100% [18:25:33] RECOVERY - Host mw2229 is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms [18:26:12] PROBLEM - Host mw2231 is DOWN: PING CRITICAL - Packet loss = 100% [18:26:12] ACKNOWLEDGEMENT - Host mw2231 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn reinstall [18:26:32] RECOVERY - Host mw2231 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms [18:29:39] (03CR) 10Volans: [C: 031] "LGTM but please test it before merging" [puppet] - 10https://gerrit.wikimedia.org/r/430079 (owner: 10Ottomata) [18:30:18] bawolff: done with security patches? If so I can SWAT your change Krinkle [18:30:39] Its not 100% working as I expected, but I'm done for now, so you can go ahead [18:31:04] okie doke, thanks [18:31:38] bawolff: this is concerning https://grafana.wikimedia.org/dashboard/db/authentication-metrics?panelId=13&fullscreen&orgId=1 [18:31:49] (03Abandoned) 10Ottomata: Kafka main-codfw patch 3 [puppet] - 10https://gerrit.wikimedia.org/r/430451 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [18:31:55] (03Abandoned) 10Ottomata: Kafka main-codfw patch 4 [puppet] - 10https://gerrit.wikimedia.org/r/430503 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [18:32:05] (03PS1) 10Ottomata: Kafka main-codfw patch 3 - remove api.version [puppet] - 10https://gerrit.wikimedia.org/r/430640 (https://phabricator.wikimedia.org/T167039) [18:33:02] (03PS2) 10Dzahn: admin: update comments about terbium to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/430523 (https://phabricator.wikimedia.org/T192092) [18:33:34] Who is doing SWAT? 
[18:33:56] (03CR) 10Dzahn: [C: 032] admin: update comments about terbium to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/430523 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [18:33:57] Hauskatze: YEs, I'm aware, and we are investigating [18:34:37] Jayprakash12345: I am [18:34:49] just noticed your patch [18:35:00] https://gerrit.wikimedia.org/r/#/c/430360/ Just simple please visit [18:36:14] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: Refactor varnishospital and varnishslowlog - https://phabricator.wikimedia.org/T193489#4179457 (10Gilles) 05Open>03Resolved [18:36:15] (03PS5) 10Thcipriani: Enable ULS webfonts by default at Bengali Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430360 (https://phabricator.wikimedia.org/T193367) (owner: 10Bodhisattwa) [18:36:20] nick /AndyRussG [18:36:31] rrgh [18:37:30] thcipriani: hi! if ur swatting, feel like pushing out a wee config change? thx in advance! [18:37:59] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430360 (https://phabricator.wikimedia.org/T193367) (owner: 10Bodhisattwa) [18:38:11] AndyRussG: sure thing [18:38:18] thcipriani: thanks! 
:D [18:39:27] (03Merged) 10jenkins-bot: Enable ULS webfonts by default at Bengali Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430360 (https://phabricator.wikimedia.org/T193367) (owner: 10Bodhisattwa) [18:40:09] (03PS3) 10Dzahn: rename mw1297 to mwmaint1001, partman for mwmaint* [puppet] - 10https://gerrit.wikimedia.org/r/430519 (https://phabricator.wikimedia.org/T192185) [18:40:56] (03PS4) 10Dzahn: rename mw1297 to mwmaint1001, partman for mwmaint* [puppet] - 10https://gerrit.wikimedia.org/r/430519 (https://phabricator.wikimedia.org/T192185) [18:41:08] Jayprakash12345: you change is live on mwdebug1002, check please [18:41:20] ok [18:41:44] (03PS2) 10Thcipriani: CentralNotice EventLogging banner impression data test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430613 (https://phabricator.wikimedia.org/T183978) (owner: 10AndyRussG) [18:41:46] looks good, go ahead [18:41:51] * thcipriani syncs [18:44:25] (03CR) 10Dzahn: [C: 032] "partman recipe choice per comment on https://gerrit.wikimedia.org/r/#/c/430518/" [puppet] - 10https://gerrit.wikimedia.org/r/430519 (https://phabricator.wikimedia.org/T192185) (owner: 10Dzahn) [18:44:37] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:430360|Enable ULS webfonts by default at Bengali Wikisource]] T193367 (duration: 01m 18s) [18:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:42] T193367: Enable ULS webfonts by default at Bengali Wikisource - https://phabricator.wikimedia.org/T193367 [18:44:47] ^ Jayprakash12345 should be live now [18:45:37] mutante: got a lot of hostkey warnings doing scap, I assume that those are due to reimaging happening? 
[18:46:28] thcipriani: oh, unfortunately yes, there were special cases i am fixing [18:46:41] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430613 (https://phabricator.wikimedia.org/T183978) (owner: 10AndyRussG) [18:47:23] (03Merged) 10jenkins-bot: CentralNotice EventLogging banner impression data test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430613 (https://phabricator.wikimedia.org/T183978) (owner: 10AndyRussG) [18:47:37] thcipriani: mw2229,mw2231 and mw2240 | i'll try to clean it up [18:47:38] cool, just wanted to make sure :) [18:47:57] Krinkle: you change is live on mwdebug1002, check please [18:48:03] *your [18:48:43] !log dzahn@neodymium conftool action : set/pooled=inactive; selector: name=mw2231.codfw.wmnet [18:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:57] !log dzahn@neodymium conftool action : set/pooled=inactive; selector: name=mw2229.codfw.wmnet [18:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:08] !log dzahn@neodymium conftool action : set/pooled=inactive; selector: name=mw2240.codfw.wmnet [18:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:29] AndyRussG: I've also pulled over your change to mwdebug1002, if there's anything you can/want to check there. [18:51:26] (03CR) 10Dzahn: "ack, amended and merged using the mw-raid1-lvm recipe in https://gerrit.wikimedia.org/r/#/c/430519/" [dns] - 10https://gerrit.wikimedia.org/r/430518 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn) [18:51:30] checking [18:51:40] thcipriani: Thanks :) [18:52:09] yw :) [18:52:25] (03PS3) 10Dzahn: rename wmf6936 from mw1297 to mwmaint1001 [dns] - 10https://gerrit.wikimedia.org/r/430518 (https://phabricator.wikimedia.org/T192457) [18:54:31] thcipriani: Hm.. 
not seeing it [18:54:31] (03CR) 10Dzahn: [C: 032] rename wmf6936 from mw1297 to mwmaint1001 [dns] - 10https://gerrit.wikimedia.org/r/430518 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn) [18:54:40] thcipriani: checking! [18:54:55] lemme make sure I pulled everything... [18:56:50] OK. got it now, sorry about that. [18:56:53] was checking enwiki [18:56:54] :/ [18:56:58] LGTM [18:57:05] cool, going live [18:57:57] (03PS3) 10Dzahn: relforge/mariadb-labtest: adjust terbium comments, rename ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/430526 (https://phabricator.wikimedia.org/T192092) [18:58:25] (03PS4) 10Dzahn: relforge/mariadb-labtest: adjust terbium comments, rename ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/430526 (https://phabricator.wikimedia.org/T192092) [18:58:45] thcipriani: lgtm! [18:59:01] AndyRussG: awesome, after the current sync is complete, I'll get your change out [18:59:10] thcipriani: okok thanks [18:59:11] (03PS5) 10Dzahn: relforge/mariadb-labtest: adjust terbium comments, rename ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/430526 (https://phabricator.wikimedia.org/T192092) [18:59:33] !log thcipriani@tin Synchronized php-1.32.0-wmf.2/extensions/NavigationTiming/modules/ext.navigationTiming.js: SWAT: [[gerrit:430634|Emit SaveTiming without relying on getNavTiming()]] T193693 (duration: 01m 16s) [18:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:38] T193693: Exception in load-callback for schema.SaveTiming: "ext.navigationTiming.rumSpeedIndex" not loaded - https://phabricator.wikimedia.org/T193693 [18:59:39] ^ Krinkle live now [18:59:40] (03CR) 10Dzahn: [C: 032] "just comments and adjusting a resource name to remove hardcoded "terbium", not actually changing ferm rules or anything" [puppet] - 10https://gerrit.wikimedia.org/r/430526 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [19:00:04] no_justification: It is that lovely time of the day again! 
You are hereby commanded to deploy MediaWiki train. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180503T1900). [19:01:18] (03PS2) 10Dzahn: tcpircbot: add mwmaint1001 to ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/430529 (https://phabricator.wikimedia.org/T192092) [19:01:45] thcipriani: confirmed [19:01:50] PROBLEM - mediawiki-installation DSH group on mw2183 is CRITICAL: Host mw2183 is not in mediawiki-installation dsh group [19:01:51] PROBLEM - mediawiki-installation DSH group on mw2184 is CRITICAL: Host mw2184 is not in mediawiki-installation dsh group [19:01:51] PROBLEM - Disk space on mw2183 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:01:51] PROBLEM - Disk space on mw2184 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:01:51] PROBLEM - Nginx local proxy to apache on mw2231 is CRITICAL: connect to address 10.192.0.57 and port 443: Connection refused [19:01:51] PROBLEM - Check the NTP synchronisation status of timesyncd on mw2231 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:01:51] PROBLEM - nutcracker process on mw2229 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:01:58] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:430613|CentralNotice EventLogging banner impression data test]] T183978 (duration: 01m 04s) [19:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:04] T183978: [Epic] Kafkatee changes - https://phabricator.wikimedia.org/T183978 [19:02:04] ^ AndyRussG live now [19:03:30] PROBLEM - HHVM processes on mw2183 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:03:30] PROBLEM - nutcracker port on mw2183 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:03:30] PROBLEM - HHVM processes on mw2184 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[19:03:30] PROBLEM - nutcracker port on mw2184 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:03:30] PROBLEM - puppet last run on mw2229 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:03:31] PROBLEM - Check whether ferm is active by checking the default input chain on mw2231 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:03:31] PROBLEM - Check size of conntrack table on mw2240 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:03:32] PROBLEM - MD RAID on mw2240 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:03:37] (03PS3) 10Dzahn: mariadb: add mwmaint1001 to grants for production-m5 [puppet] - 10https://gerrit.wikimedia.org/r/430524 (https://phabricator.wikimedia.org/T192092) [19:04:06] thcipriani: K checking [19:04:58] thcipriani: I think it takes a few minutes at least for a config change to make its way to JS, must be the normal RL turnover [19:05:01] PROBLEM - HHVM rendering on mw2229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:10] PROBLEM - HHVM rendering on mw2183 is CRITICAL: connect to address 10.192.32.71 and port 80: Connection refused [19:05:10] PROBLEM - HHVM rendering on mw2184 is CRITICAL: connect to address 10.192.32.72 and port 80: Connection refused [19:05:10] PROBLEM - nutcracker process on mw2183 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:05:10] PROBLEM - nutcracker process on mw2184 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[19:05:11] I'm sorry guys, please don't mind the ones starting with mw21/mw22, they will be gone [19:05:41] (03PS1) 10Zhuyifei1999: Load project name dynamically from /etc/wmcs-project [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/430647 (https://phabricator.wikimedia.org/T190893) [19:05:43] (03CR) 10jenkins-bot: Enable ULS webfonts by default at Bengali Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430360 (https://phabricator.wikimedia.org/T193367) (owner: 10Bodhisattwa) [19:05:49] (03CR) 10jenkins-bot: CentralNotice EventLogging banner impression data test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430613 (https://phabricator.wikimedia.org/T183978) (owner: 10AndyRussG) [19:06:34] (03CR) 10jerkins-bot: [V: 04-1] Load project name dynamically from /etc/wmcs-project [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/430647 (https://phabricator.wikimedia.org/T190893) (owner: 10Zhuyifei1999) [19:07:31] thcipriani: yep all good! [19:07:35] thx again :D [19:07:46] yw! :) [19:08:25] (fwiw we're gonna turn off the feature we just turned on in a few hours, this was just a test to get some EL data in Hive to fiddle with...) [19:11:46] PROBLEM - Check systemd state on mw2183 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:11:46] PROBLEM - Check systemd state on mw2184 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:11:56] PROBLEM - HHVM rendering on mw2231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:12:16] PROBLEM - MD RAID on mw2229 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:12:16] PROBLEM - Check size of conntrack table on mw2229 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:12:16] PROBLEM - nutcracker process on mw2229 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [19:12:56] PROBLEM - nutcracker process on mw2231 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[19:13:06] RECOVERY - Check size of conntrack table on mw2229 is OK: OK: nf_conntrack is 0 % full [19:13:06] RECOVERY - MD RAID on mw2229 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [19:13:17] PROBLEM - nutcracker port on mw2231 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [19:13:20] (03PS2) 10Zhuyifei1999: Load project name dynamically from /etc/wmcs-project [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/430647 (https://phabricator.wikimedia.org/T190893) [19:13:23] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4179677 (10jeblad) [19:14:01] (03CR) 10jerkins-bot: [V: 04-1] Load project name dynamically from /etc/wmcs-project [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/430647 (https://phabricator.wikimedia.org/T190893) (owner: 10Zhuyifei1999) [19:15:47] PROBLEM - Check systemd state on mw2229 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[19:16:15] /me disables notifications as fast as possible but some are slipping through [19:18:41] !log mw1297 - puppet node clean, puppet node deactivate - renaming to mwmaint1001 (T192185) [19:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:46] T192185: request to assign spare systems as terbium equivalent - https://phabricator.wikimedia.org/T192185 [19:19:23] RECOVERY - nutcracker process on mw2229 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [19:19:53] RECOVERY - Check systemd state on mw2229 is OK: OK - running: The system is fully operational [19:20:04] (03PS1) 10AndyRussG: Turn off CentralNotice EventLogging impression data following test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430649 (https://phabricator.wikimedia.org/T183978) [19:23:29] (03PS3) 10Dzahn: Enable base::service_auto_restart for jenkins on release servers [puppet] - 10https://gerrit.wikimedia.org/r/430562 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [19:24:41] (03CR) 10Dzahn: [C: 032] Enable base::service_auto_restart for jenkins on release servers [puppet] - 10https://gerrit.wikimedia.org/r/430562 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [19:25:55] 10Operations, 10HHVM, 10Patch-For-Review, 10User-Elukey: Upgrade mw* servers to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T174431#4179716 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw2231.codfw.wmnet', 'mw2229.codfw.wmnet', 'mw2240.codfw.wmnet'] ``` and were **ALL*... 
[19:26:29] (03CR) 10Dzahn: [C: 032] "releases1001: ../Cron[wmf_auto_restart_jenkins]/ensure: created" [puppet] - 10https://gerrit.wikimedia.org/r/430562 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [19:31:44] (03CR) 10Zhuyifei1999: "No idea what debian-glue is complaining abut" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/430647 (https://phabricator.wikimedia.org/T190893) (owner: 10Zhuyifei1999) [19:32:56] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2229.codfw.wmnet [19:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:58] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2231.codfw.wmnet [19:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:53] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2240.codfw.wmnet [19:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:07] Staging on mwdebug1002 [19:53:13] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2254.codfw.wmnet [19:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:41] (03PS3) 10Zhuyifei1999: Load project name dynamically from /etc/wmcs-project [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/430647 (https://phabricator.wikimedia.org/T190893) [19:59:29] (03CR) 10jerkins-bot: [V: 04-1] Load project name dynamically from /etc/wmcs-project [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/430647 (https://phabricator.wikimedia.org/T190893) (owner: 10Zhuyifei1999) [20:00:23] !log mw2185,mw2186,mw2188 - reinstall with stretch [20:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:23] (03CR) 10EBernhardson: Forward response codes >= 400 on search.wikimedia.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430502 
(https://phabricator.wikimedia.org/T179266) (owner: 10EBernhardson) [20:06:07] (03PS3) 10EBernhardson: Forward response codes >= 400 on search.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430502 (https://phabricator.wikimedia.org/T179266) [20:08:23] 10Operations, 10monitoring: Icinga SMART check returns OK when not getting data - https://phabricator.wikimedia.org/T193793#4179842 (10Dzahn) [20:08:44] 10Operations, 10monitoring: Icinga SMART check returns OK when not getting data - https://phabricator.wikimedia.org/T193793#4179853 (10Dzahn) {F17621797} [20:09:06] 10Operations, 10monitoring: Icinga SMART check returns OK when not getting data - https://phabricator.wikimedia.org/T193793#4179857 (10Dzahn) [20:09:13] (03CR) 10Krinkle: [C: 031] Forward response codes >= 400 on search.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430502 (https://phabricator.wikimedia.org/T179266) (owner: 10EBernhardson) [20:09:30] !log krinkle@tin Synchronized php-1.32.0-wmf.1/extensions/NavigationTiming: I1e7f091cba1 (duration: 01m 18s) [20:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:39] 10Operations, 10monitoring: Icinga SMART check returns OK when not getting data - https://phabricator.wikimedia.org/T193793#4179842 (10Dzahn) p:05Triage>03Normal [20:11:01] * Krinkle releases scap hammer [20:11:44] (03CR) 10EBernhardson: "with new swat rules this needs to be split. Otherwise it looks ready to deploy?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse) [20:17:02] (03PS1) 10Dzahn: mwmaint1001: add mediawiki-maintenance role [puppet] - 10https://gerrit.wikimedia.org/r/430674 (https://phabricator.wikimedia.org/T192092) [20:21:03] 10Operations, 10hardware-requests, 10Patch-For-Review: request to assign spare systems as terbium equivalent - https://phabricator.wikimedia.org/T192185#4179941 (10Dzahn) wmf6936 (mw1297) assigned and renamed to mwmaint1001 i renamed in racktables and left a comment there too racktables object 3003 https:... [20:21:28] 10Operations, 10hardware-requests, 10Patch-For-Review: request to assign spare systems as terbium equivalent - https://phabricator.wikimedia.org/T192185#4179942 (10Dzahn) a:05Dzahn>03RobH [20:22:49] 10Operations, 10hardware-requests, 10Patch-For-Review: request to assign spare systems as terbium equivalent - https://phabricator.wikimedia.org/T192185#4131143 (10Dzahn) P.S. I also added the "mwmaint" prefix on https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions [20:24:11] (03CR) 10Dzahn: [C: 032] mwmaint1001: add mediawiki-maintenance role [puppet] - 10https://gerrit.wikimedia.org/r/430674 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [20:28:42] 10Operations, 10Patch-For-Review: setup replacement for terbium (maintenance_server) on stretch - https://phabricator.wikimedia.org/T192092#4179971 (10RobH) [20:28:45] 10Operations, 10hardware-requests, 10Patch-For-Review: request to assign spare systems as terbium equivalent - https://phabricator.wikimedia.org/T192185#4179970 (10RobH) 05Open>03Resolved [20:29:51] 10Operations, 10ops-eqiad: change hostname label for mw1297 to mwmaint1001 - https://phabricator.wikimedia.org/T193798#4179983 (10RobH) p:05Triage>03Low [20:30:07] !log mw1297 - reinstalling as mwmaint1001 [20:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:41] !log started 
peering/transit with Deutsche Telekom on cr2-esams [20:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:41] (03PS1) 10Dzahn: Revert "mwmaint1001: add mediawiki-maintenance role" [puppet] - 10https://gerrit.wikimedia.org/r/430796 [20:45:59] (03PS1) 10EBernhardson: Promote MLR models from AB test to prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430797 (https://phabricator.wikimedia.org/T187148) [20:46:03] (03CR) 10Dzahn: [C: 032] "first run with just base stuff.. then add the role again" [puppet] - 10https://gerrit.wikimedia.org/r/430796 (owner: 10Dzahn) [20:48:21] * AaronSchulz takes the hammer [20:50:16] 10Operations, 10CirrusSearch, 10Discovery, 10Elasticsearch, 10Discovery-Search (Current work): Alert when elasticsearch writes are frozen for too long - https://phabricator.wikimedia.org/T193605#4180091 (10EBernhardson) [21:00:57] (03PS4) 10Zhuyifei1999: Load project name dynamically from /etc/wmcs-project [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/430647 (https://phabricator.wikimedia.org/T190893) [21:01:37] (03CR) 10jerkins-bot: [V: 04-1] Load project name dynamically from /etc/wmcs-project [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/430647 (https://phabricator.wikimedia.org/T190893) (owner: 10Zhuyifei1999) [21:02:19] PROBLEM - mediawiki-installation DSH group on mw2188 is CRITICAL: Host mw2188 is not in mediawiki-installation dsh group [21:02:19] PROBLEM - puppet last run on mw2185 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:02:19] PROBLEM - puppet last run on mw2186 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:02:19] PROBLEM - Disk space on mw2188 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[21:03:49] PROBLEM - Apache HTTP on mw2185 is CRITICAL: connect to address 10.192.32.73 and port 80: Connection refused [21:03:49] PROBLEM - Apache HTTP on mw2186 is CRITICAL: connect to address 10.192.32.74 and port 80: Connection refused [21:03:58] PROBLEM - HHVM processes on mw2188 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:03:58] PROBLEM - nutcracker port on mw2188 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:05:29] PROBLEM - HHVM rendering on mw2188 is CRITICAL: connect to address 10.192.32.76 and port 80: Connection refused [21:05:29] PROBLEM - Check size of conntrack table on mw2186 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:05:29] PROBLEM - Check size of conntrack table on mw2185 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:05:29] PROBLEM - MD RAID on mw2185 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:05:29] PROBLEM - MD RAID on mw2186 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:05:29] PROBLEM - nutcracker process on mw2188 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:06:19] handling it [21:07:09] PROBLEM - Check systemd state on mw2186 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:07:09] PROBLEM - puppet last run on mw2188 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[21:08:36] !log aaron@tin Started scap: Deploy db9acea7eb1c717104691857d1b3ce73c2e18847 (bug T193668) [21:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:41] T193668: Transaction should be in the callback stage (not 'cursory') - https://phabricator.wikimedia.org/T193668 [21:16:41] AaronSchulz tyvm [21:18:27] (03PS1) 10Ottomata: eventlogging service logstash with gelf [puppet] - 10https://gerrit.wikimedia.org/r/430808 (https://phabricator.wikimedia.org/T193230) [21:18:31] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2184.codfw.wmnet [21:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:34] (03PS2) 10Ottomata: eventlogging service logstash with gelf [puppet] - 10https://gerrit.wikimedia.org/r/430808 (https://phabricator.wikimedia.org/T193230) [21:21:04] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2183.codfw.wmnet [21:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:44] (03PS1) 10Dzahn: Revert "Revert "mwmaint1001: add mediawiki-maintenance role"" [puppet] - 10https://gerrit.wikimedia.org/r/430812 [21:32:04] gilles: Once scap is done I'll roll out the navtiming changes. 
[21:33:50] 10Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth: Global renames get stuck at ty.wikipedia - https://phabricator.wikimedia.org/T193790#4180244 (10Stryn) [21:36:08] 10Operations, 10GlobalRename, 10MediaWiki-JobQueue, 10MediaWiki-extensions-CentralAuth, 10Wikimedia-log-errors: Global renames get stuck at ty.wikipedia - https://phabricator.wikimedia.org/T193790#4180252 (10alanajjar) [21:37:36] 10Operations, 10GlobalRename, 10MediaWiki-JobQueue, 10MediaWiki-extensions-CentralAuth, and 2 others: Global renames get stuck at ty.wikipedia - https://phabricator.wikimedia.org/T193790#4179774 (10alanajjar) [21:41:45] Krinkle: Once scap is done I'd like to resume the train [21:52:00] (03PS1) 10Dzahn: mw-maintenance: add PHP7/stretch support [puppet] - 10https://gerrit.wikimedia.org/r/430817 (https://phabricator.wikimedia.org/T192092) [21:52:33] (03CR) 10Paladox: [C: 031] mw-maintenance: add PHP7/stretch support [puppet] - 10https://gerrit.wikimedia.org/r/430817 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [21:52:41] (03CR) 10jerkins-bot: [V: 04-1] mw-maintenance: add PHP7/stretch support [puppet] - 10https://gerrit.wikimedia.org/r/430817 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [21:53:43] (03CR) 10Dzahn: "profile 'profile::mediawiki::maintenance' includes non-profile class mediawiki::packages::php7 booohooo :p" [puppet] - 10https://gerrit.wikimedia.org/r/430817 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [21:59:31] (03PS2) 10Dzahn: mw-maintenance: add PHP7/stretch support [puppet] - 10https://gerrit.wikimedia.org/r/430817 (https://phabricator.wikimedia.org/T192092) [21:59:58] (03PS1) 10Thcipriani: Scap: bump version to 3.8.1-1 [puppet] - 10https://gerrit.wikimedia.org/r/430820 (https://phabricator.wikimedia.org/T127762) [22:00:13] (03CR) 10jerkins-bot: [V: 04-1] mw-maintenance: add PHP7/stretch support [puppet] - 10https://gerrit.wikimedia.org/r/430817 (https://phabricator.wikimedia.org/T192092) 
(owner: 10Dzahn) [22:05:01] RECOVERY - mediawiki-installation DSH group on mw2184 is OK: OK [22:05:11] RECOVERY - mediawiki-installation DSH group on mw2183 is OK: OK [22:06:57] (03PS3) 10Dzahn: mw-maintenance: add PHP7/stretch support [puppet] - 10https://gerrit.wikimedia.org/r/430817 (https://phabricator.wikimedia.org/T192092) [22:06:59] no_justification: Aye, that's fine. But there's one thing though. Which is that navtiming isn't in a good place in wmf.2 [22:07:20] I resolved the blocker thinking it was new in wmf.2, but it's in wmf.1, which I've got a backport ready for now. [22:07:29] That should be in before it goes to all wikis. [22:07:33] If possible :) [22:07:49] (03CR) 10jerkins-bot: [V: 04-1] mw-maintenance: add PHP7/stretch support [puppet] - 10https://gerrit.wikimedia.org/r/430817 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [22:07:52] Anyway, it's been about 7 days, so I can wait until after the train, that's fine too. [22:10:44] (03CR) 10Smalyshev: [C: 031] "> Patch Set 15:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse) [22:12:31] * AaronSchulz wonders how long scap should takr [22:12:34] *take [22:12:43] no_justification: Details at https://phabricator.wikimedia.org/T193570#4180377 [22:12:55] it's been 45min? [22:13:15] Go ahead [22:13:19] First Krinkle [22:13:23] Yes, l10n takes 30+ mins [22:13:29] That's full scap for you [22:13:41] Except when it's fast :) [22:13:49] htop still shows lots of stuff going on [22:13:50] It hasn't taken this long since... 2012? [22:14:13] btw, have we applied the Jit fix to prod use of scap/hhvm/l10n rebuild yet? [22:14:20] That seemed to improve things from the task I checked. 
[22:14:27] from 30+ to 16min or so [22:14:27] We also haven't run l10n rebuild on hhvm [22:14:38] I'll remember to use screen next time...maybe grab a sandwich ;) [22:14:59] It wasn't fast on tuesday for initial wmf.2 bootstrap [22:15:37] 'haven't run l10n rebuild on hhvm ' - not sure I follow, I assume we've scapped at least once since the (last) php5>hhvm switch? [22:16:45] I mean we hadn't been running on it for years so that's why it's fast [22:16:49] But now we do again, so it's slow [22:16:51] :) [22:17:03] And I have nfc if the Jit thing has been applied in prod [22:19:04] 10Operations, 10Mail: E-mail for people in different OIT LDAP object unit - https://phabricator.wikimedia.org/T159750#4180392 (10bbogaert) Hi @herron , I have verified we can make Organizational Units under people without affecting mail flow, so this good! I just want to make sure it does not affect any othe... [22:22:37] no_justification: hm.. I don't recall specifics but I don't think we ran mwscript on hhvm very long, did we? Anyhow, my main memories from scap being slow predate hhvm overall. [22:23:01] Like, you know, our scap used to take 1hr+, and that was on php5 [22:23:14] and we ran it a few times a year [22:23:46] Those are unrelated. Since swapping to hhvm the l10n stage of scap has been slower than php5 [22:23:58] (scap is generally faster. This is a regression) [22:25:20] Right, these past few weeks only. [22:25:27] because hhvm on cli is slow. [22:26:21] I was saying, hhvm's sluggish performance on CLI has basically brought 'scap' (effectively) back to how it was in 2011 before y'all made it more awesome. [22:28:04] Yes, I agree. I thought you were confused as to why it's slow ;-) [22:28:10] (03CR) 10Krinkle: idwikimedia: initial configuration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429385 (https://phabricator.wikimedia.org/T192726) (owner: 10MarcoAurelio) [22:28:32] sync-masters! 
[22:31:00] https://wikitech.wikimedia.org/w/index.php?title=Wikimedia_binaries&oldid=46762#Mysterious_or_obsolete_things? [22:40:43] (03CR) 10Dzahn: "why does it still vote -1 even after adding lint:ignore ?" [puppet] - 10https://gerrit.wikimedia.org/r/430817 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [22:40:54] 10Operations, 10decommission: Decommission notebook1001 - https://phabricator.wikimedia.org/T192103#4180486 (10RobH) [22:40:55] (03CR) 10BryanDavis: [C: 031] wiki replicas: provide backward compatibility for MCR changes [puppet] - 10https://gerrit.wikimedia.org/r/430024 (https://phabricator.wikimedia.org/T174047) (owner: 10Bstorm) [22:41:13] 10Operations, 10ops-eqiad, 10decommission: Decommision poolcounter1002 - https://phabricator.wikimedia.org/T193025#4180489 (10RobH) [22:41:27] 10Operations, 10ops-eqiad, 10decommission, 10Performance-Team (Radar): Decommission hafnium - https://phabricator.wikimedia.org/T193420#4180490 (10RobH) [22:41:28] no_justification: Meh, well, seems like it's not unusually slow yet. I mean, for hhvm. I thought it was gonna be 35-40min but I guess that was just an odd sample, besides that's only l10n rebuild itself. Yours from May 1 was 1h and 15min. [22:42:08] And it's been... 1h and 30min now. 
[22:42:09] 10Operations, 10ops-esams, 10decommission: Decommission esams ms-fe / ms-be - https://phabricator.wikimedia.org/T169518#4180492 (10RobH) [22:42:14] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission graphite1002 - https://phabricator.wikimedia.org/T187190#4180493 (10RobH) [22:42:53] 10Operations, 10DBA, 10decommission, 10Goal: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4180499 (10RobH) [22:48:08] (03PS7) 10Bstorm: wiki replicas: provide backward compatibility for MCR changes [puppet] - 10https://gerrit.wikimedia.org/r/430024 (https://phabricator.wikimedia.org/T174047) [22:49:41] 10Operations, 10ops-eqiad, 10decommission: decom spare server lawrencium/WMF3542 - https://phabricator.wikimedia.org/T191360#4180526 (10RobH) [22:49:44] 10Operations, 10decommission: decom bast1001 - https://phabricator.wikimedia.org/T191153#4180527 (10RobH) [22:49:50] 10Operations, 10Cloud-Services, 10DC-Ops, 10decommission: decom californium - https://phabricator.wikimedia.org/T189921#4180529 (10RobH) [22:49:52] 10Operations, 10decommission: Reclaim/Decommission Silver.wikimedia.org - https://phabricator.wikimedia.org/T190085#4180530 (10RobH) [22:49:54] (03CR) 10Bstorm: "I'm going to merge this as is and consider future refactoring as long as it works well :)" [puppet] - 10https://gerrit.wikimedia.org/r/430024 (https://phabricator.wikimedia.org/T174047) (owner: 10Bstorm) [22:50:06] (03CR) 10Bstorm: [C: 032] wiki replicas: provide backward compatibility for MCR changes [puppet] - 10https://gerrit.wikimedia.org/r/430024 (https://phabricator.wikimedia.org/T174047) (owner: 10Bstorm) [22:50:37] (03PS1) 10Chad: Scap plugins: Add __init__.py so python treats this as a package [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430823 [22:52:27] !log aaron@tin Finished scap: Deploy db9acea7eb1c717104691857d1b3ce73c2e18847 (bug T193668) (duration: 103m 50s) [22:52:32] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:32] T193668: Transaction should be in the callback stage (not 'cursory') - https://phabricator.wikimedia.org/T193668 [22:56:13] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission uranium/WMF3128 - https://phabricator.wikimedia.org/T191348#4180536 (10RobH) [22:56:25] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom vanadium/WMF3291 - https://phabricator.wikimedia.org/T191351#4180538 (10RobH) [22:56:32] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom zinc/WMF3298 - https://phabricator.wikimedia.org/T191352#4180540 (10RobH) [22:57:10] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom niobium/WMF3428 - https://phabricator.wikimedia.org/T191355#4180541 (10RobH) [22:57:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom silver/WMF3434 - https://phabricator.wikimedia.org/T191357#4180542 (10RobH) [22:57:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom promethium/WMF3571 - https://phabricator.wikimedia.org/T191362#4180543 (10RobH) [22:57:32] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom spare server osmium/wmf4546 - https://phabricator.wikimedia.org/T191364#4180544 (10RobH) [22:57:48] 10Operations, 10ops-codfw, 10decommission: Decommission mw2017 and mw2099 - https://phabricator.wikimedia.org/T187467#4180545 (10RobH) [22:57:52] 10Operations, 10ops-eqiad, 10decommission, 10User-Joe: Decommission mw1201-mw1220 - https://phabricator.wikimedia.org/T185004#4180546 (10RobH) [22:58:12] 10Operations, 10decommission: Decommission old server wmf4077 - https://phabricator.wikimedia.org/T190086#4180547 (10RobH) [22:58:15] 10Operations, 10ops-eqiad, 10decommission: Decommission eventlog1001 - https://phabricator.wikimedia.org/T189566#4180548 (10RobH) [22:58:32] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission old and unused/spare servers in codfw - https://phabricator.wikimedia.org/T187474#4180549 (10RobH) [22:58:43] 
10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1034 - https://phabricator.wikimedia.org/T182556#4180550 (10RobH) [22:58:59] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1029 and db1031 - https://phabricator.wikimedia.org/T184054#4180551 (10RobH) [22:59:12] 10Operations, 10ops-esams, 10decommission: Decommission cp300[3456] - https://phabricator.wikimedia.org/T167376#4180552 (10RobH) [22:59:25] I'm landing the navtimg commits [22:59:29] will take a few minutes for Jenkins [22:59:37] AaronSchulz: Are you done / have verified? [22:59:39] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission old and unused/spare servers in eqiad - https://phabricator.wikimedia.org/T187473#4180555 (10RobH) [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180503T2300). Please do the needful. [23:00:04] AndyRussG: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:00:09] 10Operations, 10ops-eqiad, 10DBA, 10decommission, 10Patch-For-Review: Decommission db1039 - https://phabricator.wikimedia.org/T184262#4180561 (10RobH) [23:00:15] 10Operations, 10Analytics, 10decommission, 10Patch-For-Review: Decommission stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T173097#4180562 (10RobH) [23:00:33] 10Operations, 10ops-eqiad, 10decommission: Decommission ocg1001-3 - https://phabricator.wikimedia.org/T177958#4180563 (10RobH) [23:00:38] 10Operations, 10ops-eqiad, 10decommission, 10cloud-services-team (Kanban): Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T184832#4180564 (10RobH) [23:00:55] 10Operations, 10ops-eqiad, 10decommission, 10netops: unrack/decom pfw1-eqiad and pfw2-eqiad - https://phabricator.wikimedia.org/T183390#4180565 (10RobH) [23:01:37] 10Operations, 10ops-eqiad, 10Packaging, 10decommission: Decommission host copper.eqiad.wmnet - https://phabricator.wikimedia.org/T176957#4180570 (10RobH) [23:02:05] 10Operations, 10ops-ulsfo, 10Traffic, 10decommission, 10Patch-For-Review: Decom cp4005-8,13-16 (8 nodes) - https://phabricator.wikimedia.org/T176366#4180571 (10RobH) [23:02:19] 10Operations, 10ops-ulsfo, 10decommission: Decommission cp400[1-4] - https://phabricator.wikimedia.org/T169020#4180572 (10RobH) [23:02:28] 10Operations, 10ops-ulsfo, 10decommission: Decommission cp4011, cp4012, cp4019, cp4020 - https://phabricator.wikimedia.org/T167377#4180573 (10RobH) [23:02:41] 10Operations, 10ops-esams, 10decommission: Decommission bast3001 - https://phabricator.wikimedia.org/T159480#4180575 (10RobH) [23:02:47] 10Operations, 10ops-esams, 10DC-Ops, 10decommission: Decomission amssq31-62 (32 hosts) - https://phabricator.wikimedia.org/T95742#4180577 (10RobH) [23:03:06] 10Operations, 10ops-esams, 10decommission: decom cp3011-22 (12 machines) - https://phabricator.wikimedia.org/T130883#4180578 (10RobH) [23:03:17] 10Operations, 10DBA, 10decommission, 10Goal: reclaim and return all cisco servers - 
https://phabricator.wikimedia.org/T128821#4180580 (RobH) [23:03:38] helloooooooooooooo [23:04:19] Here for evening SWAT.... [23:05:23] AndyRussG: Bit behind schedule, train is happening first, I think. [23:06:12] I'll start with the wmf.1 patch for now, awaiting Aaron's return. [23:09:32] !log krinkle@tin Synchronized php-1.32.0-wmf.1/extensions/NavigationTiming/: If293a156cac / T193570 (duration: 01m 17s) [23:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:36] T193570: Why did first paint/fully loaded time drop (in a good way!) on mobile? - https://phabricator.wikimedia.org/T193570 [23:10:42] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2188.codfw.wmnet [23:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:57] Krinkle: ah okok I'm around for the next few hours, if someone would like to ping whenever then... :) [23:11:03] thx!!! [23:12:40] !log krinkle@tin Synchronized php-1.32.0-wmf.2/extensions/NavigationTiming/: If293a156ca / T193570 (duration: 01m 16s) [23:12:43] no_justification: all yours now.
[23:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:35] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2185.codfw.wmnet [23:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:06] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2186.codfw.wmnet [23:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:17] (PS1) Chad: Various pylint fixes to scap plugins [mediawiki-config] - https://gerrit.wikimedia.org/r/430825 [23:23:36] (PS2) Chad: Scap plugins: Add __init__.py so python treats this as a package [mediawiki-config] - https://gerrit.wikimedia.org/r/430823 [23:26:49] (CR) Chad: [C: 2] Various pylint fixes to scap plugins [mediawiki-config] - https://gerrit.wikimedia.org/r/430825 (owner: Chad) [23:28:17] (Merged) jenkins-bot: Various pylint fixes to scap plugins [mediawiki-config] - https://gerrit.wikimedia.org/r/430825 (owner: Chad) [23:30:55] (PS3) Chad: Scap plugins: Add __init__.py so python treats this as a package [mediawiki-config] - https://gerrit.wikimedia.org/r/430823 [23:31:03] (CR) Chad: [C: 2] Scap plugins: Add __init__.py so python treats this as a package [mediawiki-config] - https://gerrit.wikimedia.org/r/430823 (owner: Chad) [23:31:37] Krinkle: that change was the only thing [23:31:45] (PS1) Chad: group1 to wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/430827 [23:31:53] (CR) Chad: [C: 2] group1 to wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/430827 (owner: Chad) [23:33:58] (PS1) Niharika29: Up the config temporarily to prevent loginnotify fail attempt emails [mediawiki-config] - https://gerrit.wikimedia.org/r/430829 (https://phabricator.wikimedia.org/T193762) [23:34:06] (Merged) jenkins-bot: Scap plugins: Add __init__.py so python treats this as a package [mediawiki-config] -
https://gerrit.wikimedia.org/r/430823 (owner: Chad) [23:34:08] (Merged) jenkins-bot: group1 to wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/430827 (owner: Chad) [23:35:56] (CR) Jforrester: [C: 1] "Seems sane." [mediawiki-config] - https://gerrit.wikimedia.org/r/430829 (https://phabricator.wikimedia.org/T193762) (owner: Niharika29) [23:39:07] (CR) Brian Wolff: [C: 1] Up the config temporarily to prevent loginnotify fail attempt emails [mediawiki-config] - https://gerrit.wikimedia.org/r/430829 (https://phabricator.wikimedia.org/T193762) (owner: Niharika29) [23:39:32] I can deploy that. [23:39:48] James_F: Good to deploy? [23:40:18] Niharika: no_justification is just doing the train; after that? [23:40:39] Oh, missed that. This is the SWAT window, I thought. [23:40:46] Operations, hardware-requests: Find and rack 2 EX4200s in rack c1-eqiad - https://phabricator.wikimedia.org/T139752#4180662 (RobH) Open>declined not sure what we need this for any longer, declining [23:40:50] Yup, later. [23:41:39] It is, but things are running late. [23:42:12] James_F: Niharika: should I wait around to see if SWAT happens [23:42:16] ? [23:42:37] (I can be around all evening, just should take about 20 minutes out to walk an anxious doggo...) [23:43:11] AndyRussG: Nothing will happen re. SWAT in the next 20 minutes, at least. [23:43:13] !log demon@tin Synchronized scap/plugins/: cleanup, no-op (duration: 01m 17s) [23:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:29] Operations, hardware-requests: Replacement hardware for cumin masters - https://phabricator.wikimedia.org/T178392#4180666 (RobH) Open>stalled These are being tracked for replacement on the misc system order, T189317. [23:43:42] James_F: okok thanks!
[23:43:44] :) [23:43:51] Operations, hardware-requests: eqiad/codfw: (4)+(4) hardware access request for videoscalers - https://phabricator.wikimedia.org/T188075#4180671 (RobH) stalled>Resolved [23:44:28] Operations, ops-esams, Epic: SRE 2017-18 Q3 goal Cleanup esams and refresh servers and infrastructure (tracking) - https://phabricator.wikimedia.org/T184061#4180678 (RobH) [23:44:30] Operations, ops-esams, Traffic, hardware-requests: Procure and install LVS and miscellaneous servers - https://phabricator.wikimedia.org/T184068#4180674 (RobH) Open>Resolved This is now being tracked via the procurement task, T183413. [23:49:57] Operations, Security-Team: Thousands of failed login attempts (wrong password) - https://phabricator.wikimedia.org/T193769#4179231 (Paladox) [23:56:02] !log demon@tin Synchronized php: symlink bump (duration: 01m 16s) [23:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:11] Operations, Security-Team: Thousands of failed login attempts (wrong password) - https://phabricator.wikimedia.org/T193769#4180711 (Xaosflux) [23:59:23] Operations, ops-esams, DC-Ops, netops, procurement: esams: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176337#4180716 (RobH) Open>Resolved racktables now has this as https://racktables.wikimedia.org/index.php?page=object&tab=default&object_id=3546