[00:00:08] And apparently 300 more users by running initSiteStats
[00:00:36] Lmfao.
[00:00:58] The semantic stuff must artificially inflate content pages
[00:01:15] Actual pages went up by 70
[00:01:26] bd808: What's the list?
[00:04:45] Reedy: https://phabricator.wikimedia.org/P5404 -- I'm working from the top down if you want to start from the bottom up.
[00:06:17] gawd we have a lot of content duplication
[00:12:49] (PS1) Dzahn: icinga: add another contact, use local var in .erb [puppet] - https://gerrit.wikimedia.org/r/352731
[00:14:06] (PS2) Dzahn: icinga: back to multi-level hierarchy, add another contact, use local var in .erb [puppet] - https://gerrit.wikimedia.org/r/352731
[00:16:34] (PS3) Dzahn: icinga: back to multi-level hierarchy, add another contact, use local var in .erb [puppet] - https://gerrit.wikimedia.org/r/352731
[00:18:01] bd808: want to grep for :: too? In case there's any weird property stuff we missed
[00:19:22] w00t a new grep for "FormEdit" is empty
[00:20:00] Reedy: "::" matches 698 pages still...
[00:20:13] oh aye?
[00:20:24] I'm just running runJobs as I just fixed another...
[00:20:29] Might be worth trying again when it finishes
[00:21:17] It's definitely removing links
[00:21:25] looks like Template:Server was one of the culprits.
[00:24:13] Reedy: heh. many matches are from talk page indenting
[00:24:36] (PS4) Dzahn: icinga: back to multi-level hierarchy, add another contact, use local var in .erb [puppet] - https://gerrit.wikimedia.org/r/352731
[00:25:14] sounds about right
[00:26:34] this grep looks more promising -- https://phabricator.wikimedia.org/P5404#29024
[00:29:31] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Documentation
[00:29:35] [0ccc496a8202bc464e31f667] 2017-05-09 00:29:26: Fatal exception of type "Wikimedia\Rdbms\DBQueryError"
[00:29:38] It broke
[00:29:49] poop
[00:30:12] edit still works
[00:30:25] I see an #ask on https://wikitech.wikimedia.org/w/index.php?title=Help:Labs-vagrant/Hosts&action=edit
[00:30:50] yeah. i'll look for #asks next. you can nuke that one
[00:31:49] A few :: are puppet false +ve
[00:32:35] Access denied for user 'wikiuser'@'208.80.154.136' to database 'labswiki'
[00:32:46] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Documentation works again
[00:32:46] Query: CREATE TEMPORARY TABLE `t5`( id INT UNSIGNED KEY ) ENGINE=MEMORY
[00:32:55] Function: SMW::executeQueries
[00:33:03] just remove all that crap
[00:36:09] we're trying :P
[00:36:58] * MaxSem applauds SMW for using descriptive table names
[00:38:16] something something clean code something something
[00:39:04] heathen unholy magick
[00:39:14] bd808: Want to re-run the :: please?
[00:39:29] Counting number of articles...4768
[00:39:31] Much lols
[00:40:07] Reedy: https://phabricator.wikimedia.org/P5404#29026
[00:40:18] and #ask -- https://phabricator.wikimedia.org/P5404#29025
[00:41:06] Numerous of those ask will disappear when :: is done
[00:41:29] TRUNCATE TABLE page, anyone?
[00:46:47] Counting number of articles...2493
[00:48:41] (CR) Dzahn: [C: 2] icinga: back to multi-level hierarchy, add another contact, use local var in .erb [puppet] - https://gerrit.wikimedia.org/r/352731 (owner: Dzahn)
[00:53:12] Reedy: w00t -- https://phabricator.wikimedia.org/P5404#29030
[00:53:16] I think those are all false matches
[01:01:58] What are the leading : in page names?
[01:01:59] Reedy: I think we beat it -- https://phabricator.wikimedia.org/P5404#29032
[01:02:06] SMW artefacts to cirrus?
[01:02:11] that's an artifact of my hack
[01:02:22] it means main namespace
[01:02:39] I listed as namespace_name:title from cirrus
[01:03:58] Does that mean we can disable the extensions? :P
[01:04:46] we should really try it, but maybe not right now
[01:04:58] * bd808 is hungry and needs a bio break
[01:11:35] (PS1) Dzahn: icinga: fix syntax for variable variables in erb [puppet] - https://gerrit.wikimedia.org/r/352736
[01:11:49] (CR) jerkins-bot: [V: -1] icinga: fix syntax for variable variables in erb [puppet] - https://gerrit.wikimedia.org/r/352736 (owner: Dzahn)
[01:12:13] (PS2) Dzahn: icinga: fix syntax for variable variables in erb [puppet] - https://gerrit.wikimedia.org/r/352736
[01:13:10] (CR) jerkins-bot: [V: -1] icinga: fix syntax for variable variables in erb [puppet] - https://gerrit.wikimedia.org/r/352736 (owner: Dzahn)
[01:22:14] (PS3) Dzahn: icinga: fix syntax for variable variables in erb [puppet] - https://gerrit.wikimedia.org/r/352736
[01:23:12] (CR) jerkins-bot: [V: -1] icinga: fix syntax for variable variables in erb [puppet] - https://gerrit.wikimedia.org/r/352736 (owner: Dzahn)
[01:24:16] (PS4) Dzahn: icinga: fix syntax for variable variables in erb [puppet] - https://gerrit.wikimedia.org/r/352736
[01:26:41] (CR) Dzahn: [C: 2] icinga: fix syntax for variable variables in erb [puppet] - https://gerrit.wikimedia.org/r/352736 (owner: Dzahn)
[01:38:22] (PS1) Dzahn: icinga: stop trying to use variable variables in erb [puppet] - https://gerrit.wikimedia.org/r/352738
[01:39:22] Operations, Phabricator, Release-Engineering-Team: reinstall iridium (phabricator) as phab1001 with jessie - https://phabricator.wikimedia.org/T152129#3246213 (Dzahn) This will be unblocked after T164810.
[01:39:51] (CR) Dzahn: [C: 2] icinga: stop trying to use variable variables in erb [puppet] - https://gerrit.wikimedia.org/r/352738 (owner: Dzahn)
[02:21:48] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.21) (duration: 08m 17s)
[02:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:24:42] PROBLEM - nova-compute process on labvirt1009 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute
[02:25:43] RECOVERY - nova-compute process on labvirt1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute
[02:27:46] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue May 9 02:27:46 UTC 2017 (duration 5m 58s)
[02:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:00:39] (PS1) Dzahn: icinga: comment out contacts-new.cfg [puppet] - https://gerrit.wikimedia.org/r/352742
[03:01:14] (CR) Dzahn: [C: 2] icinga: comment out contacts-new.cfg [puppet] - https://gerrit.wikimedia.org/r/352742 (owner: Dzahn)
[03:04:52] RECOVERY - puppet last run on tegmen is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[05:54:39] !log Deploy alter table on wikidatawiki.wb_terms on codfw master db2023 - https://phabricator.wikimedia.org/T162539 - https://phabricator.wikimedia.org/T163548
[05:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:53] (PS2) Marostegui: db-codfw.php: Repool db2038 [mediawiki-config] - https://gerrit.wikimedia.org/r/352607 (https://phabricator.wikimedia.org/T163548)
[05:58:43] (CR) Marostegui: [C: 2] db-codfw.php: Repool db2038 [mediawiki-config] - https://gerrit.wikimedia.org/r/352607 (https://phabricator.wikimedia.org/T163548) (owner: Marostegui)
[06:00:01] (Merged) jenkins-bot: db-codfw.php: Repool db2038 [mediawiki-config] - https://gerrit.wikimedia.org/r/352607 (https://phabricator.wikimedia.org/T163548) (owner: Marostegui)
[06:00:11] (CR) jenkins-bot: db-codfw.php: Repool db2038 [mediawiki-config] - https://gerrit.wikimedia.org/r/352607 (https://phabricator.wikimedia.org/T163548) (owner: Marostegui)
[06:00:53] (PS2) Muehlenhoff: dumps: Explicitly declare package dependency for rpcbind [puppet] - https://gerrit.wikimedia.org/r/352123 (https://phabricator.wikimedia.org/T106477)
[06:01:53] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2038 - T162539 T163548 (duration: 00m 41s)
[06:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:03] T162539: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539
[06:02:03] T163548: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" on "wb_terms" table - https://phabricator.wikimedia.org/T163548
[06:09:42] (Abandoned) Muehlenhoff: dumps: Explicitly declare package dependency for rpcbind [puppet] - https://gerrit.wikimedia.org/r/352123 (https://phabricator.wikimedia.org/T106477) (owner: Muehlenhoff)
[06:10:11] (Abandoned) Muehlenhoff: stat: Explicitly declare package dependency for rpcbind [puppet] - https://gerrit.wikimedia.org/r/352138 (https://phabricator.wikimedia.org/T106477) (owner: Muehlenhoff)
[06:11:24] Note that the link between eqiad and codfw can go down at anytime as zayo is doing maintenance on it. Traffic should automatically route around it, let me know if any issue.
[06:20:32] RECOVERY - MariaDB Slave SQL: s3 on db2057 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:21:17] (PS1) Muehlenhoff: Drop cache/LVS NFS override [puppet] - https://gerrit.wikimedia.org/r/352748 (https://phabricator.wikimedia.org/T106477)
[06:36:37] (PS1) Muehlenhoff: Remove some dh_make cruft [debs/kubernetes] - https://gerrit.wikimedia.org/r/352749
[06:37:06] !log Run pt-table-checksum on s7.cawiki - T163190
[06:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:14] T163190: Run pt-table-checksum on s7 - https://phabricator.wikimedia.org/T163190
[06:38:56] (CR) Muehlenhoff: [C: 2] Remove some dh_make cruft [debs/kubernetes] - https://gerrit.wikimedia.org/r/352749 (owner: Muehlenhoff)
[06:48:30] (PS6) Marostegui: mariadb: grant user 'phstats' additional select on differential db [puppet] - https://gerrit.wikimedia.org/r/348565 (owner: Dzahn)
[06:51:31] (CR) Marostegui: [C: 2] mariadb: grant user 'phstats' additional select on differential db [puppet] - https://gerrit.wikimedia.org/r/348565 (owner: Dzahn)
[06:52:34] (CR) Marostegui: "Changes applied on m3 master db1043" [puppet] - https://gerrit.wikimedia.org/r/348565 (owner: Dzahn)
[07:01:47] <_joe_> !log installing the new version of python-service-checker across the fleet
[07:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:02] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 0.06 seconds
[07:08:14] (PS1) Marostegui: db-codfw.php: Repool es2019 [mediawiki-config] - https://gerrit.wikimedia.org/r/352752 (https://phabricator.wikimedia.org/T149526)
[07:09:47] Operations, ops-ulsfo: Degraded RAID on cp4007 - https://phabricator.wikimedia.org/T164701#3246408 (MoritzMuehlenhoff) Open>Invalid That was an NRPE connection error, the RAID is fine: jmm@cp4007:~$ sudo /usr/local/lib/nagios/plugins/check_raid md OK: Active: 2, Working: 2, Failed: 0, Spare: 0 OK
[07:09:48] (CR) Marostegui: [C: 2] db-codfw.php: Repool es2019 [mediawiki-config] - https://gerrit.wikimedia.org/r/352752 (https://phabricator.wikimedia.org/T149526) (owner: Marostegui)
[07:10:22] Operations, ops-ulsfo: Degraded RAID on cp4006 - https://phabricator.wikimedia.org/T164779#3246411 (MoritzMuehlenhoff) Open>Invalid That was an NRPE connection error, the RAID is fine: jmm@cp4006:~$ sudo /usr/local/lib/nagios/plugins/check_raid md OK: Active: 2, Working: 2, Failed: 0, Spare: 0 OK
[07:11:06] (Merged) jenkins-bot: db-codfw.php: Repool es2019 [mediawiki-config] - https://gerrit.wikimedia.org/r/352752 (https://phabricator.wikimedia.org/T149526) (owner: Marostegui)
[07:11:14] (CR) jenkins-bot: db-codfw.php: Repool es2019 [mediawiki-config] - https://gerrit.wikimedia.org/r/352752 (https://phabricator.wikimedia.org/T149526) (owner: Marostegui)
[07:11:26] !log reboot kafka1014 for kernel upgrades
[07:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:23] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool es2019 - T149526 (duration: 00m 39s)
[07:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:31] T149526: es2019 crashed again - https://phabricator.wikimedia.org/T149526
[07:12:47] Operations, Monitoring: Icinga check for ipv6 host reachability - https://phabricator.wikimedia.org/T163996#3246419 (MoritzMuehlenhoff) p:Triage>Normal
[07:13:03] Operations, Monitoring: check_hpssacli should report on battery failures and cache disabled - https://phabricator.wikimedia.org/T163998#3246421 (MoritzMuehlenhoff) p:Triage>Normal
[07:13:14] Operations, DBA, Patch-For-Review: es2019 crashed again - https://phabricator.wikimedia.org/T149526#3246422 (Marostegui) Open>Resolved As there were no differences found by compare.py I have repooled this host. Finger crossed so it doesn't crash anymore!
[07:19:52] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties
[07:20:12] PROBLEM - Check systemd state on kafka1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:20:17] Operations, Operations-Software-Development: MIgrate debdeploy to cumin - https://phabricator.wikimedia.org/T164817#3246425 (MoritzMuehlenhoff)
[07:20:40] * elukey hates mirror maker
[07:21:52] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties
[07:22:12] RECOVERY - Check systemd state on kafka1012 is OK: OK - running: The system is fully operational
[07:27:30] !log Disable replication codfw > eqiad on s6 - T147166 T130067
[07:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:39] T147166: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166
[07:27:39] T130067: Add wl_id to watchlist tables on production dbs - https://phabricator.wikimedia.org/T130067
[07:31:22] !log Stop replication at the same position on db1050 and db2028 - T147166 T130067
[07:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:54] !log removing unneeded rpcbind/nfs-common packages (T106477)
[07:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:01] T106477: Reduce rpcbind use - https://phabricator.wikimedia.org/T106477
[07:38:02] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties
[07:38:12] PROBLEM - Check systemd state on kafka1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:41:02] RECOVERY - Kafka MirrorMaker main-eqiad_to_analytics on kafka1012 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties
[07:41:12] RECOVERY - Check systemd state on kafka1012 is OK: OK - running: The system is fully operational
[07:42:03] Mirror Maker is not super resilient on kafka reboots and topic leader changes
[07:42:17] (this is mirroring main-eqiad to analytics)
[07:42:42] PROBLEM - MariaDB Slave Lag: s6 on db2046 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 605.12 seconds
[07:42:52] PROBLEM - MariaDB Slave Lag: s6 on db2067 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 620.99 seconds
[07:42:52] PROBLEM - MariaDB Slave Lag: s6 on db2060 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 621.04 seconds
[07:42:53] PROBLEM - MariaDB Slave Lag: s6 on db2039 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 621.12 seconds
[07:43:00] That is me
[07:43:02] PROBLEM - MariaDB Slave Lag: s6 on db2053 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 623.06 seconds
[07:43:10] Only silenced the master
[07:43:14] Going to silence those too, sorry
[07:44:20] Operations, Operations-Software-Development, Technical-Debt: Sunset our use of Salt - https://phabricator.wikimedia.org/T164780#3246464 (Aklapper)
[07:47:31] Operations, ops-eqiad, Patch-For-Review: rack and cable frlog1001 - https://phabricator.wikimedia.org/T163127#3246471 (ayounsi) FYI @Cmjohnson @Jgreen that host is flapping a lot: {P5406} ```> show interfaces ge-2/0/8 extensive | match error Carrier transitions: 1869``` This is usually the sign of...
[07:52:03] Operations: reprepro: Support for buildinfo files / dbgsym packages - https://phabricator.wikimedia.org/T164819#3246492 (MoritzMuehlenhoff) p:Triage>Normal
[07:53:28] !log uploaded kubernetes 1.4.2-6 for stretch-wikimedia to apt.wikimedia.org
[07:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:41] (PS2) Filippo Giunchedi: swift: add ratelimit middleware [puppet] - https://gerrit.wikimedia.org/r/350220 (https://phabricator.wikimedia.org/T162793)
[08:00:08] (CR) Filippo Giunchedi: [C: 2] swift: add ratelimit middleware [puppet] - https://gerrit.wikimedia.org/r/350220 (https://phabricator.wikimedia.org/T162793) (owner: Filippo Giunchedi)
[08:00:14] Operations, fundraising-tech-ops, netops: BGP session between pfw clusters flapping - https://phabricator.wikimedia.org/T164777#3246519 (ayounsi)
[08:05:15] !log roll-restart swift proxy for ratelimit middleware - T162793
[08:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:23] T162793: Rate limit swift operations - https://phabricator.wikimedia.org/T162793
[08:09:23] (PS1) Elukey: Set automatic systemd restart for kafka mirror maker [puppet] - https://gerrit.wikimedia.org/r/352759 (https://phabricator.wikimedia.org/T157705)
[08:09:38] moritzm: if you have time can you check if --^ makes sense
[08:09:39] ?
[08:10:15] elukey: sure, will have a look at about 10 mins
[08:10:48] super
[08:20:31] <_joe_> I have to admit: converting current roles to role/profile can be tiresome, but the results are usually much better than the starting point
[08:21:08] Operations, User-fgiunchedi: Reduce Swift technical debt - https://phabricator.wikimedia.org/T162792#3246561 (fgiunchedi)
[08:21:10] Operations, Patch-For-Review, User-fgiunchedi: Rate limit swift operations - https://phabricator.wikimedia.org/T162793#3246559 (fgiunchedi) Open>Resolved Completed, we're rate limiting non-mediawiki accounts both for container delete/create and object create/delete/update (for containers with...
[08:25:31] (CR) Hoo man: [C: 1] Do not rebuild or make dumps of wb_entity_per_page [puppet] - https://gerrit.wikimedia.org/r/352574 (https://phabricator.wikimedia.org/T140890) (owner: Ladsgroup)
[08:28:09] (PS1) Filippo Giunchedi: thumbstats: fix Hive INSERT syntax [software] - https://gerrit.wikimedia.org/r/352761
[08:29:25] (CR) Muehlenhoff: [C: 1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/352759 (https://phabricator.wikimedia.org/T157705) (owner: Elukey)
[08:29:30] (CR) Filippo Giunchedi: [C: 2] thumbstats: fix Hive INSERT syntax [software] - https://gerrit.wikimedia.org/r/352761 (owner: Filippo Giunchedi)
[08:30:12] a user reports that part of his contribution history seems to contain duplicate entries for his contributions...
[08:30:15] https://en.wikipedia.org/w/index.php?title=Special:Contributions/Chaheel_Riens&offset=20160823132427&limit=500&target=Chaheel+Riens
[08:35:37] Operations, Traffic: Explicitly limit varnishd transient storage - https://phabricator.wikimedia.org/T164768#3246599 (ema) p:Triage>Normal
[08:38:04] Operations, Traffic: Build nginx without image filter support - https://phabricator.wikimedia.org/T164456#3234359 (ema) p:Triage>Normal
[08:45:38] Operations, Labs: Stretch vs. Salt - https://phabricator.wikimedia.org/T164595#3246611 (MoritzMuehlenhoff) The debdeploy salt minion is quite self-contained, I wouldn't expect any problems with varying salt versions for debdeploy.
[08:53:29] Operations, Icinga: move icinga contacts file to public repo - https://phabricator.wikimedia.org/T164238#3246616 (MoritzMuehlenhoff) a:Dzahn Assigning to Daniel, since he's working on that already.
[08:57:52] PROBLEM - HP RAID on ms-be1021 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds.
[08:58:40] I thought the downtime was enough, clearly not
[09:00:54] (Abandoned) Muehlenhoff: Fix proxy configuration for docker image build [puppet] - https://gerrit.wikimedia.org/r/352545 (owner: Muehlenhoff)
[09:03:10] huh, did the host key of bast3002 change?
[09:03:37] oh, wait, terbium's changed
[09:03:37] nvm
[09:04:42] RECOVERY - MariaDB Slave Lag: s6 on db2046 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[09:04:52] RECOVERY - MariaDB Slave Lag: s6 on db2067 is OK: OK slave_sql_lag Replication lag: 0.06 seconds
[09:04:52] RECOVERY - MariaDB Slave Lag: s6 on db2060 is OK: OK slave_sql_lag Replication lag: 0.08 seconds
[09:04:52] RECOVERY - MariaDB Slave Lag: s6 on db2039 is OK: OK slave_sql_lag Replication lag: 0.12 seconds
[09:05:02] RECOVERY - MariaDB Slave Lag: s6 on db2053 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[09:07:22] RECOVERY - HP RAID on ms-be1021 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller
[09:07:24] !log Removed 2fa from global account Jcornelius (T164682)
[09:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:33] T164682: Reset Cornelius Kibelka's 2FA - https://phabricator.wikimedia.org/T164682
[09:11:50] (PS2) Gehel: kibana - cleanup of /opt/kibana has been done [puppet] - https://gerrit.wikimedia.org/r/352666 (https://phabricator.wikimedia.org/T161908)
[09:12:07] !log Run pt-table-checksum on s7.eswiki - T163190
[09:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:15] T163190: Run pt-table-checksum on s7 - https://phabricator.wikimedia.org/T163190
[09:13:52] (CR) Gehel: [C: 2] kibana - cleanup of /opt/kibana has been done [puppet] - https://gerrit.wikimedia.org/r/352666 (https://phabricator.wikimedia.org/T161908) (owner: Gehel)
[09:21:39] !log Disable replication codfw > eqiad on s4 - https://phabricator.wikimedia.org/T147166 https://phabricator.wikimedia.org/T130067
[09:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:29] (PS1) Alexandros Kosiaris: sshd_config: Increase MaxAuthTries [puppet] - https://gerrit.wikimedia.org/r/352766
[09:23:42] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 526.00 seconds
[09:25:54] I will check that, but I am sure it is the BBU again
[09:26:17] And it is: Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
[09:27:39] (CR) Muehlenhoff: "This is only needed for deployment targets? I'd rather have this grouped under a Match: conditional to only apply to internal hosts." [puppet] - https://gerrit.wikimedia.org/r/352766 (owner: Alexandros Kosiaris)
[09:27:48] Operations, ops-eqiad, DBA, Phabricator: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3246659 (Marostegui) This just happened again: ``` root@db1048:~# megacli -AdpBbuCmd -a0 BBU status for Adapter: 0 BatteryType: BBU Battery State: Unknown Battery backup ch...
[09:28:46] (PS1) Marostegui: db-eqiad.php: Depool db1097 [mediawiki-config] - https://gerrit.wikimedia.org/r/352767 (https://phabricator.wikimedia.org/T147166)
[09:30:42] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[09:31:52] (CR) Alexandros Kosiaris: "Yes, I probably can do a Match based on the deployment hosts IP address and only override this for connections from the 2 deployments host" [puppet] - https://gerrit.wikimedia.org/r/352766 (owner: Alexandros Kosiaris)
[09:34:03] (PS1) Marostegui: wmnet: Point m3 slave to codfw master [dns] - https://gerrit.wikimedia.org/r/352769 (https://phabricator.wikimedia.org/T160731)
[09:34:22] Operations, ops-eqiad, DBA, Phabricator, Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3246675 (Marostegui) And after the manual relearn it is back to normal state: ``` root@db1048:~# megacli -ldinfo -l0 -a0 | grep Policy Default Cache Policy...
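The db1048 cache-policy check quoted at 09:26:17 can be sketched as a small parser. This is a minimal sketch, assuming the `megacli -ldinfo` output contains a line starting with `Current Cache Policy:` in the form pasted in the channel; the function names `current_cache_policy` and `write_cache_disabled` are hypothetical, and the `WriteBack` value mentioned in the comment is an assumption about the healthy state, since the log truncates that output.

```python
import re


def current_cache_policy(megacli_output: str) -> list:
    """Return the comma-separated fields of the 'Current Cache Policy' line.

    Returns an empty list when no such line is present.
    """
    match = re.search(
        r"^\s*Current Cache Policy:\s*(.+)$", megacli_output, re.MULTILINE
    )
    if match is None:
        return []
    return [field.strip() for field in match.group(1).split(",")]


def write_cache_disabled(megacli_output: str) -> bool:
    """True when the controller reports WriteThrough as its current policy.

    Assumption: a healthy BBU would show 'WriteBack' in the same position.
    """
    return "WriteThrough" in current_cache_policy(megacli_output)


# The line pasted in the channel at 09:26:17.
sample = (
    "Current Cache Policy: WriteThrough, ReadAheadNone, Direct, "
    "No Write Cache if Bad BBU"
)
print(write_cache_disabled(sample))
```

Matching on whole fields, not on a substring of the line, keeps the trailing `No Write Cache if Bad BBU` field from influencing the result.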
[09:34:42] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1097 [mediawiki-config] - https://gerrit.wikimedia.org/r/352767 (https://phabricator.wikimedia.org/T147166) (owner: Marostegui)
[09:35:56] (Merged) jenkins-bot: db-eqiad.php: Depool db1097 [mediawiki-config] - https://gerrit.wikimedia.org/r/352767 (https://phabricator.wikimedia.org/T147166) (owner: Marostegui)
[09:36:05] (CR) jenkins-bot: db-eqiad.php: Depool db1097 [mediawiki-config] - https://gerrit.wikimedia.org/r/352767 (https://phabricator.wikimedia.org/T147166) (owner: Marostegui)
[09:37:12] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1097 - T147166 T130067 (duration: 00m 41s)
[09:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:23] T147166: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166
[09:37:23] T130067: Add wl_id to watchlist tables on production dbs - https://phabricator.wikimedia.org/T130067
[09:41:58] !log Stop replication at the same position on db1097 and db2019 - https://phabricator.wikimedia.org/T147166 https://phabricator.wikimedia.org/T130067
[09:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:28] (CR) Volans: "We couldn't setup a .ssh/config file for the users that uses scap with the options to use the right key?" [puppet] - https://gerrit.wikimedia.org/r/352766 (owner: Alexandros Kosiaris)
[09:47:58] (PS2) Alexandros Kosiaris: sshd_config: Increase MaxAuthTries [puppet] - https://gerrit.wikimedia.org/r/352766
[09:50:21] (CR) Alexandros Kosiaris: [C: 2] puppetmaster: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352645 (owner: Dzahn)
[09:50:26] (PS2) Alexandros Kosiaris: puppetmaster: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352645 (owner: Dzahn)
[09:50:30] (CR) Alexandros Kosiaris: [V: 2 C: 2] puppetmaster: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352645 (owner: Dzahn)
[09:53:06] PROBLEM - MariaDB Slave Lag: s4 on db1097 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 654.42 seconds
[09:53:11] (CR) Jcrespo: [C: 1] wmnet: Point m3 slave to codfw master [dns] - https://gerrit.wikimedia.org/r/352769 (https://phabricator.wikimedia.org/T160731) (owner: Marostegui)
[09:53:21] I silenced it
[09:53:23] db1097
[09:53:45] (PS2) Elukey: Set automatic systemd restart for kafka mirror maker [puppet] - https://gerrit.wikimedia.org/r/352759 (https://phabricator.wikimedia.org/T157705)
[09:53:56] strange...
[09:53:58] sorry for the page
[09:54:05] ?
[09:54:24] so it is maintenance?
[09:54:29] I silenced db1097 before starting all the process, but it paged anyways, so maybe it was lost
[09:54:32] yes
[09:54:57] ok, np, I thought we already had hw issues with the new servers
[09:55:13] hehe no (yet)
[09:55:44] (I imagine you answering laughing aloud and then getting serious to say "yet")
[09:56:27] (CR) Elukey: [C: 2] Set automatic systemd restart for kafka mirror maker [puppet] - https://gerrit.wikimedia.org/r/352759 (https://phabricator.wikimedia.org/T157705) (owner: Elukey)
[09:57:57] !log restarting hhvm on mw1190, deadlocked in HPHP::Treadmill::getAgeOldestRequest
[09:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:21] (CR) Alexandros Kosiaris: "@volans, wouldn't that require some changes to scap so that it knows which is the "right" key? Effectively setting -o Identity=,Ident" [puppet] - https://gerrit.wikimedia.org/r/352766 (owner: Alexandros Kosiaris)
[09:58:39] (CR) Alexandros Kosiaris: "sample PCC output at https://puppet-compiler.wmflabs.org/6325/netmon1001.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/352766 (owner: Alexandros Kosiaris)
[09:58:44] jynus: https://www.youtube.com/watch?v=gTghUoScGO8
[09:59:11] (PS4) Alexandros Kosiaris: squid3: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352643 (owner: Dzahn)
[09:59:18] (CR) Alexandros Kosiaris: [V: 2 C: 2] squid3: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352643 (owner: Dzahn)
[09:59:52] RECOVERY - Apache HTTP on mw1190 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.104 second response time
[09:59:52] RECOVERY - Nginx local proxy to apache on mw1190 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.178 second response time
[09:59:52] RECOVERY - HHVM rendering on mw1190 is OK: HTTP OK: HTTP/1.1 200 OK - 74098 bytes in 0.326 second response time
[10:01:01] akosiaris: what I meant is, if scap uses ssh directly, will be ssh to load the right config, what I'm wondering is that scap is run as our users IIRC, so it might be tricky anyway
[10:02:48] yes scap runs like our users so it's not easy to do that via just a .ssh/config file
[10:03:10] cause you might run scap deploy in repo a) and use user A and in repo b) and use user B
[10:03:15] (CR) Volans: "@bblack: maybe something like this could work :) 0.66*x - 12*log10(x)" [puppet] - https://gerrit.wikimedia.org/r/352601 (owner: BBlack)
[10:04:13] and there is a problem if two different services are deployed to the same set of hosts with 2 different keys
[10:04:28] because then you cannot specify them in the ssh config anyway :/
[10:05:20] (CR) Muehlenhoff: [C: 1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/352766 (owner: Alexandros Kosiaris)
[10:06:54] (CR) Alexandros Kosiaris: [C: 2] librenms: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352631 (owner: Dzahn)
[10:07:00] (PS2) Alexandros Kosiaris: librenms: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352631 (owner: Dzahn)
[10:07:22] Operations, Operations-Software-Development, Technical-Debt: Sunset our use of Salt - https://phabricator.wikimedia.org/T164780#3246794 (Qse24h)
[10:07:30] Operations, fundraising-tech-ops, netops: BGP session between pfw clusters flapping - https://phabricator.wikimedia.org/T164777#3246796 (Qse24h)
[10:07:58] volans: no ? ssh_config says "It is possible to have multiple identity files specified in configuration files; all these identities will be tried"
[10:08:07] Operations, Operations-Software-Development, Technical-Debt: Sunset our use of Salt - https://phabricator.wikimedia.org/T164780#3245193 (Qse24h)
[10:08:24] Operations, fundraising-tech-ops, netops: BGP session between pfw clusters flapping - https://phabricator.wikimedia.org/T164777#3245035 (Qse24h)
[10:08:52] Operations, Operations-Software-Development, Technical-Debt: Sunset our use of Salt - https://phabricator.wikimedia.org/T164780#3245193 (Qse24h)
[10:08:56] oh, didn't know. Ok, then the problem is just to have to setup the config file for all users in the group that can use scap for a given project...
[10:09:12] !log reboot kafka1020 for kernel upgrades
[10:09:14] Operations, fundraising-tech-ops, netops: BGP session between pfw clusters flapping - https://phabricator.wikimedia.org/T164777#3245035 (Qse24h)
[10:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:53] "BGP session between pfw clusters flapping" lovely technical description :)
[10:10:04] I guess I should take a quick look at the scap source
[10:10:08] maybe it's easy
[10:10:19] * akosiaris knows he should not have said that
[10:10:26] last words :D
[10:10:54] Operations, Operations-Software-Development, Technical-Debt: Sunset our use of Salt - https://phabricator.wikimedia.org/T164780#3245193 (Qse24h)
[10:11:01] Operations, ops-eqiad: configure RAID on frlog1001 - https://phabricator.wikimedia.org/T164748#3246906 (Qse24h)
[10:11:14] Operations, Operations-Software-Development, Technical-Debt: Sunset our use of Salt - https://phabricator.wikimedia.org/T164780#3245193 (Qse24h)
[10:12:39] Operations, Operations-Software-Development, Technical-Debt: Sunset our use of Salt - https://phabricator.wikimedia.org/T164780#3245193 (Qse24h)
[10:12:42] Operations, fundraising-tech-ops, netops: BGP session between pfw clusters flapping - https://phabricator.wikimedia.org/T164777#3245035 (Qse24h)
[10:12:51] Operations, Operations-Software-Development, Technical-Debt: Sunset our use of Salt - https://phabricator.wikimedia.org/T164780#3245193 (Qse24h)
[10:12:52] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0 xe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]
[10:13:34] Operations, Operations-Software-Development, Technical-Debt: Sunset our use of Salt - https://phabricator.wikimedia.org/T164780#3245193 (Qse24h)
[10:13:37] Operations, fundraising-tech-ops, netops: BGP session between pfw clusters flapping - https://phabricator.wikimedia.org/T164777#3245035 (Qse24h)
[10:13:48] Operations, Operations-Software-Development, Technical-Debt: Sunset our use of Salt - https://phabricator.wikimedia.org/T164780#3245193 (Qse24h)
[10:15:23] Operations, Traffic: Explicitly limit varnishd transient storage - https://phabricator.wikimedia.org/T164768#3244737 (Qse24h)
[10:15:41] Operations, ops-eqiad: configure RAID on frlog1001 - https://phabricator.wikimedia.org/T164748#3244082 (Qse24h)
[10:16:55] Operations, Traffic: Explicitly limit varnishd transient storage - https://phabricator.wikimedia.org/T164768#3244737 (Qse24h)
[10:17:16] Operations, Traffic: Explicitly limit varnishd transient storage - https://phabricator.wikimedia.org/T164768#3244737 (Qse24h)
[10:18:04] Operations, ops-eqiad: configure RAID on frlog1001 - https://phabricator.wikimedia.org/T164748#3244082 (Qse24h)
[10:18:10] Mirror maker auto-restarted this time
[10:18:11] goooood
[10:18:43] Operations, ops-eqiad: mw1264 inaccessible after reboot - https://phabricator.wikimedia.org/T164725#3247127 (Qse24h)
[10:19:06] Operations, Performance, User-Elukey: Investigate a simplified replication model for the Redis Job Queues - https://phabricator.wikimedia.org/T164738#3243584 (Qse24h)
[10:20:31] Operations, ops-eqiad: configure RAID on frlog1001 - https://phabricator.wikimedia.org/T164748#3244082 (Qse24h)
[10:20:51] Operations, Performance, User-Elukey: Investigate a simplified replication model for the Redis Job Queues - https://phabricator.wikimedia.org/T164738#3243584 (Qse24h)
[10:20:56] Operations, ops-eqiad: configure RAID on frlog1001 - https://phabricator.wikimedia.org/T164748#3244082 (Qse24h)
[10:21:04] Operations, ops-eqiad: configure RAID on frlog1001 - https://phabricator.wikimedia.org/T164748#3244082 (Qse24h)
[10:21:36] Operations, ops-eqiad: configure RAID on frlog1001 - https://phabricator.wikimedia.org/T164748#3244082 (Qse24h)
[10:21:54] Operations, ops-eqiad: configure RAID on frlog1001 - https://phabricator.wikimedia.org/T164748#3244082 (Qse24h)
[10:22:39] Operations, ops-eqiad: configure RAID on frlog1001 - https://phabricator.wikimedia.org/T164748#3244082 (Qse24h)
[10:22:47] Operations, ops-eqiad: configure RAID on frlog1001 - https://phabricator.wikimedia.org/T164748#3244082 (Qse24h)
[10:22:57] Operations, ops-eqiad: mw1264 inaccessible after reboot - https://phabricator.wikimedia.org/T164725#3243169 (Qse24h)
[10:23:34] (PS3) Alexandros Kosiaris: sshd_config: Increase MaxAuthTries [puppet] - https://gerrit.wikimedia.org/r/352766
[10:23:38] (CR) Alexandros Kosiaris: [V: 2 C: 2] sshd_config: Increase MaxAuthTries [puppet] - https://gerrit.wikimedia.org/r/352766 (owner: Alexandros Kosiaris)
[10:24:14] Operations, Performance, User-Elukey: Investigate a simplified replication model for the Redis Job Queues - https://phabricator.wikimedia.org/T164738#3243584 (Qse24h)
[10:24:22] Operations, Performance, User-Elukey: Investigate a simplified replication model for the Redis Job Queues -
https://phabricator.wikimedia.org/T164738#3243584 (10Qse24h) [10:24:28] 06Operations, 10ops-eqiad: mw1264 inaccessible after reboot - https://phabricator.wikimedia.org/T164725#3243169 (10Qse24h) [10:24:56] 06Operations, 07Performance, 15User-Elukey: Investigate a simplified replication model for the Redis Job Queues - https://phabricator.wikimedia.org/T164738#3243584 (10Qse24h) [10:24:58] 06Operations, 07Performance, 15User-Elukey: Investigate a simplified replication model for the Redis Job Queues - https://phabricator.wikimedia.org/T164738#3243584 (10Qse24h) [10:25:16] 06Operations, 07Performance, 15User-Elukey: Investigate a simplified replication model for the Redis Job Queues - https://phabricator.wikimedia.org/T164738#3243584 (10Qse24h) [10:27:14] 06Operations, 10ops-eqiad: mw1264 inaccessible after reboot - https://phabricator.wikimedia.org/T164725#3243169 (10Qse24h) [10:27:36] 06Operations, 10ops-eqiad: mw1264 inaccessible after reboot - https://phabricator.wikimedia.org/T164725#3243169 (10Qse24h) [10:27:41] 06Operations, 10ops-eqiad: mw1264 inaccessible after reboot - https://phabricator.wikimedia.org/T164725#3243169 (10Qse24h) [10:28:05] 06Operations, 10ops-eqiad: mw1264 inaccessible after reboot - https://phabricator.wikimedia.org/T164725#3243169 (10Qse24h) [10:28:07] 06Operations, 10ops-eqiad: mw1264 inaccessible after reboot - https://phabricator.wikimedia.org/T164725#3243169 (10Qse24h) [10:28:22] 06Operations, 10ops-eqiad: mw1264 inaccessible after reboot - https://phabricator.wikimedia.org/T164725#3243169 (10Qse24h) [10:29:01] 06Operations, 10ops-eqiad: mw1264 inaccessible after reboot - https://phabricator.wikimedia.org/T164725#3243169 (10Qse24h) [10:30:33] 06Operations, 10fundraising-tech-ops, 10netops: BGP session between pfw clusters flapping - https://phabricator.wikimedia.org/T164777#3247482 (10MoritzMuehlenhoff) 05duplicate>03Open [10:30:39] 06Operations, 06Operations-Software-Development, 07Technical-Debt: Sunset our use of Salt - 
https://phabricator.wikimedia.org/T164780#3247483 (10Peachey88) 05duplicate>03Open [10:30:49] 06Operations, 10ops-eqiad: configure RAID on frlog1001 - https://phabricator.wikimedia.org/T164748#3247493 (10MoritzMuehlenhoff) 05duplicate>03Open [10:31:17] 06Operations, 10Traffic: Explicitly limit varnishd transient storage - https://phabricator.wikimedia.org/T164768#3247508 (10MoritzMuehlenhoff) 05duplicate>03Open [10:31:24] 06Operations, 10ops-eqiad: mw1264 inaccessible after reboot - https://phabricator.wikimedia.org/T164725#3247512 (10MoritzMuehlenhoff) 05duplicate>03Open [10:31:27] (03CR) 10Marostegui: [C: 032] wmnet: Point m3 slave to codfw master [dns] - 10https://gerrit.wikimedia.org/r/352769 (https://phabricator.wikimedia.org/T160731) (owner: 10Marostegui) [10:32:59] 06Operations, 07Performance, 15User-Elukey: Investigate a simplified replication model for the Redis Job Queues - https://phabricator.wikimedia.org/T164738#3247560 (10Aklapper) 05duplicate>03Open [10:34:17] !log reboot kafka1022 for kernel upgrades [10:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:59] !log akosiaris@tin Started deploy [librenms/librenms@259e998]: (no justification provided) [10:35:02] !log akosiaris@tin Finished deploy [librenms/librenms@259e998]: (no justification provided) (duration: 00m 02s) [10:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:07] perfect [10:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:33] !log ayounsi@tin Started deploy [librenms/librenms@259e998]: (no justification provided) [10:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:43] !log ayounsi@tin Finished deploy [librenms/librenms@259e998]: (no justification provided) (duration: 00m 09s) [10:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:42] (03PS3) 10Alexandros Kosiaris: services: use logrotate::conf 
for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352641 (owner: 10Dzahn) [10:48:48] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] services: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352641 (owner: 10Dzahn) [10:50:57] (03PS1) 10Ladsgroup: Enable sending Wikidata notification on Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352795 (https://phabricator.wikimedia.org/T142103) [11:03:36] !log forced net.netfilter.nf_conntrack_tcp_timeout_time_wait = 65 to all the kafka brokers [11:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:36] (03PS1) 10Volans: ClusterShell: fix set of list options [software/cumin] - 10https://gerrit.wikimedia.org/r/352796 (https://phabricator.wikimedia.org/T164824) [11:16:08] RECOVERY - MariaDB Slave Lag: s4 on db1097 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [11:20:51] (03PS1) 10Ladsgroup: Do not make dumps of wb_entity_per_page [puppet] - 10https://gerrit.wikimedia.org/r/352797 (https://phabricator.wikimedia.org/T140890) [11:21:35] (03CR) 10Ladsgroup: "Moved dumps part to I5641d6d88c10c0a7ab24c5124420793093dc4e02" [puppet] - 10https://gerrit.wikimedia.org/r/352574 (https://phabricator.wikimedia.org/T140890) (owner: 10Ladsgroup) [11:22:12] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1097" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352798 [11:24:10] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1097" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352798 (owner: 10Marostegui) [11:25:10] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1097" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352798 (owner: 10Marostegui) [11:25:42] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1097" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352798 (owner: 10Marostegui) [11:26:42] PROBLEM - DPKG on restbase2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:26:49] 
(03PS1) 10Volans: ClusterShell: output directly when single host [software/cumin] - 10https://gerrit.wikimedia.org/r/352799 (https://phabricator.wikimedia.org/T164827) [11:27:04] (03PS2) 10Ladsgroup: Do not rebuild wb_entity_per_page [puppet] - 10https://gerrit.wikimedia.org/r/352574 (https://phabricator.wikimedia.org/T140890) [11:27:15] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore original weight for db1097 - T147166 T130067 (duration: 00m 39s) [11:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:24] T147166: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166 [11:27:24] T130067: Add wl_id to watchlist tables on production dbs - https://phabricator.wikimedia.org/T130067 [11:28:42] RECOVERY - DPKG on restbase2002 is OK: All packages OK [11:30:44] (03PS1) 10Filippo Giunchedi: prometheus: add aggregated PDU stats [puppet] - 10https://gerrit.wikimedia.org/r/352800 (https://phabricator.wikimedia.org/T148541) [11:34:29] 06Operations, 13Patch-For-Review, 07Technical-Debt: Retire Torrus - https://phabricator.wikimedia.org/T87840#1000364 (10fgiunchedi) Torrus functions for PDU-related aggregated metrics has been replaced by Prometheus in {T148541}. Left TODO is to restore torrus RRDs from bacula and import them into graphite f... [11:35:11] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: add aggregated PDU stats [puppet] - 10https://gerrit.wikimedia.org/r/352800 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [11:35:11] !log rebooting restbase2002 for update to Linux 4.9 and new OpenJDK [11:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:01] 06Operations, 13Patch-For-Review, 07Technical-Debt: Retire Torrus - https://phabricator.wikimedia.org/T87840#3247716 (10faidon) How are we going to be able to look at the trends of both historical and current data? 
Also, what are we going to do for data that we keep past Prometheus retention period? The powe... [11:45:22] !log Disable replication codfw > eqiad on s5 - https://phabricator.wikimedia.org/T147166 https://phabricator.wikimedia.org/T130067 [11:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:53] !log Stop replication at the same position on db1049 and db2023 - https://phabricator.wikimedia.org/T147166 https://phabricator.wikimedia.org/T130067 [11:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:13] (03PS2) 10Reedy: wfLoadExtension( 'ZeroBanner' ) in mobile.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348295 (https://phabricator.wikimedia.org/T163041) [12:01:36] Anyone have any idea why there is no Morning SWAT today? [12:02:20] oh, is there never a morning swat on tuesdays? [12:02:21] hah [12:12:39] hello [12:13:08] how do I track tr.wikipedia logo update request? apparently it has been done already [12:17:16] (03PS2) 10Hashar: interface: add rspec boilerplate [puppet] - 10https://gerrit.wikimedia.org/r/340420 [12:17:18] (03PS4) 10Hashar: interface: IPAddr.new() requires an address family [puppet] - 10https://gerrit.wikimedia.org/r/336840 [12:19:04] !log rebooting restbase2003 for update to Linux 4.9 and new OpenJDK [12:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:39] (03CR) 10Hashar: "I have cherry picked it on the deployment-prep puppetmaster. That partlly unbroke puppet on deployment-ircd and deployment-urldownloader." 
[puppet] - 10https://gerrit.wikimedia.org/r/336840 (owner: 10Hashar) [12:28:30] (03CR) 10Alexandros Kosiaris: [C: 032] etcd: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352637 (owner: 10Dzahn) [12:28:35] (03PS2) 10Alexandros Kosiaris: etcd: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352637 (owner: 10Dzahn) [12:28:38] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] etcd: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352637 (owner: 10Dzahn) [12:34:27] !log upgrade ELK on deplyoment-logstash2 [12:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:48] lots of errors on mwdebug [12:34:52] PROBLEM - puppet last run on etcd1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:38:52] PROBLEM - puppet last run on etcd1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:41:06] Urbanecm: just checking here on T164239 that deployment was delayed to today correct? [12:41:06] T164239: Restrict page move to autopatrolled on Hindi Wikipedia (hiwiki) - https://phabricator.wikimedia.org/T164239 [12:42:52] PROBLEM - puppet last run on conf1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:47:03] PROBLEM - puppet last run on etcd1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:49:33] PROBLEM - puppet last run on conf2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:50:53] PROBLEM - puppet last run on conf1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:50:54] PROBLEM - puppet last run on etcd1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:51:17] akosiaris: puppet failures are caused by the logrotate change [12:51:59] Invalid relationship: Rsyslog::Conf[etcd] {require => File[/etc/logrotate.d/etcd}, because File doesn't seem to be in the catalogue [12:52:23] sigh ok reverting [12:52:32] thanks! [12:53:24] (03PS1) 10Alexandros Kosiaris: Revert "etcd: use logrotate::conf for logrotate" [puppet] - 10https://gerrit.wikimedia.org/r/352816 [12:53:29] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "etcd: use logrotate::conf for logrotate" [puppet] - 10https://gerrit.wikimedia.org/r/352816 (owner: 10Alexandros Kosiaris) [12:53:32] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Revert "etcd: use logrotate::conf for logrotate" [puppet] - 10https://gerrit.wikimedia.org/r/352816 (owner: 10Alexandros Kosiaris) [12:55:53] PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:57:34] jouncebot: refresh [12:57:36] I refreshed my knowledge about deployments. [12:57:39] jouncebot: next [12:57:39] In 0 hour(s) and 2 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170509T1300) [13:00:06] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170509T1300). Please do the needful. [13:00:06] kart_, Urbanecm, Amir1, and reedy: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[13:00:13] o/ [13:00:16] * kart_ here [13:00:52] o/ [13:00:58] o/ [13:01:19] I havent reviewed anything :( [13:01:53] RECOVERY - puppet last run on etcd1001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [13:02:30] They're nearly all config patches [13:02:37] !log rebooting restbase2004 for update to Linux 4.9 and new OpenJDK [13:02:42] 6 should be easy [13:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:53] hashar: you are in charge of swat today? [13:04:02] I might have one or 2 to add at the end of there is space,one will need a full scap though [13:05:24] (03PS3) 10Reedy: wfLoadExtension( 'ZeroBanner' ) in mobile.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348295 (https://phabricator.wikimedia.org/T163041) [13:05:29] (03CR) 10Reedy: [C: 032] wfLoadExtension( 'ZeroBanner' ) in mobile.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348295 (https://phabricator.wikimedia.org/T163041) (owner: 10Reedy) [13:05:31] let's jfdi [13:05:44] I would help but in a meeting :) [13:05:52] (03PS1) 10Ayounsi: - Remove trailing ?> (to match .default file) - Fix current macro (ifDescr -> ifAlias) - Add Junos state specific macros [puppet] - 10https://gerrit.wikimedia.org/r/352820 [13:06:23] (03PS2) 10Ayounsi: - Remove trailing ?> (to match .default file) - Fix current macro (ifDescr -> ifAlias) - Add Junos state specific macros [puppet] - 10https://gerrit.wikimedia.org/r/352820 [13:07:00] (03Merged) 10jenkins-bot: wfLoadExtension( 'ZeroBanner' ) in mobile.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348295 (https://phabricator.wikimedia.org/T163041) (owner: 10Reedy) [13:07:23] (03PS2) 10Reedy: PageTriage to extension.json in extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348296 (https://phabricator.wikimedia.org/T139800) [13:07:28] (03PS3) 10Reedy: PageTriage to extension.json in extension-list [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/348296 (https://phabricator.wikimedia.org/T139800) [13:07:32] (03CR) 10Reedy: [C: 032] PageTriage to extension.json in extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348296 (https://phabricator.wikimedia.org/T139800) (owner: 10Reedy) [13:07:53] RECOVERY - puppet last run on etcd1004 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [13:08:09] (03CR) 10Addshore: wmgUseTwoColConflict true for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350847 (owner: 10Addshore) [13:08:13] (03PS2) 10Addshore: wmgUseTwoColConflict true for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350847 [13:08:37] !log reedy@tin Synchronized wmf-config/mobile.php: wfLoadExtension for ZeroBanner (duration: 00m 41s) [13:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:53] RECOVERY - puppet last run on conf1001 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [13:10:01] (03CR) 10Hashar: Enable sending Wikidata notification on Wikivoyage (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352795 (https://phabricator.wikimedia.org/T142103) (owner: 10Ladsgroup) [13:10:11] (03Merged) 10jenkins-bot: PageTriage to extension.json in extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348296 (https://phabricator.wikimedia.org/T139800) (owner: 10Reedy) [13:10:13] Amir1: for the wikibase notification patch on https://gerrit.wikimedia.org/r/#/c/352795/ [13:10:35] Amir1: those settings are enabled on testing wikis a few lines above (line 99 and below) [13:10:52] yes [13:11:12] !log reedy@tin Synchronized wmf-config/extension-list: PageTriage to extension.json in extension-list (duration: 00m 39s) [13:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:47] hashar: the reason I did it was that we are going to add more families later [13:12:08] Amir1: 
ahhh [13:12:18] (03CR) 10Hashar: [C: 031] Enable sending Wikidata notification on Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352795 (https://phabricator.wikimedia.org/T142103) (owner: 10Ladsgroup) [13:13:27] (03PS3) 10Addshore: wmgUseTwoColConflict true for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350847 [13:13:57] kart_: ContentTranslation patch is on mwdebug1001 [13:14:02] hashar: Reedy I added 2 if there is time :) [13:14:09] hashar: testing. [13:14:23] Reedy: I have pulled a change to ContentTranslation [13:16:13] RECOVERY - puppet last run on etcd1006 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [13:16:49] addshore: yes add them [13:17:01] {{done}} [13:17:10] 1 on the branch and 1 config [13:17:33] RECOVERY - puppet last run on conf2002 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [13:17:41] 06Operations, 10DBA, 13Patch-For-Review, 05Prometheus-metrics-monitoring: MySQL monitoring with prometheus - https://phabricator.wikimedia.org/T143896#3247903 (10jcrespo) [13:18:04] Reedy: have you pushed other changes beside yours? [13:18:12] Not yet [13:18:42] ok [13:18:45] hashar: testing. Bit lengthy stuff. [13:18:53] RECOVERY - puppet last run on conf1002 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [13:18:53] RECOVERY - puppet last run on etcd1003 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [13:18:53] kart_: no worries. 
take your time [13:21:02] (03CR) 10jenkins-bot: wfLoadExtension( 'ZeroBanner' ) in mobile.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348295 (https://phabricator.wikimedia.org/T163041) (owner: 10Reedy) [13:21:04] (03CR) 10jenkins-bot: PageTriage to extension.json in extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348296 (https://phabricator.wikimedia.org/T139800) (owner: 10Reedy) [13:21:58] 06Operations, 10DBA: In some database hosts, performance schema loses digest statistics - https://phabricator.wikimedia.org/T164834#3247935 (10jcrespo) [13:22:53] RECOVERY - puppet last run on conf1003 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [13:23:09] 06Operations, 10DBA: In some database hosts, performance schema loses digest statistics - https://phabricator.wikimedia.org/T164834#3247953 (10jcrespo) Only added volans because once we didn't see a query being registered, and this is the explanation why- but it should be fixed (you do not need to do anything,... [13:23:38] !log upgrading elasticsearch on relforge - T163703 [13:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:48] T163703: [epic] Upgrade elasticsearch to 5.3.2 - https://phabricator.wikimedia.org/T163703 [13:24:46] 06Operations, 05Prometheus-metrics-monitoring, 15User-fgiunchedi: Upgrade mysqld_exporter to 0.10.0 - https://phabricator.wikimedia.org/T161296#3247958 (10jcrespo) We can deploy to codfw now, where worse case scenario, it would not cause a visible outage. We are really keen on those new features. [13:25:00] hashar: go ahead. Found another issue, not related to this patch. 
[13:25:19] kart_: doing :-} [13:25:44] rebasing Urbanecm patches [13:25:53] !log hashar@tin Synchronized php-1.29.0-wmf.21/extensions/ContentTranslation/modules/tools/ext.cx.tools.template.js: Fix the container calculation for template editor - T163105 (duration: 00m 40s) [13:25:56] (03PS2) 10Hashar: Allow new page patroll for autoconfirmed users on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351385 (https://phabricator.wikimedia.org/T164159) (owner: 10Urbanecm) [13:26:00] kart_: done! [13:26:00] (03PS2) 10Hashar: Allow page move only autopatrolled at hiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351382 (https://phabricator.wikimedia.org/T164239) (owner: 10Urbanecm) [13:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:02] T163105: CX template editor's own HTML may end up in the published as an HTML blob - https://phabricator.wikimedia.org/T163105 [13:26:04] (03PS2) 10Hashar: Create Autor and Portal namespaces on Spanish Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352250 (https://phabricator.wikimedia.org/T164195) (owner: 10Urbanecm) [13:26:17] !log ayounsi@tin Started deploy [librenms/librenms@b10cc7c]: (no justification provided) [13:26:18] hashar: thanks! 
[13:26:21] !log ayounsi@tin Finished deploy [librenms/librenms@b10cc7c]: (no justification provided) (duration: 00m 04s) [13:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:14] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351385 (https://phabricator.wikimedia.org/T164159) (owner: 10Urbanecm) [13:27:23] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351382 (https://phabricator.wikimedia.org/T164239) (owner: 10Urbanecm) [13:29:14] (03Merged) 10jenkins-bot: Allow new page patroll for autoconfirmed users on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351385 (https://phabricator.wikimedia.org/T164159) (owner: 10Urbanecm) [13:29:41] (03Merged) 10jenkins-bot: Allow page move only autopatrolled at hiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351382 (https://phabricator.wikimedia.org/T164239) (owner: 10Urbanecm) [13:29:59] Reedy: are you done with ZeroBanner / PageTriage ? 
[13:32:39] well assuming it is [13:33:21] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Allow new page patroll for autoconfirmed users on bnwiki - T164159 (duration: 00m 40s) [13:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:29] T164159: Make newpage patrolling available to autoconfirmed users on bnwiki - https://phabricator.wikimedia.org/T164159 [13:35:31] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Allow page move only autopatrolled at hiwiki - T164239 (duration: 00m 42s) [13:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:39] T164239: Restrict page move to autopatrolled on Hindi Wikipedia (hiwiki) - https://phabricator.wikimedia.org/T164239 [13:35:59] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352250 (https://phabricator.wikimedia.org/T164195) (owner: 10Urbanecm) [13:36:55] (03PS2) 10Hashar: Enable sending Wikidata notification on Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352795 (https://phabricator.wikimedia.org/T142103) (owner: 10Ladsgroup) [13:37:04] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352795 (https://phabricator.wikimedia.org/T142103) (owner: 10Ladsgroup) [13:38:44] (03Merged) 10jenkins-bot: Create Autor and Portal namespaces on Spanish Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352250 (https://phabricator.wikimedia.org/T164195) (owner: 10Urbanecm) [13:39:41] !log cancel upgrading elasticsearch on relforge (plugin under test is missing a release for 5.3.2) - T163703 [13:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:49] T163703: [epic] Upgrade elasticsearch to 5.3.2 - https://phabricator.wikimedia.org/T163703 [13:40:21] (03CR) 10Alexandros Kosiaris: [C: 031] - Remove trailing ?> (to match .default file) - Fix current macro (ifDescr -> ifAlias) - Add Junos 
state specific macros [puppet] - 10https://gerrit.wikimedia.org/r/352820 (owner: 10Ayounsi) [13:40:37] (03Merged) 10jenkins-bot: Enable sending Wikidata notification on Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352795 (https://phabricator.wikimedia.org/T142103) (owner: 10Ladsgroup) [13:41:35] (03CR) 10Ayounsi: [C: 032] - Remove trailing ?> (to match .default file) - Fix current macro (ifDescr -> ifAlias) - Add Junos state specific macros [puppet] - 10https://gerrit.wikimedia.org/r/352820 (owner: 10Ayounsi) [13:42:33] (03PS1) 10Milimetric: Remove redundant Dashiki config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352825 [13:43:25] (03PS5) 10BBlack: maps->upload functional cluster-level changes [puppet] - 10https://gerrit.wikimedia.org/r/351663 (https://phabricator.wikimedia.org/T164608) [13:43:27] (03PS1) 10BBlack: varnish: align on 1d TTL work [puppet] - 10https://gerrit.wikimedia.org/r/352826 [13:43:29] (03PS1) 10BBlack: varnish: move nuke/lru and exp_thread stuff to all clusters [puppet] - 10https://gerrit.wikimedia.org/r/352827 [13:43:36] (03CR) 10jenkins-bot: Allow new page patroll for autoconfirmed users on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351385 (https://phabricator.wikimedia.org/T164159) (owner: 10Urbanecm) [13:43:38] (03CR) 10jenkins-bot: Allow page move only autopatrolled at hiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351382 (https://phabricator.wikimedia.org/T164239) (owner: 10Urbanecm) [13:43:40] (03CR) 10jenkins-bot: Create Autor and Portal namespaces on Spanish Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352250 (https://phabricator.wikimedia.org/T164195) (owner: 10Urbanecm) [13:43:42] (03CR) 10jenkins-bot: Enable sending Wikidata notification on Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352795 (https://phabricator.wikimedia.org/T142103) (owner: 10Ladsgroup) [13:44:01] (03PS2) 10Alexandros Kosiaris: lvs: Add the kubernetes master 
service/cluster [puppet] - 10https://gerrit.wikimedia.org/r/352580 (https://phabricator.wikimedia.org/T162040) [13:44:05] (03PS2) 10Alexandros Kosiaris: Migrate to using kubemaster.svc.$site.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/352581 [13:44:26] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Create Autor and Portal namespaces on Spanish Wikisource - T164195 (duration: 00m 39s) [13:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:34] T164195: Create Autor and Portal namespaces on Spanish Wikisource - https://phabricator.wikimedia.org/T164195 [13:44:56] (03CR) 10Milimetric: [C: 032] Remove redundant Dashiki config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352825 (owner: 10Milimetric) [13:45:21] (03CR) 10Milimetric: [C: 032] "made sure there's no freeze to deploy: https://wikitech.wikimedia.org/wiki/Deployments#Tuesday.2C.C2.A0May.C2.A009 but I'm not clear on wh" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352825 (owner: 10Milimetric) [13:46:47] !log upgrade deployment-prep cluster to elasticsearch 5.3.2 - T163707 [13:46:48] Amir1: I am syncing your change [13:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:55] T163707: Upgrade the deployment-prep cluster to elastic 5.3.2 - https://phabricator.wikimedia.org/T163707 [13:46:59] Thanks [13:47:07] !log hashar@tin Synchronized wmf-config/Wikibase-production.php: Enable sending Wikidata notification on Wikivoyage - T142103 (duration: 00m 39s) [13:47:07] (03Merged) 10jenkins-bot: Remove redundant Dashiki config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352825 (owner: 10Milimetric) [13:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:14] T142103: Decide on which clients users should get notifications - https://phabricator.wikimedia.org/T142103 [13:47:14] It's not testable in mwdebug [13:48:37] (03PS4) 10Hashar: wmgUseTwoColConflict true for all 
wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350847 (owner: 10Addshore) [13:48:55] addshore: I am syncing twocol things :-} [13:48:56] o/ [13:49:02] Okay, please each sepertly [13:49:10] !log hashar@tin Synchronized php-1.29.0-wmf.21/extensions/TwoColConflict: BACKPORTS from master - T162806 T163886 (duration: 00m 41s) [13:49:17] addshore: code is updated now [13:49:17] I can verify the backport before the config change [13:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:19] T163886: Adapt help dialogue to moment of beta deploy - https://phabricator.wikimedia.org/T163886 [13:49:19] T162806: Make TwoColConflict a beta feature on all wikis - https://phabricator.wikimedia.org/T162806 [13:49:21] Checking [13:49:24] (03CR) 10jenkins-bot: Remove redundant Dashiki config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352825 (owner: 10Milimetric) [13:50:27] hashar: thats code change will need a full scap [13:50:40] *that [13:50:58] ah yeah the l10n files [13:51:05] :-) [13:51:13] * CFisch_WMDE also testing it atm [13:51:14] !log hashar@tin Started scap: TwoColConflict update [13:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:28] (: [13:51:29] addshore: CFisch_WMDE: full scap ongoing [13:51:34] (: or :( ? 
:D [13:51:43] (: [13:51:44] ): [13:51:48] :D [13:51:53] C-: [13:53:03] *twiddles thumbs* [13:53:36] *hovers over Ctrl+R* [13:54:53] (03PS1) 10Gehel: elasticsearch - row / rack configuration should not be empty [puppet] - 10https://gerrit.wikimedia.org/r/352831 (https://phabricator.wikimedia.org/T163707) [13:57:40] (03CR) 10DCausse: [C: 031] elasticsearch - row / rack configuration should not be empty [puppet] - 10https://gerrit.wikimedia.org/r/352831 (https://phabricator.wikimedia.org/T163707) (owner: 10Gehel) [13:58:35] progress [13:58:39] (03CR) 10Gehel: [C: 032] elasticsearch - row / rack configuration should not be empty [puppet] - 10https://gerrit.wikimedia.org/r/352831 (https://phabricator.wikimedia.org/T163707) (owner: 10Gehel) [13:58:43] addshore: CFisch_WMDE: scap now checking canaries [13:59:58] ack! [14:00:25] 06Operations, 10ops-eqiad: configure RAID on frlog1001 - https://phabricator.wikimedia.org/T164748#3248104 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson Raid has been setup for raid 6 [14:00:53] logstash is happy [14:00:57] huge sync to the proxies is going on [14:01:27] It's mainly JS and also only enabled in 1 place :P [14:01:45] sync-apaches: 0% (ok: 0; fail: 0; left: 300) [14:01:46] :D [14:02:00] the l10n cache got regenerated, so that is a few GBytes to transfer around [14:03:12] yarp! [14:03:47] sync-apaches: 57% (ok: 173; fail: 0; left: 127) [14:04:05] la la la la [14:05:18] * hashar whistles [14:05:33] "allons enfant de la patriieeeeee..." [14:05:37] haha [14:05:42] cdb rebuild [14:06:36] 06Operations, 10ops-eqiad: mw1264 inaccessible after reboot - https://phabricator.wikimedia.org/T164725#3248119 (10Cmjohnson) 05Open>03Resolved idrac was not responding, performed system reboot and drained flea power....all is well. System is back up. 
[14:06:36] !log Run pt-table-checksum on s7.fawiki - T163190 [14:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:43] T163190: Run pt-table-checksum on s7 - https://phabricator.wikimedia.org/T163190 [14:08:19] 06Operations, 10ops-eqiad, 10DBA, 10Phabricator, 13Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3248124 (10Cmjohnson) @Marostegui Let me know if you want to do the bbu swap today? [14:08:39] scap-cdb-rebuild: 99% (ok: 317; fail: 0; left: 2) [14:09:19] (03CR) 10Ema: [C: 031] "Looks good and puppet compiler agrees https://puppet-compiler.wmflabs.org/6329/" [puppet] - 10https://gerrit.wikimedia.org/r/352826 (owner: 10BBlack) [14:09:33] !log Stop MySQL and shutdown db1048 (phabricator slave) to replace BBU - T160731 [14:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:42] T160731: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731 [14:10:44] !log hashar@tin Finished scap: TwoColConflict update (duration: 19m 30s) [14:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:53] addshore: CFisch_WMDE : scap done [14:11:03] ack [14:11:12] \o/ [14:11:43] then I guess we can enable it on all wikis [14:12:19] hmmmm [14:12:34] I'm not seeing some of the new messages appearing [14:12:34] hmmmm like there is some little glitches [14:12:42] or HMMMMM like wikis are down ? :D [14:12:54] minor glitches [14:13:01] don't be giving me a heart attack like that hashar [14:13:02] could be JS needs to hurry up [14:13:05] I can rsync l10n maybe [14:13:16] yeah js probably is cached for 5 minutes or so [14:13:30] apergos: dont worry :-} if it is mw related i am in charge! 
[14:13:33] PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [14:13:33] PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [14:13:41] addshore: yeah I alos do not see messages loaded by RS for the JS popup - could also be RS caching? o.O [14:13:44] * apergos runs away right after those criticals [14:14:02] try with ?debug=1 ? [14:14:50] hhmm,, its actually very hard to get debug=1 in the place where the mesages are shown! [14:15:13] But it should be there all the other messages are there and they are also part of the patch [14:15:58] missing new messages thouse, the last slide of the help dialogue is also missing [14:16:12] (03CR) 10Ema: [C: 031] varnish: move nuke/lru and exp_thread stuff to all clusters [puppet] - 10https://gerrit.wikimedia.org/r/352827 (owner: 10BBlack) [14:16:13] also loaded via RS [14:16:21] !log reboot kafka1001 for kernel upgrades (eventbus codfw) [14:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:54] !log correction: reboot kafka2001 for kernel upgrades (eventbus codfw) [14:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:02] sorry sorry [14:17:14] hmmm [14:17:32] 06Operations, 10netops, 13Patch-For-Review: analytics hosts frequently tripping 'port utilization threshold' librenms alerts - https://phabricator.wikimedia.org/T133852#3248147 (10ayounsi) 05Open>03Resolved Now that LibreNMS has been upgraded to the most recent version, I've been able to poke more at thi... [14:17:40] I'm checking what's up with dbproxy alerts [14:17:47] jynus marostegui ^ FYI [14:17:54] addshore: CFisch_WMDE want me to redo the l10n update? [14:17:56] sync-l10n [14:18:04] hashar: sure, (how long does that take)? :P [14:18:05] godog: yeah, we have shutodnw db1048 [14:18:20] marostegui: ah ok, known then [14:18:21] for maintenance [14:18:25] where is the alert? 
[14:18:31] ah right [14:18:35] sorry yes, it is me [14:18:37] yeah, proxy alerts when a component is down [14:18:50] because otherwise we wouldn't notice it [14:18:59] yeah, i was checking my phone sorry :) [14:19:11] but they are proxies, so there is nothing to do [14:19:23] !log hashar@tin Started scap: (no justification provided) [14:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:30] they will come back when the db comes back [14:19:38] addshore: I am doing a full scap again [14:19:42] ack [14:19:42] (03PS1) 10BBlack: maps->upload: move IP, delete maps-specific things [puppet] - 10https://gerrit.wikimedia.org/r/352834 (https://phabricator.wikimedia.org/T164608) [14:19:43] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [14:20:26] there is a spike in 503 [14:21:12] <_joe_> jynus: sustained? [14:21:15] there is https://logstash.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X if it can help [14:21:23] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [14:21:26] _joe_, I am looking [14:21:33] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [14:21:34] <_joe_> looks like something serious [14:22:14] https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5&from=now-6h&to=now [14:22:23] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:22:33] <_joe_> we have an increased number of 500 too [14:22:33] !log hashar@tin Finished scap: (no justification provided) (duration: 03m 10s) [14:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:43] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is 
CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [14:23:07] they're 503s mostly in the varnish reporting (rather than 500) [14:23:09] https://grafana.wikimedia.org/dashboard/db/production-logging?refresh=5m&orgId=1&from=now-1h&to=now shows a spike of errors in apache syslog ? [14:23:11] I think it is gone [14:23:19] but it wwas more than esams [14:23:34] but sometimes backend errors that look like 500s there can result in 503s generated by varnish because it can't successfully make requests [14:23:34] PROBLEM - cassandra-c SSL 10.192.48.48:7001 on restbase2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [14:23:34] RECOVERY - haproxy failover on dbproxy1008 is OK: OK check_failover servers up 2 down 0 [14:23:34] RECOVERY - haproxy failover on dbproxy1003 is OK: OK check_failover servers up 2 down 0 [14:23:37] the 5xx are gone [14:23:43] the 500 are still up [14:23:43] PROBLEM - cassandra-a CQL 10.192.48.46:9042 on restbase2005 is CRITICAL: connect to address 10.192.48.46 and port 9042: Connection refused [14:23:48] hashar: messages still seem to be missing [14:23:50] :/ [14:23:53] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [14:23:54] <_joe_> bblack: yes, apart from 5xx, there is a baseline of 500s [14:23:57] addshore: :( [14:24:03] PROBLEM - cassandra-c CQL 10.192.48.48:9042 on restbase2005 is CRITICAL: connect to address 10.192.48.48 and port 9042: Connection refused [14:24:03] PROBLEM - cassandra-a SSL 10.192.48.46:7001 on restbase2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [14:24:04] PROBLEM - mediawiki-installation DSH group on mw1264 is CRITICAL: Host mw1264 is not in mediawiki-installation dsh group [14:24:13] PROBLEM - cassandra-b SSL 10.192.48.47:7001 on restbase2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [14:24:14] PROBLEM - 
cassandra-b CQL 10.192.48.47:9042 on restbase2005 is CRITICAL: connect to address 10.192.48.47 and port 9042: Connection refused [14:24:19] 20% of 500 is a lot of them [14:24:35] <_joe_> what's up with mw1264? [14:24:38] <_joe_> moritzm: ? [14:24:57] from apache2/syslog roughly ~2000 messages AH01079: failed to make connection to backend: 127.0.0.1 [14:25:05] around 14:16 / 14:17 [14:25:23] <_joe_> hashar: from which servers? [14:25:31] <_joe_> everyone or a few specific ones? [14:25:39] <_joe_> that would seem a general crash of appservers [14:25:40] a bunch [14:25:49] <_joe_> all? [14:25:50] mw1196 / mw1199 / mw1222 etc [14:25:53] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack and cable frlog1001 - https://phabricator.wikimedia.org/T163127#3248163 (10Cmjohnson) Could the fact that it's not installing and still reaching out to the tftp server have anything to do with it? Just updated the raid cfg so @jgreen can install when he's... [14:26:05] I would say yeah pretty much any mw1xxx [14:26:23] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [14:26:26] things are calming, I think? [14:26:34] https://logstash.wikimedia.org/goto/426f0dd61313029f13ed79ed8dfc7430 [14:26:39] with the exception of thumbs [14:26:45] I believe that was a one time spike [14:26:48] but not sure if more than usual [14:27:16] <_joe_> the timing corresponds with the 5xx spike [14:27:37] <_joe_> yoea I'd say it has to do with something happening to hhvm [14:27:38] with thumbs I mean upload domain in general [14:28:14] addshore: hashar Missing/Wrong l18n are defenitly related to RessourceLoader loading them for the JS parts, all the other messages are there so I would assume its strange RL caching - even with debug=1 things are missing... but that fits to my past experience with RL ... 
[14:28:19] now it is query.wikidata.org [14:29:00] https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?refresh=1m&orgId=1&panelId=15&fullscreen&from=now-1h&to=now [14:29:53] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:30:33] CFisch_WMDE: addshore: in logstash there are a bunch of "Failed to find XXX (lang)" messages. But they are from around 14:04 or 26 minutes ago [14:30:33] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:30:45] *looks* [14:30:54] hashar: can you link? or? [14:31:13] probably me refreshing to early ^^ [14:31:26] https://logstash.wikimedia.org/goto/f9e56ce3dc44b5b3fe50572c16f4d288 [14:31:31] if you have access to logstash [14:31:33] urandom,mobrovac - something weird happened to restbase2005 [14:32:08] hashar: yeh, I think that was when you did the sync but not the full scap and I guess me and CFisch_WMDE loaded it :) [14:32:25] (03PS2) 10BBlack: varnish: align on 1d TTL work [puppet] - 10https://gerrit.wikimedia.org/r/352826 [14:32:41] (03CR) 10BBlack: [V: 032 C: 032] varnish: align on 1d TTL work [puppet] - 10https://gerrit.wikimedia.org/r/352826 (owner: 10BBlack) [14:32:43] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [14:33:23] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:33:43] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:35:34] 06Operations, 10ops-eqiad, 10DBA, 10Phabricator, 13Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3248207 (10Marostegui) 05Open>03Resolved @Cmjohnson has changed the battery, we will see how it goes. ``` root@db1048:~# megacli -AdpBbuCmd -a0 BBU stat... 
[14:35:39] (03PS1) 10Cmjohnson: removing dns entries for db1040 *decom* T164057 [dns] - 10https://gerrit.wikimedia.org/r/352840 [14:36:16] (03CR) 10Cmjohnson: [C: 032] removing dns entries for db1040 *decom* T164057 [dns] - 10https://gerrit.wikimedia.org/r/352840 (owner: 10Cmjohnson) [14:36:43] PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 2075753 [14:36:57] 06Operations, 10DBA, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3248215 (10Cmjohnson) [14:36:59] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: Decommission db1040 - https://phabricator.wikimedia.org/T164057#3248213 (10Cmjohnson) 05Open>03Resolved wiped, removed from rack, switch was already updated, racktables updated. [14:37:04] hashar: lets just leave it as it is for now and I'lll keep refreshing and hope resourceloader sorts itself out and then i can do the config later if everything sorts itself out [14:37:10] 06Operations, 10ops-eqiad, 10DBA, 10Phabricator, 13Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3248218 (10Marostegui) I am going to leave m3-slave pointing to the codfw master, until tomorrow just in case. If the host goes fine overnight, I will revert... [14:37:10] addshore: sounds good [14:37:17] addshore: and tomorrow we can enable it wiki wide [14:37:32] maybe this eeeeevening, things might be sorted out after the train [14:37:39] but i can do that :) [14:37:44] _joe_: mw1264 down with hardware error, but apparently Chris just fixed it [14:38:34] moritzm: you restarting cassandra on rb2005? [14:39:33] mobrovac: no, not yet, but I was about to shortly, why? [14:39:42] addshore: neat. 
And maybe try to pock someone knowing RL a bit more [14:39:56] !log varnish: varnishadm runtime set default_ttl=86400 for text+upload fe+be layers via cumin, to match deployed start-time changes in https://gerrit.wikimedia.org/r/#/c/352826/ [14:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:16] moritzm: cass on rb2005 is crapping out then :) [14:40:43] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:41:08] hashar: yup! [14:41:09] (03PS1) 10Volans: Transports: move BaseWorker helper methods to module functions [software/cumin] - 10https://gerrit.wikimedia.org/r/352841 (https://phabricator.wikimedia.org/T164838) [14:41:11] (03PS1) 10Volans: Transports: add Command class [software/cumin] - 10https://gerrit.wikimedia.org/r/352842 (https://phabricator.wikimedia.org/T164838) [14:41:13] (03PS1) 10Volans: Transports: use Command class for commands [software/cumin] - 10https://gerrit.wikimedia.org/r/352843 (https://phabricator.wikimedia.org/T164838) [14:41:15] (03PS1) 10Volans: Transports: allow to specify a timeout per Command [software/cumin] - 10https://gerrit.wikimedia.org/r/352844 (https://phabricator.wikimedia.org/T164838) [14:41:16] 06Operations, 10Monitoring, 10netops, 13Patch-For-Review: nagios monitor transit/peering links and alert on low/high traffic - https://phabricator.wikimedia.org/T80273#3248224 (10ayounsi) 05Open>03Resolved As we have some link with 0.2% of outbound traffic, I added a LibreNMS rule to alert if any traff... [14:41:19] (03PS1) 10Volans: Transports: allow to specify exit codes per Command [software/cumin] - 10https://gerrit.wikimedia.org/r/352845 (https://phabricator.wikimedia.org/T164838) [14:42:17] mobrovac: that's unrelated to me :-) I had rebooted 2002-2004, but withheld 2005 so far since storage hints were stilly pretty high [14:42:54] kk thnx moritzm [14:44:13] maybe the reboot of 2002-2004 distabilized it? 
[14:44:31] mobrovac: should we bring up the cassandra instances one at the time [14:44:34] ? [14:45:04] CFisch_WMDE: addshore: so we hold on enabling two col everywhere right ? ( https://gerrit.wikimedia.org/r/#/c/350847/ ) [14:45:09] not likely elukey, no logs that indicate that in CP [14:45:26] (03PS1) 10Marostegui: Revert "wmnet: Point m3 slave to codfw master" [dns] - 10https://gerrit.wikimedia.org/r/352846 [14:45:27] hashar: yupo [14:45:29] (03PS2) 10Marostegui: Revert "wmnet: Point m3 slave to codfw master" [dns] - 10https://gerrit.wikimedia.org/r/352846 [14:45:40] (03CR) 10Hashar: "On hold since bunch of l10n messages from I04083a518bd0b17aefe423291702b61e804acbb3 do not appear properly on wikis. Probably related to " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350847 (owner: 10Addshore) [14:45:42] neat [14:45:46] !log European SWAT completed [14:45:51] ty [14:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:07] elukey: puppet should bring cassandra back, but we can do it manually as well [14:46:11] urandom: are you on rb2005? [14:46:29] he is not logged in [14:46:35] (on rb2005) [14:46:57] yeah, I meant whether he is looking into it [14:47:00] (03PS5) 10Madhuvishy: gridengine: Follow up - delete old maintenance scripts and tracker/collector puppet code [puppet] - 10https://gerrit.wikimedia.org/r/352301 (https://phabricator.wikimedia.org/T162955) [14:47:13] let's bring up one instance at the time ok? I'll disable puppet [14:47:29] mobrovac: --^ [14:47:46] 06Operations, 10Monitoring, 10netops, 13Patch-For-Review: Juniper monitoring - https://phabricator.wikimedia.org/T83992#3248266 (10ayounsi) A new check has been added to LibreNMS to monitor "show system alarms" (yellow and red) As well as all the moving parts (PSU/FAN/etc...) 
[14:47:52] or maybe we can leave puppet disabled and wait for Eric [14:48:09] (03CR) 10jerkins-bot: [V: 04-1] gridengine: Follow up - delete old maintenance scripts and tracker/collector puppet code [puppet] - 10https://gerrit.wikimedia.org/r/352301 (https://phabricator.wikimedia.org/T162955) (owner: 10Madhuvishy) [14:48:27] elukey: ok, let's disable puppet there first and then bring one of the instances back up [14:48:46] 06Operations, 10Traffic, 13Patch-For-Review: Merge cache_maps into cache_upload functionally - https://phabricator.wikimedia.org/T164608#3248269 (10BBlack) @elukey - I think the only real analytics fallout here is that the data that is currently feeding to you as `webrequest_maps` will become data that's mix... [14:49:04] ACKNOWLEDGEMENT - MD RAID on elastic2020 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T164841 [14:49:07] 06Operations, 10ops-codfw: Degraded RAID on elastic2020 - https://phabricator.wikimedia.org/T164841#3248271 (10ops-monitoring-bot) [14:49:50] 06Operations, 10Traffic, 13Patch-For-Review: Merge cache_maps into cache_upload functionally - https://phabricator.wikimedia.org/T164608#3248278 (10elukey) I am going to ask to my team and report back asap! [14:50:08] papaul: are you working on elastic2020 ? (see alert above) [14:50:34] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2096819 [14:51:41] mobrovac: wasn't that just a restart to upgrade the jdk? [14:51:53] (03CR) 10Marostegui: [C: 032] Revert "wmnet: Point m3 slave to codfw master" [dns] - 10https://gerrit.wikimedia.org/r/352846 (owner: 10Marostegui) [14:52:00] mobrovac: it was all 3 instances, and the jdk version is newer [14:52:20] urandom: which host? 
[14:52:22] urandom: no, it wasn't a restart [14:52:23] mobrovac: doh, nevermind [14:52:52] mobrovac: 2005 [14:53:34] urandom: disabled puppet on rb2005 [14:53:38] just as FYI [14:53:47] I have already installed the new packages on 2005 in anticipation of a reboot, but then withheld since storage hints were still kinda high [14:54:04] gehel: yes had to get some info on the ssd [14:54:07] but the cassandra processes running are still using the old Java [14:54:25] s/are/were/ - they died :) [14:54:25] papaul: good! [14:54:36] gehel:thanks [14:55:10] !log repooled mw1264 after hardware error has been fixed (and scap pull) [14:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:54] mobrovac, elukey: so... just to make sure we don't clash, are either of you doing anything with this atm? [14:56:40] it looks like all 3 instances are up atm [14:57:45] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1003 replacement - https://phabricator.wikimedia.org/T159839#3248307 (10Cmjohnson) [14:58:28] urandom: not doing anything [14:58:45] mobrovac? [14:58:56] it looks like -a is draining, or tried to? [14:59:08] 06Operations, 13Patch-For-Review, 15User-fgiunchedi: Delete non-used and/or non-requested thumbnail sizes periodically - https://phabricator.wikimedia.org/T162796#3248318 (10fgiunchedi) @gilles indeed ! I'll figure out the best way to get a list of objects to delete and batch-delete I took another look at t... [14:59:38] i think this is the case for all 3 [14:59:58] moritzm: was a restart of the 2005 instances attempted? [15:00:42] no, haven't done anything on those expect installing the new jdk/kernel [15:00:55] it looks like they've been drained, but left running [15:01:01] i'm going to start restarting them [15:01:06] woa [15:01:12] ah yes, they were in fact drained, forgot about that [15:01:32] in preparation of the reboot, which was then postponed due to storage hints [15:01:54] storage hints? 
[15:02:13] RECOVERY - cassandra-a SSL 10.192.48.46:7001 on restbase2005 is OK: SSL OK - Certificate restbase2005-a valid until 2017-09-12 15:35:32 +0000 (expires in 126 days) [15:02:24] !log starting instances restbase2005 [15:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:43] RECOVERY - cassandra-a CQL 10.192.48.46:9042 on restbase2005 is OK: TCP OK - 0.000 second response time on 10.192.48.46 port 9042 [15:03:24] RECOVERY - cassandra-b SSL 10.192.48.47:7001 on restbase2005 is OK: SSL OK - Certificate restbase2005-b valid until 2017-09-12 15:35:35 +0000 (expires in 126 days) [15:03:50] urandom: as per the linked grafana board from https://wikitech.wikimedia.org/wiki/Service_restarts#Cassandra_.28as_used_in_aqs_and_restbase.29 [15:03:53] RECOVERY - cassandra-c SSL 10.192.48.48:7001 on restbase2005 is OK: SSL OK - Certificate restbase2005-c valid until 2017-09-12 15:35:38 +0000 (expires in 126 days) [15:04:03] RECOVERY - cassandra-c CQL 10.192.48.48:9042 on restbase2005 is OK: TCP OK - 0.000 second response time on 10.192.48.48 port 9042 [15:04:13] RECOVERY - cassandra-b CQL 10.192.48.47:9042 on restbase2005 is OK: TCP OK - 0.002 second response time on 10.192.48.47 port 9042 [15:07:32] (03PS1) 10Cmjohnson: Removing dns entries for decom host gadolinium T164036 [dns] - 10https://gerrit.wikimedia.org/r/352848 [15:07:49] (03CR) 10Cmjohnson: [C: 032] Removing dns entries for decom host gadolinium T164036 [dns] - 10https://gerrit.wikimedia.org/r/352848 (owner: 10Cmjohnson) [15:08:32] 06Operations, 10ops-eqiad, 10hardware-requests, 13Patch-For-Review: decommission gadolinium - https://phabricator.wikimedia.org/T164036#3248377 (10Cmjohnson) [15:09:10] 06Operations, 10ops-eqiad, 10hardware-requests, 13Patch-For-Review: decommission gadolinium - https://phabricator.wikimedia.org/T164036#3219495 (10Cmjohnson) 05Open>03Resolved Completed [15:12:27] urandom: how did you check that the instances were in drained state? 
[15:12:34] (curiosity) [15:15:36] !log uploaded kubernetes 1.5.5-1+wmf1 to stretch-wikimedia/experimental [15:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:13] (PS1) Cmjohnson: Removing mgmt dns entries for decom hosts europium and ytterbium t153918 and t141415:wq [dns] - https://gerrit.wikimedia.org/r/352849 [15:20:35] !log installing rpcbind/libtirpc security updates on ms1001 [15:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:09] (CR) Cmjohnson: [C: 2] Removing mgmt dns entries for decom hosts europium and ytterbium t153918 and t141415:wq [dns] - https://gerrit.wikimedia.org/r/352849 (owner: Cmjohnson) [15:21:59] (CR) Andrew Bogott: "We can deploy this on labtestwikitech beforehand as a test if anyone is worried about breakage." [mediawiki-config] - https://gerrit.wikimedia.org/r/352722 (https://phabricator.wikimedia.org/T53642) (owner: Reedy) [15:24:03] RECOVERY - mediawiki-installation DSH group on mw1264 is OK: OK [15:25:14] Operations, Patch-For-Review, Technical-Debt: Retire Torrus - https://phabricator.wikimedia.org/T87840#3248433 (fgiunchedi) Grafana supports mixed datasources so in theory we can combine graphite/prometheus in a single graph/panel, I haven't tried it though. WRT retention indeed after the one year re... 
[15:25:57] Operations, ops-eqiad, hardware-requests: decom ytterbium (datacenter) - https://phabricator.wikimedia.org/T141415#3248445 (Cmjohnson) [15:26:11] Operations, Gerrit, Release-Engineering-Team: replace gerrit server (ytterbium) with jessie server (lead) - https://phabricator.wikimedia.org/T125018#3248449 (Cmjohnson) [15:26:13] Operations, ops-eqiad, hardware-requests: decom ytterbium (datacenter) - https://phabricator.wikimedia.org/T141415#2497714 (Cmjohnson) Open>Resolved racktables updated [15:28:00] (PS1) Giuseppe Lavagetto: restbase: migration to role/profile for the dev cluster [puppet] - https://gerrit.wikimedia.org/r/352851 [15:28:52] Operations: Setup europium as locke replacement - https://phabricator.wikimedia.org/T82239#3248461 (Cmjohnson) [15:28:54] Operations, ops-eqiad: decom europium (was: reclaim europium) - https://phabricator.wikimedia.org/T153918#3248458 (Cmjohnson) Open>Resolved a:Cmjohnson updated racktables [15:29:15] (CR) jerkins-bot: [V: -1] restbase: migration to role/profile for the dev cluster [puppet] - https://gerrit.wikimedia.org/r/352851 (owner: Giuseppe Lavagetto) [15:33:22] <_joe_> uhm why didn't my local checks catch this? 
[15:33:33] PROBLEM - Host analytics1027 is DOWN: PING CRITICAL - Packet loss = 100% [15:33:53] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2003894 [15:34:13] 1027 is decommed, silencing it [15:34:23] I was actually about to ask that [15:34:32] I know you were decomming a bunch of hosts [15:34:36] *knew [15:35:03] elukey: sorry I just assumed it was already removed [15:35:10] nono my bad sorry [15:38:42] (PS2) Giuseppe Lavagetto: restbase: migration to role/profile for the dev cluster [puppet] - https://gerrit.wikimedia.org/r/352851 [15:39:52] (PS1) Cmjohnson: Remvoing all dns entries for decom host analyitcs1027 T161597 [dns] - https://gerrit.wikimedia.org/r/352853 [15:42:04] (CR) Cmjohnson: [C: 2] Remvoing all dns entries for decom host analyitcs1027 T161597 [dns] - https://gerrit.wikimedia.org/r/352853 (owner: Cmjohnson) [15:43:01] Operations, ops-eqiad, Analytics, DC-Ops, Patch-For-Review: Decom/Reclaim analytics1027 - https://phabricator.wikimedia.org/T161597#3248535 (Cmjohnson) [15:44:38] (PS3) Giuseppe Lavagetto: restbase: migration to role/profile for the dev cluster [puppet] - https://gerrit.wikimedia.org/r/352851 [15:44:40] (PS3) Elukey: Remove mgmt dns records for mw2090->mw2096 [dns] - https://gerrit.wikimedia.org/r/350813 (https://phabricator.wikimedia.org/T161488) [15:45:01] cmjohnson1: if you are updating the DNS we can also add https://gerrit.wikimedia.org/r/#/c/350813/ [15:45:49] Operations, LDAP-Access-Requests, Labs, Labs-Infrastructure: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#3248545 (hashar) I have excluded the servicegroups based on an earlier comment. `/bin/false` looks legit indeed. Thanks. I have dig in puppe... 
[15:46:58] (PS1) Cmjohnson: Removing analytics1027 from dhcpd file T161597 [puppet] - https://gerrit.wikimedia.org/r/352856 [15:47:24] (CR) Cmjohnson: [C: 2] Remove mgmt dns records for mw2090->mw2096 [dns] - https://gerrit.wikimedia.org/r/350813 (https://phabricator.wikimedia.org/T161488) (owner: Elukey) [15:47:37] Operations, Labs: Investigate ceasing self-service new Trusty instance creation in Labs - https://phabricator.wikimedia.org/T161899#3248549 (Andrew) [15:47:43] super thanks [15:48:01] yw [15:48:15] Operations, ops-codfw, DC-Ops, Patch-For-Review, User-Elukey: Reclaim/Decommission mw2090->mw2096 (OOW) - https://phabricator.wikimedia.org/T161488#3248552 (elukey) Open>Resolved [15:48:17] (CR) Cmjohnson: [C: 2] Removing analytics1027 from dhcpd file T161597 [puppet] - https://gerrit.wikimedia.org/r/352856 (owner: Cmjohnson) [15:49:25] Operations, LDAP-Access-Requests, Labs, Labs-Infrastructure: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#3248558 (hashar) And OpenStackManager uses /bin/bash 6360ed954ee5488d0a4be3bcc156bdc0bc7543f4 ``` -$wgOpenStackManagerLDAPDefaultShell = '/usr/... 
[15:50:40] (PS4) Giuseppe Lavagetto: restbase: migration to role/profile for the dev cluster [puppet] - https://gerrit.wikimedia.org/r/352851 [15:53:53] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 73581 [15:54:11] Operations, Datasets-General-or-Unknown, Dumps-Generation, Labs, hardware-requests: Eqiad: Hardware request for labstore1006/7, dataset1002/3 - https://phabricator.wikimedia.org/T161311#3248606 (faidon) [15:54:25] Operations, Analytics, Analytics-Cluster, hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3248609 (faidon) [15:54:32] Operations, hardware-requests: EQIAD: 2 hardware access request for kubernetes-staging - https://phabricator.wikimedia.org/T162257#3248613 (faidon) [15:55:18] Operations, Goal, Patch-For-Review, kubernetes: Eliminate SPOFs in the existing eqiad kubernetes infrastructure - https://phabricator.wikimedia.org/T162040#3248617 (faidon) [15:57:20] Operations, OCG-General, Patch-For-Review: Tons of OCG jobs caused a massive increase in queue length - https://phabricator.wikimedia.org/T147211#3248627 (elukey) Open>Resolved [15:59:07] Operations, Traffic: Unprovision cache_misc @ ulsfo - https://phabricator.wikimedia.org/T164610#3248632 (RobH) So I'll add a few options/items for review: * The router/switches are racked at the tops of the racks. If we move them to mid rack level, there are no power plugs at the middle of the rack to... [15:59:16] Operations, ops-codfw: codfw: kubernetes200[1-4] racking and onsite setup task - https://phabricator.wikimedia.org/T164851#3248633 (Papaul) [15:59:39] (PS1) Giuseppe Lavagetto: Add profile::restbase fake private data [labs/private] - https://gerrit.wikimedia.org/r/352858 [16:00:04] godog, moritzm, and _joe_: Dear anthropoid, the time has come. 
Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170509T1600). [16:00:08] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Add profile::restbase fake private data [labs/private] - 10https://gerrit.wikimedia.org/r/352858 (owner: 10Giuseppe Lavagetto) [16:00:29] nothing for puppet swat [16:00:59] !log stopping Hadoop daemons and shutting down analytics[1032-1033,1040].eqiad.wmnet - T132256 [16:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:07] 06Operations, 13Patch-For-Review, 15User-fgiunchedi: Delete non-used and/or non-requested thumbnail sizes periodically - https://phabricator.wikimedia.org/T162796#3248654 (10Gilles) The top 100 most requested sizes represent 91.21% of all requests. The remaining long tail (any size not in that 100 whitelist)... [16:01:08] T132256: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256 [16:02:09] 06Operations, 10ops-codfw: codfw: kubernetes200[1-4] racking and onsite setup task - https://phabricator.wikimedia.org/T164851#3248655 (10Papaul) [16:05:03] PROBLEM - Check Varnish expiry mailbox lag on cp1073 is CRITICAL: CRITICAL: expiry mailbox lag is 2057794 [16:08:21] (03PS3) 10Alexandros Kosiaris: lvs: Add the kubernetes master service/cluster [puppet] - 10https://gerrit.wikimedia.org/r/352580 (https://phabricator.wikimedia.org/T162040) [16:08:23] (03PS3) 10Alexandros Kosiaris: Migrate to using kubemaster.svc.$site.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/352581 [16:08:25] (03PS1) 10Alexandros Kosiaris: Use a service cert for kubernetes masters [puppet] - 10https://gerrit.wikimedia.org/r/352860 [16:08:54] !log playing with mw2146 for T163674 [16:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:02] T163674: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674 [16:15:21] (03PS6) 10Madhuvishy: gridengine: Follow up - 
delete old maintenance scripts and tracker/collector puppet code [puppet] - 10https://gerrit.wikimedia.org/r/352301 (https://phabricator.wikimedia.org/T162955) [16:20:20] i do not know if this is done, but query.wikidata.org is consistently throwing 500 errors [16:20:26] *known, not done [16:20:55] it is the high levels at https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json?refresh=5m&orgId=1&from=1494340060234&to=1494346752631 [16:22:10] it seems to be /bigdata/ldf [16:28:44] 06Operations, 10LDAP-Access-Requests, 06Labs, 10Labs-Infrastructure: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#3248716 (10demon) Sillyshell entries can probably be replaced with `/bin/bash`, it doesn't exist anymore. It was a dummy wrapper that we used for... [16:33:23] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#3248720 (10BBlack) APNIC has a good writeup here (first half is TCP history redux, second half goes into interesting details and new data on BBR)... [16:38:39] 06Operations, 10Pybal, 10Traffic, 10netops: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3248727 (10elukey) Tried to strace all the nginx processes and this is the relevant part: ``` [pid 36881] connect(97, {sa_family=AF_INET, sin_port=htons(80), sin_addr=in... 
[16:40:36] (03PS1) 10Giuseppe Lavagetto: Fix typo [labs/private] - 10https://gerrit.wikimedia.org/r/352862 [16:41:55] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Fix typo [labs/private] - 10https://gerrit.wikimedia.org/r/352862 (owner: 10Giuseppe Lavagetto) [16:54:22] (03PS3) 10Elukey: check_hadoop_yarn_node_state: add syslog logging for CRITICAL states [puppet] - 10https://gerrit.wikimedia.org/r/347857 [16:54:29] (03CR) 10Elukey: [V: 032 C: 032] check_hadoop_yarn_node_state: add syslog logging for CRITICAL states [puppet] - 10https://gerrit.wikimedia.org/r/347857 (owner: 10Elukey) [16:54:53] Reedy: should we try to swat the patch to disable SMW on wikitech? [16:55:09] I'm busy until the SWAT window but I'm open then [16:56:03] or we could try it after 21:30Z (lots of meetings for me today) [16:58:15] <_joe_> lots of fun! [16:58:34] SMW = semantic MW correct? [16:58:59] (03PS5) 10Giuseppe Lavagetto: restbase: migration to role/profile for the dev cluster [puppet] - 10https://gerrit.wikimedia.org/r/352851 [16:59:04] yes, we are in the final stages of T53642 [16:59:04] T53642: Get rid of SemanticMediaWiki/SRF/SF from wikitech.wikimedia.org - https://phabricator.wikimedia.org/T53642 [16:59:23] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:59:47] Thats what I thought... good luck.... [16:59:49] here we go again with upload errors [17:00:05] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170509T1700). 
[17:00:24] no ores deployment [17:00:36] :) [17:01:33] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [17:02:43] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [17:04:38] (03PS6) 10Giuseppe Lavagetto: restbase: migration to role/profile for the dev cluster [puppet] - 10https://gerrit.wikimedia.org/r/352851 [17:04:43] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [17:06:33] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [17:08:29] mostly thumbs ? [17:09:53] PROBLEM - HHVM rendering on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time [17:10:53] RECOVERY - HHVM rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 74372 bytes in 0.246 second response time [17:11:33] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [17:12:18] -- FetchError Could not get storage [17:12:26] -- ExpKill LRU_Fail [17:12:31] this is from cp1072 [17:12:37] bblack,ema --^ [17:15:03] RECOVERY - Check Varnish expiry mailbox lag on cp1073 is OK: OK: expiry mailbox lag is 5 [17:15:50] maybe T145661 [17:15:51] T145661: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661 [17:20:31] cp1074 as well [17:20:40] so might be confined to some upload backends [17:20:51] cp1049 seems fine [17:22:23] cp1064 seems fine as well [17:23:34] !log Preparing to branch 1.30.0-wmf.1 [ T162954 ] [17:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:42] T162954: MW-1.30.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T162954 [17:23:43] RECOVERY - Esams HTTP 5xx 
reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:24:29] this graph might help https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=29&fullscreen&orgId=1&from=now-3h&to=now&var-server=cp1074&var-datasource=eqiad%20prometheus%2Fops [17:25:57] !log executing varnish-backend-restart on cp1074 as attempt to mitigate "FetchError Could not get storage" and "ExpKill LRU_Fail" - T145661 [17:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:05] T145661: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661 [17:27:04] 06Operations, 10ops-codfw, 10hardware-requests: reclaim tempdb2001(WMF6407) to spares - https://phabricator.wikimedia.org/T164513#3248832 (10Papaul) Disk wipe in progress [17:27:37] 06Operations, 10ops-codfw: codfw: kubernetes200[1-4] racking and onsite setup task - https://phabricator.wikimedia.org/T164851#3248836 (10Papaul) p:05Triage>03Normal [17:27:53] 06Operations, 10ops-codfw: codfw: kubernetes200[1-4] racking and onsite setup task - https://phabricator.wikimedia.org/T164851#3248633 (10Papaul) [17:28:17] cp1074 completed the restart, looks better from the graph [17:29:20] !log executing varnish-backend-restart on cp1072 as attempt to mitigate "FetchError Could not get storage" and "ExpKill LRU_Fail" - T145661 [17:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:53] PROBLEM - puppet last run on mw1194 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:30:33] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0 [17:32:14] cp1072 done as well [17:36:05] from https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?var-site=All&var-cache_type=upload&var-status_type=5&orgId=1&from=now-3h&to=now the 50x seems down now [17:36:43] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 0 [17:38:23] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:38:51] \o/ [17:39:15] let's wait for all the alarms to recover, but I'd say the culprit was cp107[24] [17:41:33] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:42:47] all right going afk now, ping me if needed! [17:58:53] RECOVERY - puppet last run on mw1194 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [18:04:15] bd808: We might aswell just try it yeah. Might be premature to stop branching just yet (though, this weeks branch should've been done already) [18:04:44] Well, we don't actually branch them :p [18:04:52] We just include said branches/tags in our branch :) [18:06:42] Well, yes, that :p [18:09:26] Reedy: If you have time to help I think we should take a time slot before the evening SWAT and try it out. 
We can stage, pull to testlabswiki to see if it melts obviously and then do wikitech [18:09:39] Sure [18:09:51] I'd try now, but I'm kind of hungry and have a pile of meetings still this afternoon [18:10:05] I'll put something in the deploy calenday [18:10:08] *r [18:10:52] (03PS1) 10Papaul: DNS: Add mgmt and production DNS entries for kubernetes200[1-4] [dns] - 10https://gerrit.wikimedia.org/r/352869 [18:10:59] (03CR) 10jerkins-bot: [V: 04-1] DNS: Add mgmt and production DNS entries for kubernetes200[1-4] [dns] - 10https://gerrit.wikimedia.org/r/352869 (owner: 10Papaul) [18:13:47] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban, 15User-Elukey: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#3249024 (10Cmjohnson) analytics[1032-1033,1040].eqiad.wmnet have had the thermal paste replaced. One observation on all 3 is that... [18:14:21] mobrovac, starting now [18:14:29] k [18:15:14] !log ssastry@tin Started deploy [parsoid/deploy@0459ae3]: Updating Parsoid to 9d8badc8 [18:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:48] jouncebot: refresh [18:16:48] Reedy: I grabbed an hour at 22:00Z [18:16:50] I refreshed my knowledge about deployments. 
[18:17:20] cool [18:22:23] !log ssastry@tin Finished deploy [parsoid/deploy@0459ae3]: Updating Parsoid to 9d8badc8 (duration: 07m 09s) [18:22:30] mobrovac, success [18:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:28] subbu: \o/ [18:26:57] !log updated Parsoid to 9d8badc8 (T151277) [18:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:04] T151277: [[Media: ]] links are parsed as [[File: ]] links - https://phabricator.wikimedia.org/T151277 [18:29:06] (03PS2) 10Dzahn: DNS: Add mgmt and production DNS entries for kubernetes200[1-4] [dns] - 10https://gerrit.wikimedia.org/r/352869 (owner: 10Papaul) [18:36:21] (03PS2) 10BBlack: varnish: move nuke/lru and exp_thread stuff to all clusters [puppet] - 10https://gerrit.wikimedia.org/r/352827 [18:36:30] (03CR) 10BBlack: [V: 032 C: 032] varnish: move nuke/lru and exp_thread stuff to all clusters [puppet] - 10https://gerrit.wikimedia.org/r/352827 (owner: 10BBlack) [18:39:45] !log varnish: manually setting runtime lru_interval / nuke_limit via varnishadm for all clusters' backends to match start-time change in https://gerrit.wikimedia.org/r/#/c/352827/ [18:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170509T1900). Please do the needful. [19:07:02] (03PS1) 10Sfic: Import sources on dty.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352873 (https://phabricator.wikimedia.org/T164573) [19:07:47] 06Operations, 10Pybal, 10Traffic, 10netops: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3249214 (10BBlack) You might want to look at the other side of the nginx proxy as well. Perhaps apache is terminating its connection to the local nginx with RST, and thi... 
[19:16:33] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [19:19:33] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR [19:30:46] (03CR) 10Dzahn: "for mgmt you are adding forward and reverse but for prod it's only reverse it seems" [dns] - 10https://gerrit.wikimedia.org/r/352869 (owner: 10Papaul) [19:42:24] !log maxsem@tin Started deploy [kartotherian/deploy@740235c]: https://gerrit.wikimedia.org/r/#/c/352886/ [19:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:59] !log maxsem@tin Finished deploy [kartotherian/deploy@740235c]: https://gerrit.wikimedia.org/r/#/c/352886/ (duration: 05m 35s) [19:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:22] (03CR) 10Chad: "Awight: If I put this on SWAT would you or someone from your team be able to help me verify it doesn't break anything? Would like to knock" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342857 (https://phabricator.wikimedia.org/T147479) (owner: 10Chad) [20:03:23] PROBLEM - puppet last run on analytics1033 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 28 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Service[hadoop-yarn-nodemanager],Service[hadoop-hdfs-datanode] [20:04:18] checking --^ [20:04:24] RECOVERY - puppet last run on analytics1033 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [20:07:39] (03PS1) 10Amire80: Add namespace aliases for Hebrew Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352889 (https://phabricator.wikimedia.org/T164858) [20:14:21] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban, 15User-Elukey: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#3249434 (10elukey) @Cmjohnson thanks! analytics1040 shows up memory errors on boot, I wasn't able to power it on.. Do you mind to c... [20:14:23] (03PS2) 10Dzahn: ocg: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352664 [20:19:30] 06Operations, 10hardware-requests: codfw: (1) labtest puppetmaster - https://phabricator.wikimedia.org/T164515#3249473 (10chasemp) [20:20:04] (03CR) 10Dzahn: [C: 032] ocg: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352664 (owner: 10Dzahn) [20:22:09] (03PS1) 10Volans: ClusterShell: allow to specify exit codes per Command [software/cumin] - 10https://gerrit.wikimedia.org/r/352892 (https://phabricator.wikimedia.org/T164833) [20:22:39] !log twentyafterfour@tin Started scap: MediaWiki sync new branch wmf/1.30.0-wmf.1 + localization cache and deploy to testwikis refs T162954 [20:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:47] T162954: MW-1.30.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T162954 [20:25:30] (03PS2) 10Dzahn: snapshot: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352657 [20:25:51] (03CR) 10Dzahn: [C: 032] snapshot: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352657 (owner: 10Dzahn) [20:28:57] 06Operations, 10ops-eqiad, 
10Analytics, 06DC-Ops, 13Patch-For-Review: Decom/Reclaim analytics1027 - https://phabricator.wikimedia.org/T161597#3136552 (10Dzahn) @elukey system is still in Icinga, causing alerts, has the puppet/salt part been done ? decom tasks should ideally have the full check list fr... [20:30:09] ACKNOWLEDGEMENT - Host analytics1027 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T161597 [20:30:14] (03PS1) 10Madhuvishy: sge: Revamp queue configuration puppet [puppet] - 10https://gerrit.wikimedia.org/r/352895 [20:31:37] (03CR) 10jerkins-bot: [V: 04-1] sge: Revamp queue configuration puppet [puppet] - 10https://gerrit.wikimedia.org/r/352895 (owner: 10Madhuvishy) [20:33:48] 06Operations, 10Traffic, 10Wikimedia-Site-requests, 07HTTPS: SSL error for https://wikispecies.org/ - https://phabricator.wikimedia.org/T164868#3249552 (10Framawiki) [20:34:17] 06Operations, 10Traffic, 10Wikimedia-Site-requests, 07HTTPS: SSL error for https://wikispecies.org/ - https://phabricator.wikimedia.org/T164868#3249567 (10Framawiki) [20:35:14] (03CR) 10Andrew Bogott: [C: 032] Nova policy: Open up quota-related queries [puppet] - 10https://gerrit.wikimedia.org/r/352606 (https://phabricator.wikimedia.org/T164332) (owner: 10Andrew Bogott) [20:35:30] (03CR) 10Andrew Bogott: [C: 04-1] "I want to read the code and figure out what those 'limits' policies actually do before we merge this." 
[puppet] - 10https://gerrit.wikimedia.org/r/352606 (https://phabricator.wikimedia.org/T164332) (owner: 10Andrew Bogott) [20:35:56] (03PS2) 10Madhuvishy: sge: Revamp queue configuration puppet [puppet] - 10https://gerrit.wikimedia.org/r/352895 [20:37:18] (03CR) 10jerkins-bot: [V: 04-1] sge: Revamp queue configuration puppet [puppet] - 10https://gerrit.wikimedia.org/r/352895 (owner: 10Madhuvishy) [20:41:40] (03PS3) 10Madhuvishy: sge: Revamp queue configuration puppet [puppet] - 10https://gerrit.wikimedia.org/r/352895 [20:42:47] (03CR) 10jerkins-bot: [V: 04-1] sge: Revamp queue configuration puppet [puppet] - 10https://gerrit.wikimedia.org/r/352895 (owner: 10Madhuvishy) [20:45:42] (03PS3) 10Papaul: DNS: Add forward production DNS entries for kubernetes200[1-4] [dns] - 10https://gerrit.wikimedia.org/r/352869 [20:45:45] 06Operations, 10Domains, 10Traffic, 10Wikimedia-Site-requests, 07HTTPS: SSL error for https://wikispecies.org/ - https://phabricator.wikimedia.org/T164868#3249633 (10Urbanecm) [20:46:54] (03PS4) 10Madhuvishy: sge: Revamp queue configuration puppet [puppet] - 10https://gerrit.wikimedia.org/r/352895 [20:50:43] (03PS1) 10BBlack: ssl_ciphersuite: remove DHE-RSA-AES128-GCM-SHA256 [puppet] - 10https://gerrit.wikimedia.org/r/352924 [20:52:20] !log twentyafterfour@tin Finished scap: MediaWiki sync new branch wmf/1.30.0-wmf.1 + localization cache and deploy to testwikis refs T162954 (duration: 29m 41s) [20:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:28] T162954: MW-1.30.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T162954 [20:53:24] (03PS4) 10Dzahn: DNS: Add mgmt and production DNS entries for kubernetes200[1-4] [dns] - 10https://gerrit.wikimedia.org/r/352869 (owner: 10Papaul) [20:53:40] (03CR) 10Dzahn: [C: 032] DNS: Add mgmt and production DNS entries for kubernetes200[1-4] [dns] - 10https://gerrit.wikimedia.org/r/352869 (owner: 10Papaul) [20:55:44] (03PS1) 10BBlack: sslcert: regenerate dhparam.pem 
[puppet] - 10https://gerrit.wikimedia.org/r/352932 [20:56:23] (03CR) 10BBlack: [C: 032] ssl_ciphersuite: remove DHE-RSA-AES128-GCM-SHA256 [puppet] - 10https://gerrit.wikimedia.org/r/352924 (owner: 10BBlack) [20:57:21] (03PS2) 10BBlack: sslcert: regenerate dhparam.pem [puppet] - 10https://gerrit.wikimedia.org/r/352932 [20:57:25] (03CR) 10BBlack: [V: 032 C: 032] sslcert: regenerate dhparam.pem [puppet] - 10https://gerrit.wikimedia.org/r/352932 (owner: 10BBlack) [21:14:33] PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:17:33] RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [21:20:14] (03PS3) 10Dzahn: snapshot: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352657 [21:21:32] (03PS4) 10Dzahn: snapshot: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352657 [21:21:42] (03CR) 10Dzahn: [V: 032 C: 032] snapshot: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352657 (owner: 10Dzahn) [21:22:31] i see "- 'DHE-RSA-AES128-GCM-SHA256', [21:22:35] :) [21:24:29] all remaining ciphers are ECDHE, pretty cool [21:25:53] (03Abandoned) 10Dzahn: ocg: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352633 (owner: 10Dzahn) [21:27:05] (03Abandoned) 10Dzahn: base::puppet: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352654 (owner: 10Dzahn) [21:30:08] (03PS1) 1020after4: group0 wikis to 1.30.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352962 [21:30:10] (03CR) 1020after4: [C: 032] group0 wikis to 1.30.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352962 (owner: 1020after4) [21:30:29] (03PS3) 10Dzahn: mariadb: clean up duplicate GRANTs for phstats user [puppet] - 10https://gerrit.wikimedia.org/r/348779 [21:31:10] (03Merged) 10jenkins-bot: group0 wikis to 
1.30.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352962 (owner: 1020after4) [21:31:22] (03CR) 10jenkins-bot: group0 wikis to 1.30.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352962 (owner: 1020after4) [21:32:32] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 wikis to 1.30.0-wmf.1 [21:32:37] !log group0 wikis to 1.30.0-wmf.1 refs T162954 [21:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:48] T162954: MW-1.30.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T162954 [21:33:45] (03CR) 10Dzahn: "thanks for merging the last change that granted new permissions and was a dependency. now manually rebased and ready to go as well at your" [puppet] - 10https://gerrit.wikimedia.org/r/348779 (owner: 10Dzahn) [21:36:09] (03PS6) 10Catrope: Enable RCFilters beta feature on all remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347045 (https://phabricator.wikimedia.org/T144458) [21:40:58] !log Mediawiki train group0 finished, will resume tomorrow with group 1 wikis. refs T162954 [21:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:07] T162954: MW-1.30.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T162954 [21:53:02] 06Operations, 10ops-codfw: Degraded RAID on heze - https://phabricator.wikimedia.org/T163087#3249822 (10Papaul) Hello Papaul, Thank you for contacting Dell EMC. The following information includes the applicable case and dispatch numbers related to our conversation: Service Request #: 948095122 Service... [22:00:04] bd808 and Reedy: Dear anthropoid, the time has come. Please deploy Undeploy Semantic* from wikitech wikis (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170509T2200). 
[22:00:11] * Reedy grins [22:00:28] * RainbowSprinkles grabs the popcorn [22:00:46] (03PS3) 10Reedy: Undeploy Semantic* from wikitech wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352722 (https://phabricator.wikimedia.org/T53642) [22:00:52] (03CR) 10Reedy: [C: 032] Undeploy Semantic* from wikitech wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352722 (https://phabricator.wikimedia.org/T53642) (owner: 10Reedy) [22:02:07] (03Merged) 10jenkins-bot: Undeploy Semantic* from wikitech wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352722 (https://phabricator.wikimedia.org/T53642) (owner: 10Reedy) [22:02:19] (03CR) 10jenkins-bot: Undeploy Semantic* from wikitech wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352722 (https://phabricator.wikimedia.org/T53642) (owner: 10Reedy) [22:02:59] * twentyafterfour cheers [22:03:37] !log reedy@tin Started scap: (no justification provided) [22:03:40] !log reedy@tin scap aborted: (no justification provided) (duration: 00m 03s) [22:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:12] hmm [22:04:16] andrewbogott: About? [22:04:16] ? [22:04:16] The last Puppet run was at Mon May 1 21:40:49 UTC 2017 (11511 minutes ago). Puppet is disabled. andrew messing with sudo/horizon gui [22:04:24] 22:03:15 Copying to labtestweb2001.wikimedia.org from naos.codfw.wmnet [22:04:40] Reedy: yep, I'm here [22:04:48] I don't think you need puppet to scap on that system do you? [22:04:59] andrewbogott: well, it's trying to pull from the wrong server [22:05:00] (I can stash and enable puppet if you need it but am in the middle of things) [22:05:05] oh, ok [22:05:07] um… stay tuned! [22:05:23] so i'd have to pull to tin, pull to naos, pull to labtestweb2001... [22:05:25] etc :) [22:06:08] Or we just break wikitech :P [22:06:48] oh, wait [22:06:51] scap pull tin? 
[22:06:53] o/ [22:07:04] $ scap pull tin.eqiad.wmnet [22:07:04] 22:06:57 Copying to labtestweb2001.wikimedia.org from tin.eqiad.wmnet [22:07:07] andrewbogott: nvm :) [22:07:12] Pulling from wrong host should be fine anyway [22:07:19] Cuz you master sync first [22:07:23] well anyway I re-enabled puppet so default should be fine [22:08:30] Er, actually the co-master might not have scap pull'd yet [22:08:31] Hmm [22:08:53] No semantic on https://labtestwikitech.wikimedia.org/wiki/Special:Version [22:08:59] "scap pull tin.eqiad.wmnet" would be correct [22:09:05] it worked too :P [22:09:14] Lots of obviously broken stuff on the test wiki, so they need removing out at some point [22:09:38] the crusty old main page is now full of busted {{#ask ...}} queries :) [22:09:50] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: kubernetes200[1-4] racking and onsite setup task - https://phabricator.wikimedia.org/T164851#3249939 (10Papaul) [22:10:18] That's actually a bug I should fix tho [22:10:28] We should scap pull on the co-master before everywhere else [22:10:40] Just to be safe, in case a node has an out-of-date master [22:10:48] https://labtestwikitech.wikimedia.org/w/index.php?title=Main_Page&type=revision&diff=28393&oldid=2 [22:11:45] I don't think I've got working 2fa on that wiki [22:13:10] I have a live session. should I be scared to log out and log back in? [22:15:58] I just logged in fine [22:17:52] bd808: I did a thing: https://phabricator.wikimedia.org/D635 [22:18:35] Reedy: I don't see anything seriously busted on labtestwikitech :) [22:20:09] !log reedy@tin Synchronized wmf-config/wikitech.php: Disable Semantic extensions (duration: 00m 42s) [22:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:47] wikitech seems a-ok to me. Anything funky in the error logs? [22:25:11] I can't read error.log on silver... [22:25:39] I don't have root there either. 
:/ [22:25:49] oooh, fancy new kibana [22:26:37] Does silver not log to logstash? :( [22:26:46] Reedy: it's owner mwdeploy:mwdeploy `sudo -u mwdeploy ` works to read it [22:26:52] Warning: Memcached::touch(): touch is only supported with binary protocol [22:27:00] it should go to logstash, yes [22:27:04] that one is known [22:27:07] yeah [22:27:26] that's the main one on logstash [22:28:28] some complaints in the logs from my using the service groups UI. Doesn't look related though [22:29:40] Im wondering is there a task for upgrading silver from trusty to jessie? [22:29:57] paladox: there is new hardware coming [22:30:02] if only phab had a search facility [22:30:05] so yes, but indirectly [22:30:07] thanks. [22:31:26] silver and californium are going to get shiny new replacements and actually become a HA pair if I recall the plans [22:31:58] mostly I want to make wikitech a SUL wiki though and put it in the normal cluster [22:32:38] PHP Notice: Undefined property: OpenStackNovaProject::$serviceUsers [22:33:01] PHP Notice: Undefined variable: servicemember_keys [22:33:03] bd808: At the very least, we can put it in the normal cluster soon (even if SUL unification takes longer) [22:33:04] Quite a few like this [22:33:24] Reedy: yeah. I think that was me. I'm guessing its been there for a while [22:33:48] RainbowSprinkles: we still need to remove OSM before it can go in the normal cluster [22:33:54] Yeah I know [22:33:57] Almost there! [22:34:09] this particular SUL unification *should* be a lot easier [22:34:23] Reedy i guess it must be this one https://github.com/wikimedia/mediawiki-extensions-OpenStackManager/blob/5f7df6921a011a5c54e243cff41073b8ee024268/nova/OpenStackNovaProject.php#L148 [22:34:27] at least for active users [22:38:36] We should move DynamicSidebar to just the general extension-list. 
[22:38:47] I could see other wikis wanting it [22:38:53] (03PS5) 10Madhuvishy: sge: Revamp queue configuration puppet [puppet] - 10https://gerrit.wikimedia.org/r/352895 [22:39:49] It's exactly 1 message (the description) [22:39:50] Yeah [22:39:52] Imma do that [22:40:09] (03CR) 10jerkins-bot: [V: 04-1] sge: Revamp queue configuration puppet [puppet] - 10https://gerrit.wikimedia.org/r/352895 (owner: 10Madhuvishy) [22:41:46] 06Operations, 10ops-eqiad, 10Analytics, 06DC-Ops, 13Patch-For-Review: Decom/Reclaim analytics1027 - https://phabricator.wikimedia.org/T161597#3250062 (10Cmjohnson) @dzahn, I jumped on this too early. None of those things were done prior to me doing my pet. Its now iwiped and all switch ports are disabled. [22:42:19] (03PS1) 10Chad: Move DynamicSidebar to extension-list and out of extension-list-wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352974 [22:43:07] https://gerrit.wikimedia.org/r/352973 should fix most of the crap in those errors [22:43:28] (03PS6) 10Madhuvishy: sge: Revamp queue configuration puppet [puppet] - 10https://gerrit.wikimedia.org/r/352895 [22:45:37] Reedy: Reviewed [22:45:56] (03CR) 10Chad: [C: 032] Move DynamicSidebar to extension-list and out of extension-list-wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352974 (owner: 10Chad) [22:46:13] !log reedy@tin Synchronized wmf-config/extension-list-wikitech: Consistency (duration: 00m 42s) [22:46:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:19] (03Merged) 10jenkins-bot: Move DynamicSidebar to extension-list and out of extension-list-wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352974 (owner: 10Chad) [22:48:28] (03CR) 10jenkins-bot: Move DynamicSidebar to extension-list and out of extension-list-wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352974 (owner: 10Chad) [22:48:33] T53642 is the sort of task where I don't like that phab tasks can't have multiple owners [22:48:33] 
T53642: Get rid of SemanticMediaWiki/SRF/SF from wikitech.wikimedia.org - https://phabricator.wikimedia.org/T53642 [22:48:54] bd808: sub-tasks, brother [22:49:13] !log demon@tin Started scap: rebuilding l10n for extension-list swap [22:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:13] greg-g: sure, but I actually really like these laundry list tasks too. Getting the ping when someone crosses something off often makes me want to do the same :) [22:54:09] bd808: that's the beauty of sub-tasks created from a bigger one, everyone is auto-subscribed ;) [22:54:34] * bd808 hands Reedy and RainbowSprinkles a glass of champagne to celebrate the demise of SMW on wikitech [22:54:40] yay! [22:55:11] now I need to get back to writing code that will move us closer to removing OSM too [22:55:43] PROBLEM - puppet last run on hydrogen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:59:43] (03PS2) 10Dzahn: rsyslog: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352656 [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170509T2300). Please do the needful. [23:00:04] RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[23:03:16] (PS7) Catrope: Enable RCFilters beta feature on all remaining wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/347045 (https://phabricator.wikimedia.org/T144458) [23:03:23] (CR) Catrope: [C: 2] Enable RCFilters beta feature on all remaining wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/347045 (https://phabricator.wikimedia.org/T144458) (owner: Catrope) [23:03:32] I'll do the SWAT since I'm the only customer [23:04:00] I'm mid-scap [23:04:00] (PS1) Dzahn: site.pp: remove analytics1027, decom [puppet] - https://gerrit.wikimedia.org/r/352978 [23:04:18] (PS1) Chad: DynamicSidebar: Use standard extension registration [mediawiki-config] - https://gerrit.wikimedia.org/r/352979 [23:04:22] Oh whoops [23:04:28] Sorry for not checking [23:04:29] I'll wait [23:04:35] (PS2) Dzahn: site.pp: remove analytics1027, decom [puppet] - https://gerrit.wikimedia.org/r/352978 [23:04:44] Great excuse for me to delay the SWAT and have some of Jeff's goodbye ice cream [23:04:51] (CR) Dzahn: [C: 2] site.pp: remove analytics1027, decom [puppet] - https://gerrit.wikimedia.org/r/352978 (owner: Dzahn) [23:05:08] RoanKattouw: No worries, I'm in the middle of a jfdi thing [23:06:07] (CR) jerkins-bot: [V: -1] DynamicSidebar: Use standard extension registration [mediawiki-config] - https://gerrit.wikimedia.org/r/352979 (owner: Chad) [23:06:23] (Merged) jenkins-bot: Enable RCFilters beta feature on all remaining wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/347045 (https://phabricator.wikimedia.org/T144458) (owner: Catrope) [23:08:41] (PS2) Chad: DynamicSidebar: Use standard extension registration [mediawiki-config] - https://gerrit.wikimedia.org/r/352979 [23:09:41] Reedy bd808 does this $smwgNamespacesWithSemanticLinks[NS_NOVA_RESOURCE] = true; need to be removed from this https://gerrit.wikimedia.org/r/#/c/352979/2/wmf-config/wikitech.php file? 
[23:09:49] (CR) jenkins-bot: Enable RCFilters beta feature on all remaining wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/347045 (https://phabricator.wikimedia.org/T144458) (owner: Catrope) [23:09:55] sounds like a semantic mediawiki type config [23:10:14] sounds like it [23:10:19] Ah yeah, that can go [23:11:09] ok [23:11:10] thanks [23:11:47] (Draft1) Paladox: Wikitech: Remove $smwgNamespacesWithSemanticLinks config [mediawiki-config] - https://gerrit.wikimedia.org/r/352980 [23:11:48] (PS2) Paladox: Wikitech: Remove $smwgNamespacesWithSemanticLinks config [mediawiki-config] - https://gerrit.wikimedia.org/r/352980 (https://phabricator.wikimedia.org/T53642) [23:13:04] !log analytics1027 - decom: revoke puppet cert, delete salt key, puppet node clean/deactivate, check icinga removal (T161597) [23:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:13] T161597: Decom/Reclaim analytics1027 - https://phabricator.wikimedia.org/T161597 [23:15:20] Operations, ops-eqiad, DBA, Patch-For-Review: Decommission db1040 - https://phabricator.wikimedia.org/T164057#3220228 (Dzahn) is still in puppet site.pp and Icinga [23:15:28] Operations, DBA, Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3250128 (Dzahn) [23:15:30] Operations, ops-eqiad, DBA, Patch-For-Review: Decommission db1040 - https://phabricator.wikimedia.org/T164057#3250127 (Dzahn) Resolved>Open [23:18:42] (PS1) Dzahn: site.pp: remove db1040, decom [puppet] - https://gerrit.wikimedia.org/r/352982 (https://phabricator.wikimedia.org/T164057) [23:23:23] !log demon@tin Finished scap: rebuilding l10n for extension-list swap (duration: 34m 10s) [23:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:37] RECOVERY - puppet last run on hydrogen is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures 
[23:27:49] Operations, ops-eqiad, Analytics, DC-Ops, Patch-For-Review: Decom/Reclaim analytics1027 - https://phabricator.wikimedia.org/T161597#3250139 (Dzahn) Open>Resolved removed from site.pp / puppet / salt / icinga ^. that should be all now. i don't see it anywhere else. [23:29:32] RoanKattouw: Done btw [23:29:37] (PS2) Dzahn: site.pp: remove db1040, decom [puppet] - https://gerrit.wikimedia.org/r/352982 (https://phabricator.wikimedia.org/T164057) [23:30:50] (CR) Dzahn: [C: 2] "db1040.eqiad.wmnet is already gone from DNS and decom but was still here" [puppet] - https://gerrit.wikimedia.org/r/352982 (https://phabricator.wikimedia.org/T164057) (owner: Dzahn) [23:32:23] Biggest gripe that has been broken in kibana4, kibana5 (but worked just fine in kibana3).... clicking the kibana logo doesn't take you back to the default dashboard [23:32:30] There's no quick way back to the default dashboard [23:33:49] !log db1040 - remove from puppet, puppet node clean/deactivate, deleted salt-key, remove from icinga by running puppet on tegmen after that (T164057) [23:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:56] T164057: Decommission db1040 - https://phabricator.wikimedia.org/T164057 [23:34:46] Eh, clicking away from dashboard to another section (like visualization) then back does it [23:35:00] But if you're already looking at *a* dashboard, clicking dashboard takes you to the list of dashboards [23:35:08] Interesting that the button goes 2 places depending on your location [23:35:39] Operations, DBA, Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3250147 (Dzahn) [23:35:42] Operations, ops-eqiad, DBA, Patch-For-Review: Decommission db1040 - https://phabricator.wikimedia.org/T164057#3250146 (Dzahn) Open>Resolved [23:35:50] Er, it's not consistent! 
[23:35:50] Yay [23:43:17] (CR) Dzahn: [C: 2] rsyslog: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352656 (owner: Dzahn) [23:43:23] (PS3) Dzahn: rsyslog: use logrotate::conf for logrotate [puppet] - https://gerrit.wikimedia.org/r/352656 [23:44:25] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [23:47:45] PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:49:25] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 16 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [23:54:21] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable RCFilters beta feature on all remaining wikis (T144458) (duration: 00m 44s) [23:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:29] T144458: Launch ERI RC page features as a Beta Feature to all wikis - https://phabricator.wikimedia.org/T144458 [23:58:59] (PS2) Catrope: Enable ORES on fiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/350488 (https://phabricator.wikimedia.org/T163011) [23:59:04] (CR) Catrope: [C: 2] Enable ORES on fiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/350488 (https://phabricator.wikimedia.org/T163011) (owner: Catrope)