[00:00:04] twentyafterfour: Dear deployers, time to do the Phabricator update deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180412T0000).
[00:00:04] No GERRIT patches in the queue for this window AFAICS.
[00:05:49] MaxSem: 19:38 < shinken-wm> PROBLEM - Free space - all mounts on integration-slave-jessie-1003 is CRITICAL: CRITICAL: integration.integration-slave-jessie-1003.diskspace._srv.byte_percentfree (<44.44%)
[00:06:03] that is the same instance name that shows up in the failed jerkins-bot output
[00:08:23] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/425723/ (duration: 01m 18s)
[00:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:41] !log phabricator update will begin shortly, running a bit behind due to a massive upstream merge which will have to wait until later date.
[00:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:44] !log jerkins-bot tests all return -1 due to operations-mw-config-php55lint failing which says it can't clone on integration-slave-jessie-1003, which is out of disk space in /srv as reported by shinken. it's mostly all /srv/pbuilder
[00:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:08] !log preparing to deploy phabricator rPHDEP/release/2018-04-12/1 https://phabricator.wikimedia.org/project/view/3335/
[00:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:36] mutante: we can probably delete the pbuilder stuff?
[00:15:39] I think so anyway
[00:17:00] twentyafterfour: it seems somebody fixed it just now
[00:17:07] or something automatic cleaned it up
[00:17:17] a lot of things are not in /srv/jenkins-workspace/workspace anymore
[00:17:52] and i was wrong, other stuff was larger than that.. but now it's gone.
oh well
[00:18:05] was about to create a ticket but then i won't worry
[00:18:31] :)
[00:18:46] !log phabricator will be offline for just a moment while I run the upgrade script.
[00:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:45] twentyafterfour: how long will it be down for?
[00:22:20] a moment while he runs the upgrade script
[00:23:03] yes, I meant how long is "a moment" :p
[00:23:05] !log phabricator is back
[00:23:09] legoktm: sorry
[00:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:16] the moment was longer than intended
[00:23:53] TIL: a moment is defined as 90 seconds
[00:24:01] seems like you made it
[00:24:02] thanks :)
[00:24:29] I had typed a decently long comment and then "Wikimedia server error" and that was a bit scary :P
[00:24:31] my moment was ~4 minutes I think.... I did not know a moment had a precise definition
[00:24:44] The movement of a shadow on a sundial covered 40 moments in a solar hour.
[00:24:52] legoktm: Sorry about that, it should save your comment thanks to autosave?
[00:25:12] I got it from the back button :)
[00:25:56] back button keeping form content is an undervalued feature
[00:26:02] we are getting close to the point where phabricator upgrades will just involve swapping between phab1001 and phab2001 seamlessly
[00:26:25] chicocvenancio: indeed, and sometimes it requires a bunch of magic incantations to make it work
[00:27:00] in the case of phabricator, it saves partially typed comments server-side and restores them when you lose connection and then come back later
[00:27:38] back button on the task creation form isn't so user-friendly
[00:27:51] last I checked, you lose the entire thing, sadly
[00:28:54] yep that was my experience the last time I accidentally pressed back and forwards
[00:31:49] twentyafterfour: yep, the server-side comment save has saved me when the battery of the notebook disconnected, very relieved to see the comment there when I got it back
[00:33:06] !log phabricator: hotfixing DeadlineEditEngineSubtype.php
[00:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:42] !log The hotfix that I deployed for phabricator: https://phabricator.wikimedia.org/rPHEX7801b519442eea2bfd47a272ba36959b487ae7d7
[00:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:40] PROBLEM - puppet last run on snapshot1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh]
[00:57:12] Operations, monitoring, Patch-For-Review, Services (watching): Add Reading Infrastructure engineers to contacts for mobileapps - https://phabricator.wikimedia.org/T189524#4125660 (Dzahn) @Mholloway Try this please: - go to https://icinga.wikimedia.org/icinga/ - login using your wikitech - type...
[01:17:40] RECOVERY - puppet last run on snapshot1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[02:05:28] (Abandoned) Bstorm: Revert "wiki replicas: depool labsdb1009 for view updates" [puppet] - https://gerrit.wikimedia.org/r/425718 (owner: Bstorm)
[02:38:47] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.28) (duration: 07m 20s)
[02:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:55:08] (PS1) Bstorm: wiki replicas: comment view scope fix [puppet] - https://gerrit.wikimedia.org/r/425741 (https://phabricator.wikimedia.org/T181650)
[02:56:37] (PS1) Bstorm: Revert "wiki replicas: depool labsdb1009 for view updates" [puppet] - https://gerrit.wikimedia.org/r/425742
[02:57:38] (CR) Bstorm: [C: 2] wiki replicas: comment view scope fix [puppet] - https://gerrit.wikimedia.org/r/425741 (https://phabricator.wikimedia.org/T181650) (owner: Bstorm)
[03:35:49] Operations, monitoring, Patch-For-Review, Services (watching): Add Reading Infrastructure engineers to contacts for mobileapps - https://phabricator.wikimedia.org/T189524#4125812 (bearND) @Dzahn I just tried the "add a comment to checked services" command myself and it says Not Authorized.
[03:36:09] Operations, DBA, MediaWiki-Page-deletion, Wikimedia-Incident: Deletion not working on English Wikipedia - https://phabricator.wikimedia.org/T191875#4125816 (Anomie)
[03:36:13] Operations, DBA, MediaWiki-Page-deletion, MW-1.31-release-notes (WMF-deploy-2018-04-17 (1.31.0-wmf.30)), and 2 others: Reduce locking contention on deletion of pages - https://phabricator.wikimedia.org/T191892#4125813 (Anomie) Open>Resolved a:Anomie This should be resolved now, for de...
[05:05:28] (CR) Giuseppe Lavagetto: [C: -1] "The structure is sound, but please remove the create_resources call" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/424151 (https://phabricator.wikimedia.org/T190090) (owner: Ayounsi)
[05:08:10] (CR) Giuseppe Lavagetto: [C: 1] "LGTM, and is also a beta-only change." [puppet] - https://gerrit.wikimedia.org/r/398399 (owner: EddieGP)
[05:08:51] (PS2) Marostegui: Revert "wiki replicas: depool labsdb1009 for view updates" [puppet] - https://gerrit.wikimedia.org/r/425742 (owner: Bstorm)
[05:09:03] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1099:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/425745
[05:09:06] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1099:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/425745
[05:09:42] (CR) Marostegui: [C: 2] Revert "wiki replicas: depool labsdb1009 for view updates" [puppet] - https://gerrit.wikimedia.org/r/425742 (owner: Bstorm)
[05:11:46] !log Reload haproxy on dbproxy1011 to repool labsdb1009
[05:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:12:16] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1099:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/425745 (owner: Marostegui)
[05:13:30] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1099:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/425745 (owner: Marostegui)
[05:13:32] (PS40) Aaron Schulz: Add mcrouter module and mcrouter_wancache profile and enable on beta [puppet] - https://gerrit.wikimedia.org/r/392221
[05:13:46] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1099:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/425745 (owner: Marostegui)
[05:15:03] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1099:3318 after alter table (duration: 01m 18s)
[05:15:09] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:19:09] PROBLEM - MariaDB Slave Lag: s8 on db2079 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 18235.53 seconds
[05:19:30] PROBLEM - MariaDB Slave Lag: s8 on db2081 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 59357.70 seconds
[05:19:40] PROBLEM - MariaDB Slave Lag: s8 on db2080 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 10694.61 seconds
[05:19:40] PROBLEM - MariaDB Slave Lag: s8 on db2082 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 13550.25 seconds
[05:19:47] That is a downtime that expired
[05:21:24] (PS1) Marostegui: db-eqiad.php: Depool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/425746 (https://phabricator.wikimedia.org/T187089)
[05:22:53] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/425746 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[05:24:15] (Merged) jenkins-bot: db-eqiad.php: Depool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/425746 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[05:25:51] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1109 for alter table (duration: 01m 17s)
[05:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:07] !log Deploy schema change on db1109 - T187089 T185128 T153182
[05:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:14] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089
[05:27:14] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182
[05:27:15] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128
[05:29:27] (CR) jenkins-bot: db-eqiad.php: Depool db1109 [mediawiki-config] -
https://gerrit.wikimedia.org/r/425746 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[05:32:40] RECOVERY - MariaDB Slave Lag: s8 on db2080 is OK: OK slave_sql_lag Replication lag: 0.15 seconds
[05:34:23] !log Deploy schema change on s5 primary master (db1070) - T190780
[05:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:28] T190780: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780
[05:35:49] RECOVERY - MariaDB Slave Lag: s8 on db2082 is OK: OK slave_sql_lag Replication lag: 0.39 seconds
[05:39:10] !log Deploy schema change on s6 primary master (db1061) - T190780
[05:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:10] RECOVERY - MariaDB Slave Lag: s8 on db2079 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[05:45:45] !log Deploy schema change on s4 primary master (db1068) - T190780
[05:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:51] T190780: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780
[05:49:31] !log Deploy schema change on s8 primary master (db1071) - T190780
[05:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:52:26] !log Deploy schema change on s2 primary master (db1054) - T190780
[05:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:52:32] T190780: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780
[06:02:33] (CR) Giuseppe Lavagetto: [C: -1] "LGTM, see small comment" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/425238 (https://phabricator.wikimedia.org/T182924) (owner: Elukey)
[06:02:50] PROBLEM - Disk space on stat1005 is CRITICAL: DISK CRITICAL - /mnt/hdfs is not accessible: Transport endpoint is not connected
[06:03:50] RECOVERY - Disk space on stat1005 is OK: DISK OK
[06:05:47] !log force kill of fuse_dfs (handling /mnt/hdfs) on stat1005,
apparently causing a huge load
[06:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:42] (CR) Giuseppe Lavagetto: [C: 1] "I think we should tune down the number of workers more aggressively on the frontends than on the backends, that could go down and still be" [puppet] - https://gerrit.wikimedia.org/r/421860 (https://phabricator.wikimedia.org/T184561) (owner: Filippo Giunchedi)
[06:08:19] PROBLEM - Disk space on stat1004 is CRITICAL: DISK CRITICAL - /mnt/hdfs is not accessible: Software caused connection abort
[06:08:51] !log force kill of fuse_dfs (handling /mnt/hdfs) on stat1004, apparently causing a huge load
[06:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:09] RECOVERY - Disk space on stat1004 is OK: DISK OK
[06:11:35] !log Deploy schema change on s7 primary master (db1062) - T190780
[06:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:41] T190780: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780
[06:12:36] apergos: o/
[06:13:41] so /mnt/hdfs was causing a load spike on stat100[45], and running manually the check froze my shell
[06:13:56] the alert was listed as UNKNOWN on icinga
[06:14:30] I suspect that this might be due to the restarts of the Hadoop HDFS Namenodes that I did yesterday
[06:14:34] timing more or less matches
[06:16:04] weird thing is that on an1003 the /mnt/hdfs mountpoint didn't cause any mess
[06:16:14] did you get rsync failures via email?
[06:23:22] (CR) Giuseppe Lavagetto: [C: -1] "Please amend the comment in the patch."
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/423882 (https://phabricator.wikimedia.org/T185195) (owner: Muehlenhoff)
[06:24:17] !log Deploy schema change on s1 primary master (db1052) - T190780
[06:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:23] T190780: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780
[06:27:46] !log Deploy schema change on s3 codfw master (db2043) - this will generate lag on s3 codfw -T190780
[06:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:40] PROBLEM - puppet last run on dbproxy1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh]
[06:31:39] (PS1) Marostegui: db-eqiad.php: Depool db1072 [mediawiki-config] - https://gerrit.wikimedia.org/r/425749 (https://phabricator.wikimedia.org/T190780)
[06:33:08] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1072 [mediawiki-config] - https://gerrit.wikimedia.org/r/425749 (https://phabricator.wikimedia.org/T190780) (owner: Marostegui)
[06:34:36] (Merged) jenkins-bot: db-eqiad.php: Depool db1072 [mediawiki-config] - https://gerrit.wikimedia.org/r/425749 (https://phabricator.wikimedia.org/T190780) (owner: Marostegui)
[06:35:19] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))): 73920.04219409282 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[06:36:18] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1072 for alter table - T190780 (duration: 01m 18s)
[06:36:19] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: OK -
scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[06:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:24] T190780: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780
[06:36:59] RECOVERY - MariaDB Slave Lag: s8 on db2081 is OK: OK slave_sql_lag Replication lag: 0.16 seconds
[06:39:20] (CR) jenkins-bot: db-eqiad.php: Depool db1072 [mediawiki-config] - https://gerrit.wikimedia.org/r/425749 (https://phabricator.wikimedia.org/T190780) (owner: Marostegui)
[06:42:13] !log Deploy schema change on db1072 (sanitarium master for s3) - this will generate lag on s3 labsdb - T190780
[06:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:19] T190780: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780
[06:42:34] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - https://gerrit.wikimedia.org/r/425751
[06:44:44] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - https://gerrit.wikimedia.org/r/425751 (owner: Marostegui)
[06:45:58] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - https://gerrit.wikimedia.org/r/425751 (owner: Marostegui)
[06:48:07] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1072 after alter table - T190780 (duration: 01m 16s)
[06:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:13] T190780: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780
[06:48:22] (CR) Jcrespo: "[08:43] <_joe_> confctl --object-type mwconfig tags 'scope=common,name=WMFMasterDatacenter'
--action get all" [puppet] - https://gerrit.wikimedia.org/r/345346 (https://phabricator.wikimedia.org/T156924) (owner: Jcrespo)
[06:49:05] (PS1) Marostegui: db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/425752 (https://phabricator.wikimedia.org/T190780)
[06:49:27] (CR) Giuseppe Lavagetto: [C: -2] "The ServerAlias *.wikimedia.org means that any subsite matching it will be processed by that virtualhost, if they haven't been managed by " [puppet] - https://gerrit.wikimedia.org/r/424707 (https://phabricator.wikimedia.org/T173887) (owner: EddieGP)
[06:50:22] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/425752 (https://phabricator.wikimedia.org/T190780) (owner: Marostegui)
[06:51:00] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1072" [mediawiki-config] - https://gerrit.wikimedia.org/r/425751 (owner: Marostegui)
[06:51:33] (Merged) jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/425752 (https://phabricator.wikimedia.org/T190780) (owner: Marostegui)
[06:53:08] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1077 for alter table - T190780 (duration: 01m 17s)
[06:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:14] T190780: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780
[06:53:22] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - https://gerrit.wikimedia.org/r/425753
[06:54:44] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - https://gerrit.wikimedia.org/r/425753 (owner: Marostegui)
[06:55:13] (CR) Giuseppe Lavagetto: [C: 1] "I like the patch as it is; we still obviously lack proper monitoring (beyond just seeing the process is up), but that can be added later."
[puppet] - https://gerrit.wikimedia.org/r/392221 (owner: Aaron Schulz)
[06:55:57] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - https://gerrit.wikimedia.org/r/425753 (owner: Marostegui)
[06:56:32] (CR) jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/425752 (https://phabricator.wikimedia.org/T190780) (owner: Marostegui)
[06:56:36] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - https://gerrit.wikimedia.org/r/425753 (owner: Marostegui)
[06:57:30] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1077 after alter table - T190780 (duration: 01m 17s)
[06:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:40] RECOVERY - puppet last run on dbproxy1010 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:00:21] (CR) Jcrespo: "Alternatively, but probably a bad idea because a self-dependency, https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&format=jso" [puppet] - https://gerrit.wikimedia.org/r/345346 (https://phabricator.wikimedia.org/T156924) (owner: Jcrespo)
[07:04:52] (PS1) Marostegui: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/425754 (https://phabricator.wikimedia.org/T190780)
[07:08:34] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/425754 (https://phabricator.wikimedia.org/T190780) (owner: Marostegui)
[07:10:06] (Merged) jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/425754 (https://phabricator.wikimedia.org/T190780) (owner: Marostegui)
[07:10:37] (CR) jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/425754 (https://phabricator.wikimedia.org/T190780) (owner: Marostegui)
[07:11:57] !log marostegui@tin Synchronized
wmf-config/db-eqiad.php: Depool db1078 for alter table - T190780 (duration: 01m 17s)
[07:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:03] T190780: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780
[07:12:54] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/425756
[07:14:42] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/425756 (owner: Marostegui)
[07:16:06] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/425756 (owner: Marostegui)
[07:16:21] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/425756 (owner: Marostegui)
[07:17:56] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1078 after alter table - T190780 (duration: 01m 16s)
[07:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:02] T190780: Schema changes to site_stats - https://phabricator.wikimedia.org/T190780
[07:28:01] elukey: sorry, I didn't notice the ping! yes I got emails all right, heh
[07:29:03] :)
[07:29:38] so fuse_dfs was eating a ton of memory, and I haven't thought about stracing / attaching gdb to it (need to remember next time)
[07:29:53] timeout sadly doesn't work
[07:30:29] I noticed the issue because icinga was showing a UNKNOWN
[07:30:44] not sure if it is possible to alarm even for that condition
[07:32:13] ugh
[07:39:33] (CR) EddieGP: "I'm aware what the wildcard does. I don't think getting rid of it is a bad thing per se.
It'd just be foolish to do so without checking if" [puppet] - https://gerrit.wikimedia.org/r/424707 (https://phabricator.wikimedia.org/T173887) (owner: EddieGP)
[07:40:01] !log enabling production traffic for mw1265
[07:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:29] (PS2) Muehlenhoff: Remove profile::beta::icu57 [puppet] - https://gerrit.wikimedia.org/r/425032 (owner: EddieGP)
[07:41:46] (CR) Muehlenhoff: [C: 2] Remove profile::beta::icu57 [puppet] - https://gerrit.wikimedia.org/r/425032 (owner: EddieGP)
[07:42:00] !log reimage es1012
[07:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:14] eddiegp: thanks for your patch, now merged
[07:43:42] (PS2) Jcrespo: Revert "mariadb: Allow reimage of all es2*** hosts to stretch" [puppet] - https://gerrit.wikimedia.org/r/425492
[07:44:19] the reimage will have to wait, I forgot to deploy a patch
[07:45:09] (CR) Jcrespo: [C: 2] Revert "mariadb: Allow reimage of all es2*** hosts to stretch" [puppet] - https://gerrit.wikimedia.org/r/425492 (owner: Jcrespo)
[07:46:00] moritzm: Thanks! :)
[07:46:29] PROBLEM - puppet last run on lvs5003 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[07:47:00] uh
[07:50:48] !log Drop table linkscc from s4,s5 and s6
[07:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:29] RECOVERY - puppet last run on lvs5003 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[07:52:17] (PS1) Jcrespo: mariadb: Change host allowed to be reimaged [puppet] - https://gerrit.wikimedia.org/r/425765
[07:54:20] (CR) Jcrespo: [C: 2] mariadb: Change host allowed to be reimaged [puppet] - https://gerrit.wikimedia.org/r/425765 (owner: Jcrespo)
[07:54:27] (PS1) Jcrespo: mariadb: Disable reimaging on all db hosts by default [puppet] - https://gerrit.wikimedia.org/r/425766
[07:55:42] !log Drop table linkscc from s2 and s7
[07:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:43] (PS1) Jcrespo: Revert "mariadb: Depool es1012 for reimage" [mediawiki-config] - https://gerrit.wikimedia.org/r/425767
[07:58:50] (PS2) Jcrespo: Revert "mariadb: Depool es1012 for reimage" [mediawiki-config] - https://gerrit.wikimedia.org/r/425767
[08:00:21] (PS1) Jcrespo: mariadb: Depool es2013 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/425769
[08:11:26] !log Drop table linkscc from s1
[08:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:45] !log Drop table linkscc from s3 codfw primary master
[08:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:21] (PS2) Muehlenhoff: Disable PrivateTmp via systemd override for stretch-based app servers [puppet] - https://gerrit.wikimedia.org/r/425509 (https://phabricator.wikimedia.org/T185195)
[08:34:40] (PS3) Muehlenhoff: Disable PrivateTmp via systemd override for stretch-based app servers [puppet] - https://gerrit.wikimedia.org/r/425509 (https://phabricator.wikimedia.org/T185195)
[08:35:56] (CR) Muehlenhoff: [C: 2]
Disable PrivateTmp via systemd override for stretch-based app servers [puppet] - https://gerrit.wikimedia.org/r/425509 (https://phabricator.wikimedia.org/T185195) (owner: Muehlenhoff)
[08:41:04] Operations, ops-codfw, DBA: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4126226 (Marostegui) a:Papaul
[08:41:06] Operations, ops-codfw, DBA: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4123236 (Marostegui) p:Triage>Normal
[08:43:14] (CR) Volans: "I'm not convinced this is the right solution, at least not this alone." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/422373 (owner: Giuseppe Lavagetto)
[08:44:25] Operations, Scap, Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4126233 (MoritzMuehlenhoff) Is the localisation cache currently generated incompletey as a consequence of that bug? I reimaged an app server with str...
[08:53:13] (PS1) Muehlenhoff: Reimage mw1279 (API canary) with stretch [puppet] - https://gerrit.wikimedia.org/r/425772 (https://phabricator.wikimedia.org/T174431)
[08:54:09] (CR) Muehlenhoff: [C: 2] Reimage mw1279 (API canary) with stretch [puppet] - https://gerrit.wikimedia.org/r/425772 (https://phabricator.wikimedia.org/T174431) (owner: Muehlenhoff)
[08:54:56] Operations, LDAP-Access-Requests, WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T192060#4126243 (Tarrow)
[08:55:50] (PS1) ArielGlenn: remove ms1001 from dump rsync targets [puppet] - https://gerrit.wikimedia.org/r/425773 (https://phabricator.wikimedia.org/T182540)
[09:04:56] (CR) ArielGlenn: [C: 2] remove ms1001 from dump rsync targets [puppet] - https://gerrit.wikimedia.org/r/425773 (https://phabricator.wikimedia.org/T182540) (owner: ArielGlenn)
[09:05:37] (PS2) ArielGlenn: remove ms1001 from dump rsync targets [puppet] - https://gerrit.wikimedia.org/r/425773 (https://phabricator.wikimedia.org/T182540)
[09:10:31] (PS1) Vgutierrez: install_server: Reimage pybal-test2002 as stretch [puppet] - https://gerrit.wikimedia.org/r/425775 (https://phabricator.wikimedia.org/T190993)
[09:20:16] (PS1) ArielGlenn: Turn off incoming rsyncs of various datasets to dataset1001 [puppet] - https://gerrit.wikimedia.org/r/425776 (https://phabricator.wikimedia.org/T182540)
[09:22:13] (CR) ArielGlenn: [C: 2] Turn off incoming rsyncs of various datasets to dataset1001 [puppet] - https://gerrit.wikimedia.org/r/425776 (https://phabricator.wikimedia.org/T182540) (owner: ArielGlenn)
[09:37:21] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[09:37:42] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0
[09:40:01] PROBLEM -
Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[09:40:18] ema, vgutierrez ^^^
[09:41:46] the spike seems recovered though
[09:43:02] and there is a maintenance window, so probably related to the lost of connectivity and re-routing time
[09:48:02] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[09:51:38] (PS1) Jcrespo: install_server: Fix probable netboot bug introduced on earlier commit [puppet] - https://gerrit.wikimedia.org/r/425779
[09:53:02] (Abandoned) Jcrespo: mariadb: Disable reimaging on all db hosts by default [puppet] - https://gerrit.wikimedia.org/r/425766 (owner: Jcrespo)
[09:53:16] (Abandoned) Jcrespo: install_server: Fix probable netboot bug introduced on earlier commit [puppet] - https://gerrit.wikimedia.org/r/425779 (owner: Jcrespo)
[09:54:14] (PS1) Jcrespo: install_server: Fix probable netboot bug introduced on earlier commit [puppet] - https://gerrit.wikimedia.org/r/425780
[09:55:17] (CR) Jcrespo: [C: 2] install_server: Fix probable netboot bug introduced on earlier commit [puppet] - https://gerrit.wikimedia.org/r/425780 (owner: Jcrespo)
[09:58:33] !log reimage es1012, take 2
[09:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:46] XDD
[10:05:28] (CR) Alexandros Kosiaris: [C: -1] "Technically fine, minor comments here and there. They are on the commit message but the comment in the manifest is exactly the same."
(4 comments) [puppet] - https://gerrit.wikimedia.org/r/423882 (https://phabricator.wikimedia.org/T185195) (owner: Muehlenhoff) [10:07:01] Operations, Puppet, Patch-For-Review: uwsgi::app sorts config keys, but the .ini file behavior depends on order - https://phabricator.wikimedia.org/T191648#4126403 (akosiaris) >>! In T191648#4124737, @Andrew wrote: >> Fixing this looks to be as easy as passing $service_settings => '--die-on-term'in o... [10:11:02] RECOVERY - PyBal backends health check on lvs5003 is OK: PYBAL OK - All pools are healthy [10:15:40] (PS1) Ema: pybal::monitoring: install libwww-perl [puppet] - https://gerrit.wikimedia.org/r/425782 (https://phabricator.wikimedia.org/T177961) [10:16:29] (CR) Alexandros Kosiaris: [C: 1] puppet-agent: log puppet runs via syslog [puppet] - https://gerrit.wikimedia.org/r/425538 (https://phabricator.wikimedia.org/T75989) (owner: Herron) [10:21:29] (CR) Vgutierrez: [C: 1] pybal::monitoring: install libwww-perl [puppet] - https://gerrit.wikimedia.org/r/425782 (https://phabricator.wikimedia.org/T177961) (owner: Ema) [10:25:05] perl alert?? :D [10:26:15] (CR) Ema: [C: 2] pybal::monitoring: install libwww-perl [puppet] - https://gerrit.wikimedia.org/r/425782 (https://phabricator.wikimedia.org/T177961) (owner: Ema) [10:26:45] elukey: in Perl we trust [10:35:30] (CR) Vgutierrez: wmf-auto-reimage: bugfix Phabricator client (1 comment) [puppet] - https://gerrit.wikimedia.org/r/425495 (owner: Volans) [10:37:42] (CR) Volans: "Reply inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/425495 (owner: Volans) [10:38:58] (CR) Vgutierrez: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/425495 (owner: Volans) [10:39:17] thanks! 
[10:39:18] (PS2) Volans: wmf-auto-reimage: bugfix Phabricator client [puppet] - https://gerrit.wikimedia.org/r/425495 [10:40:06] (CR) Volans: [C: 2] wmf-auto-reimage: bugfix Phabricator client [puppet] - https://gerrit.wikimedia.org/r/425495 (owner: Volans) [10:41:50] (CR) Elukey: Swap conf1001 with conf1004 in Zookeeper main-eqiad (1 comment) [puppet] - https://gerrit.wikimedia.org/r/425238 (https://phabricator.wikimedia.org/T182924) (owner: Elukey) [11:02:02] PROBLEM - Disk space on mw1279 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:02:02] PROBLEM - dhclient process on mw1279 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:02:17] Operations, Scap, Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4126493 (demon) Pretty sure that's unrelated....but I've definitely seen it before. Mostly when running maintenance scripts from tin instead of terbi... [11:03:42] PROBLEM - mediawiki-installation DSH group on mw1279 is CRITICAL: Host mw1279 is not in mediawiki-installation dsh group [11:03:42] PROBLEM - HHVM processes on mw1279 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:05:11] PROBLEM - High CPU load on API appserver on mw1279 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:05:22] PROBLEM - nutcracker port on mw1279 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[11:05:27] (CR) Muehlenhoff: Disable PrivateTmp via systemd override for video scalers (4 comments) [puppet] - https://gerrit.wikimedia.org/r/423882 (https://phabricator.wikimedia.org/T185195) (owner: Muehlenhoff) [11:05:32] (PS5) Muehlenhoff: Disable PrivateTmp via systemd override for video scalers [puppet] - https://gerrit.wikimedia.org/r/423882 (https://phabricator.wikimedia.org/T185195) [11:05:54] ^ mw1279 is a reimage, silencing [11:07:06] (PS4) Elukey: Swap conf1001 with conf1004 in Zookeeper main-eqiad [puppet] - https://gerrit.wikimedia.org/r/425238 (https://phabricator.wikimedia.org/T182924) [11:14:47] Operations, Scap, Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4126516 (MoritzMuehlenhoff) >>! In T191921#4126493, @demon wrote: > Pretty sure that's unrelated....but I've definitely seen it before. Mostly when r... [11:15:44] (PS1) Jcrespo: mariadb: Repool es1012 with only a 4.7% of the total traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/425787 [11:18:52] RECOVERY - HHVM processes on mw1279 is OK: PROCS OK: 6 processes with command name hhvm [11:19:12] RECOVERY - Disk space on mw1279 is OK: DISK OK [11:19:12] RECOVERY - dhclient process on mw1279 is OK: PROCS OK: 0 processes with command name dhclient [11:19:21] RECOVERY - High CPU load on API appserver on mw1279 is OK: OK - load average: 7.46, 7.26, 4.30 [11:25:32] RECOVERY - nutcracker port on mw1279 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [11:28:31] Operations, Scap, Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4126550 (demon) No, that should've done it :( [11:32:46] Operations, Beta-Cluster-Infrastructure, HHVM, User-Elukey, User-Joe: Upgrade deployment-prep appserver fleet to Debian Stretch (using HHVM) - 
https://phabricator.wikimedia.org/T192071#4126558 (Joe) p:Triage>Normal [11:34:29] <_joe_> uhm moritzm is mw1279 the appserver you just reimaged? [11:35:09] <_joe_> yeah [11:38:45] (CR) Elukey: "pcc: https://puppet-compiler.wmflabs.org/compiler02/10909/" [puppet] - https://gerrit.wikimedia.org/r/425238 (https://phabricator.wikimedia.org/T182924) (owner: Elukey) [11:39:02] (CR) Alexandros Kosiaris: [C: 1] Disable PrivateTmp via systemd override for video scalers [puppet] - https://gerrit.wikimedia.org/r/423882 (https://phabricator.wikimedia.org/T185195) (owner: Muehlenhoff) [11:39:28] Operations, Scap, Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4126575 (MoritzMuehlenhoff) Comparing the freshly installed app server (mw1265) with an existing one (mw1264) also shows that /srv/mediawiki/php-1.31... [11:40:20] _joe_: yeah, it's depooled for reimage, that's just the usual Icinga spam when Icinga recreates the host entry [11:41:50] (CR) Elukey: [C: 1] "Looks good from https://puppet-compiler.wmflabs.org/compiler02/10910/kafka1012.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/425550 (https://phabricator.wikimedia.org/T185136) (owner: Ema) [11:50:13] (Abandoned) Dereckson: Enforce 1. Echo 2. 
GlobalPreferences extensions load order [mediawiki-config] - https://gerrit.wikimedia.org/r/423809 (https://phabricator.wikimedia.org/T190353) (owner: Dereckson) [11:59:34] !log pooling mw1279 for some brief test production traffic [11:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:40] RECOVERY - mediawiki-installation DSH group on mw1279 is OK: OK [12:06:22] jouncebot: next [12:06:22] In 0 hour(s) and 53 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180412T1300) [12:09:59] (PS5) Fdans: Puppetize cron job archiving old MaxMind databases [puppet] - https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) [12:10:23] (PS1) Marostegui: db-eqiad.php: Depool db1114 from API [mediawiki-config] - https://gerrit.wikimedia.org/r/425789 (https://phabricator.wikimedia.org/T191996) [12:10:35] (CR) jerkins-bot: [V: -1] Puppetize cron job archiving old MaxMind databases [puppet] - https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) (owner: Fdans) [12:11:01] (CR) Jcrespo: [C: 2] mariadb: Repool es1012 with only a 4.7% of the total traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/425787 (owner: Jcrespo) [12:12:31] (Merged) jenkins-bot: mariadb: Repool es1012 with only a 4.7% of the total traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/425787 (owner: Jcrespo) [12:12:45] (CR) jenkins-bot: mariadb: Repool es1012 with only a 4.7% of the total traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/425787 (owner: Jcrespo) [12:13:10] !log Deploy schema change on s8 dbstore1002 - T187089 T185128 T153182 [12:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:18] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [12:13:18] T153182: Perform schema 
change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [12:13:19] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [12:13:44] (PS2) Marostegui: db-eqiad.php: Depool db1114 from API [mediawiki-config] - https://gerrit.wikimedia.org/r/425789 (https://phabricator.wikimedia.org/T191996) [12:14:51] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool es1012 with low weight (duration: 01m 19s) [12:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:15] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1114 from API [mediawiki-config] - https://gerrit.wikimedia.org/r/425789 (https://phabricator.wikimedia.org/T191996) (owner: Marostegui) [12:16:28] (Merged) jenkins-bot: db-eqiad.php: Depool db1114 from API [mediawiki-config] - https://gerrit.wikimedia.org/r/425789 (https://phabricator.wikimedia.org/T191996) (owner: Marostegui) [12:17:13] queries to es1012 seem to be flowing normally, with no connection or query errors [12:17:48] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - https://gerrit.wikimedia.org/r/425791 [12:17:51] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - https://gerrit.wikimedia.org/r/425791 [12:18:16] (CR) jenkins-bot: db-eqiad.php: Depool db1114 from API [mediawiki-config] - https://gerrit.wikimedia.org/r/425789 (https://phabricator.wikimedia.org/T191996) (owner: Marostegui) [12:18:24] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1114 from API - T191996 (duration: 01m 17s) [12:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:30] T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996 [12:20:47] (CR) Marostegui: [C: -2] "Wait for the server to catch up" [mediawiki-config] - 
https://gerrit.wikimedia.org/r/425791 (owner: Marostegui) [12:26:03] (PS2) Vgutierrez: install_server: Reimage pybal-test2002 as stretch [puppet] - https://gerrit.wikimedia.org/r/425775 (https://phabricator.wikimedia.org/T190993) [12:26:53] (CR) Vgutierrez: [C: 2] install_server: Reimage pybal-test2002 as stretch [puppet] - https://gerrit.wikimedia.org/r/425775 (https://phabricator.wikimedia.org/T190993) (owner: Vgutierrez) [12:34:17] (PS4) Muehlenhoff: Enable base::service_auto_restart for uwsgi-puppetboard [puppet] - https://gerrit.wikimedia.org/r/424546 (https://phabricator.wikimedia.org/T135991) [12:36:44] (PS1) QChris: Update SSH key for qchris [puppet] - https://gerrit.wikimedia.org/r/425792 [12:39:09] PROBLEM - Check whether ferm is active by checking the default input chain on pybal-test2002 is CRITICAL: Return code of 255 is out of bounds [12:39:09] PROBLEM - DPKG on pybal-test2002 is CRITICAL: Return code of 255 is out of bounds [12:39:19] PROBLEM - configured eth on pybal-test2002 is CRITICAL: Return code of 255 is out of bounds [12:39:20] PROBLEM - puppet last run on pybal-test2002 is CRITICAL: Return code of 255 is out of bounds [12:39:29] PROBLEM - Check systemd state on pybal-test2002 is CRITICAL: Return code of 255 is out of bounds [12:39:29] PROBLEM - dhclient process on pybal-test2002 is CRITICAL: Return code of 255 is out of bounds [12:39:39] sigh.. 
that's me :) [12:39:49] PROBLEM - Check size of conntrack table on pybal-test2002 is CRITICAL: Return code of 255 is out of bounds [12:39:50] PROBLEM - Disk space on pybal-test2002 is CRITICAL: Return code of 255 is out of bounds [12:46:39] (CR) Muehlenhoff: [C: 2] Enable base::service_auto_restart for uwsgi-puppetboard [puppet] - https://gerrit.wikimedia.org/r/424546 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [12:48:42] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - https://gerrit.wikimedia.org/r/425791 (owner: Marostegui) [12:50:04] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - https://gerrit.wikimedia.org/r/425791 (owner: Marostegui) [12:50:18] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1109" [mediawiki-config] - https://gerrit.wikimedia.org/r/425791 (owner: Marostegui) [12:51:37] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1109 after alter table (duration: 01m 17s) [12:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:40] PROBLEM - Recursive DNS on 208.80.153.51 is CRITICAL: CRITICAL - Plugin timed out while executing system call [12:54:16] that's labtest-recursor0 ^ [12:56:59] RECOVERY - Check size of conntrack table on pybal-test2002 is OK: OK: nf_conntrack is 0 % full [12:56:59] RECOVERY - Disk space on pybal-test2002 is OK: DISK OK [12:57:10] RECOVERY - Check whether ferm is active by checking the default input chain on pybal-test2002 is OK: OK ferm input default policy is set [12:57:10] RECOVERY - DPKG on pybal-test2002 is OK: All packages OK [12:57:20] RECOVERY - configured eth on pybal-test2002 is OK: OK - interfaces up [12:57:30] RECOVERY - Check systemd state on pybal-test2002 is OK: OK - running: The system is fully operational [12:57:30] RECOVERY - dhclient process on pybal-test2002 is OK: PROCS OK: 0 processes with command name dhclient [12:58:09] (PS1) 
Jcrespo: mariadb: Fully repool es1012 after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/425796 [12:59:19] RECOVERY - puppet last run on pybal-test2002 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [12:59:31] (CR) Jcrespo: [C: 2] mariadb: Fully repool es1012 after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/425796 (owner: Jcrespo) [13:00:05] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180412T1300). [13:00:05] raynor: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:57] o/ [13:01:16] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool es1012 with full weight (duration: 01m 17s) [13:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:47] I can SWAT today [13:01:55] (CR) jenkins-bot: mariadb: Fully repool es1012 after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/425796 (owner: Jcrespo) [13:02:02] \o/ [13:02:07] hello zeljkof [13:02:55] hi raynor, I'll ping you in a few minutes when the patch is at mwdebug [13:03:45] (PS3) Zfilipin: Enable Page Previews for 10% enwiki anon users [mediawiki-config] - https://gerrit.wikimedia.org/r/425588 (https://phabricator.wikimedia.org/T189906) (owner: Pmiazga) [13:03:51] sure, thx zeljkof [13:04:46] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/425588 (https://phabricator.wikimedia.org/T189906) (owner: Pmiazga) [13:06:04] (Merged) jenkins-bot: Enable Page Previews for 10% enwiki anon users [mediawiki-config] - 
https://gerrit.wikimedia.org/r/425588 (https://phabricator.wikimedia.org/T189906) (owner: Pmiazga) [13:07:27] (CR) jenkins-bot: Enable Page Previews for 10% enwiki anon users [mediawiki-config] - https://gerrit.wikimedia.org/r/425588 (https://phabricator.wikimedia.org/T189906) (owner: Pmiazga) [13:07:46] raynor: the patch is at mwdebug1002, please test and let me know if I can deploy it [13:07:54] (CR) Muehlenhoff: [C: 2] Disable PrivateTmp via systemd override for video scalers [puppet] - https://gerrit.wikimedia.org/r/423882 (https://phabricator.wikimedia.org/T185195) (owner: Muehlenhoff) [13:08:03] (PS6) Muehlenhoff: Disable PrivateTmp via systemd override for video scalers [puppet] - https://gerrit.wikimedia.org/r/423882 (https://phabricator.wikimedia.org/T185195) [13:08:28] on it [13:12:35] (PS1) Jcrespo: mariadb: Depool es1013 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/425800 [13:12:52] (PS1) Gehel: maps: increment cassandra keyspace to v4 for tilerator [puppet] - https://gerrit.wikimedia.org/r/425801 (https://phabricator.wikimedia.org/T191655) [13:13:28] zeljkof, thx - the patch works [13:13:36] could you deploy to production, pleasE? [13:13:52] raynor: deploying [13:14:02] sorry for capital e -> shift is just below my enter key and I double pressed it. 
Facepalm [13:14:08] (CR) Sbisson: [C: 1] maps: increment cassandra keyspace to v4 for tilerator [puppet] - https://gerrit.wikimedia.org/r/425801 (https://phabricator.wikimedia.org/T191655) (owner: Gehel) [13:15:12] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:425588|Enable Page Previews for 10% enwiki anon users (T189906)]] (duration: 01m 18s) [13:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:18] T189906: Roll out VirtualPageViews to all Wikipedia wikis - https://phabricator.wikimedia.org/T189906 [13:15:25] raynor: on a new keyboard myself :D and deployed [13:15:29] raynor: please [13:15:32] argh [13:15:33] (PS1) Jcrespo: mariadb-install_server: Allow reimage of es1013 [puppet] - https://gerrit.wikimedia.org/r/425802 [13:15:48] please check and thanks for deploying with #releng [13:16:13] (CR) Jcrespo: [C: 2] mariadb-install_server: Allow reimage of es1013 [puppet] - https://gerrit.wikimedia.org/r/425802 (owner: Jcrespo) [13:16:22] thank you [13:17:59] I checked the production - works as expected [13:18:26] raynor: cool, looks like there was only your commit, swat is done then [13:18:32] do tell when happy so I can continue doing db maintenance earlier [13:18:38] !log EU SWAT finished [13:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:50] thanks, more time for me [13:18:51] jynus: done as far as I am concerned [13:18:52] jynus: I'm happy [13:19:59] RECOVERY - Recursive DNS on 208.80.153.51 is OK: DNS OK: 0.043 seconds response time. 
www.wikipedia.org returns 208.80.153.224 [13:22:05] !log deploying maps internationalization, including new keyspace and generating new tiles - T191655 [13:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:10] T191655: Deploy maps internationalization to production - https://phabricator.wikimedia.org/T191655 [13:22:26] !log i18n maps will not be available yet, this is only preliminary work [13:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:21] (PS2) Gehel: maps: increment cassandra keyspace to v4 for tilerator [puppet] - https://gerrit.wikimedia.org/r/425801 (https://phabricator.wikimedia.org/T191655) [13:24:52] (CR) Gehel: [C: 2] maps: increment cassandra keyspace to v4 for tilerator [puppet] - https://gerrit.wikimedia.org/r/425801 (https://phabricator.wikimedia.org/T191655) (owner: Gehel) [13:26:04] (CR) Jcrespo: [C: 2] mariadb: Depool es1013 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/425800 (owner: Jcrespo) [13:27:19] (Merged) jenkins-bot: mariadb: Depool es1013 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/425800 (owner: Jcrespo) [13:30:44] (PS1) Marostegui: db-eqiad.php: Depool db1101:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/425806 (https://phabricator.wikimedia.org/T187089) [13:31:26] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool es1013 (duration: 01m 17s) [13:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:57] !log installing openssl updates [13:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:20] PROBLEM - Host cp3048 is DOWN: PING CRITICAL - Packet loss = 100% [13:32:50] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1101:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/425806 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui) [13:32:55] Operations, Pybal, 
Traffic, Patch-For-Review: Upgrade LVS servers to stretch - https://phabricator.wikimedia.org/T177961#4126818 (Vgutierrez) [13:32:58] Operations, Pybal, Traffic, Patch-For-Review: Upgrade pybal-test instances to stretch - https://phabricator.wikimedia.org/T190993#4126816 (Vgutierrez) Open>stalled [13:33:30] Operations, Pybal, Traffic, Patch-For-Review: Upgrade pybal-test instances to stretch - https://phabricator.wikimedia.org/T190993#4090290 (Vgutierrez) Let's keep pybal-test2001 as jessie till we don't have any LVS on production running jessie [13:33:43] T190607 ? [13:33:43] T190607: cp3048 hardware issues - https://phabricator.wikimedia.org/T190607 [13:33:55] or just maintenance due to that? [13:34:06] (Merged) jenkins-bot: db-eqiad.php: Depool db1101:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/425806 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui) [13:34:07] bblack^ [13:34:09] RECOVERY - Host cp3048 is UP: PING OK - Packet loss = 0%, RTA = 84.26 ms [13:34:49] Operations, monitoring, Patch-For-Review, Services (watching): Add Reading Infrastructure engineers to contacts for mobileapps - https://phabricator.wikimedia.org/T189524#4126823 (Mholloway) I am able to access the comment-adding and downtime-scheduling interfaces by following the above instructi... 
[13:35:37] (PS4) Herron: puppet-agent: log puppet runs via syslog [puppet] - https://gerrit.wikimedia.org/r/425538 (https://phabricator.wikimedia.org/T75989) [13:35:49] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1101:3318 for alter table (duration: 01m 17s) [13:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:17] (PS6) Fdans: Puppetize cron job archiving old MaxMind databases [puppet] - https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) [13:36:40] elukey: all should be well now :) [13:36:59] oh, someone is starting to like puppet here [13:37:11] :-) [13:37:20] jynus: yeah probably re-crashed :/ [13:37:39] I report on ticket, but I guess it gets auto-depooled, so no emergency? [13:37:39] it's been depooled since that ticket, so no impact [13:37:41] ah [13:37:44] better, then [13:37:55] (CR) Herron: [C: 2] puppet-agent: log puppet runs via syslog [puppet] - https://gerrit.wikimedia.org/r/425538 (https://phabricator.wikimedia.org/T75989) (owner: Herron) [13:37:56] they only get to auto-depool if they manage to shutdown cleanly [13:38:24] although I guess pybal healthchecks would still keep it out of further new connections [13:38:31] but the hard depool is better [13:38:56] (PS7) Fdans: Puppetize cron job archiving old MaxMind databases [puppet] - https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) [13:39:53] Operations, ops-esams, Traffic: cp3048 hardware issues - https://phabricator.wikimedia.org/T190607#4078006 (jcrespo) It probably crashed today at 2018-04-12 13:31:20, hardware logs should be checked. [13:40:03] jynus: living puppet to the max :D [13:40:08] bblack: do you want me to do a quick check to the health logs? 
[13:40:10] I set a 30d downtime for now [13:40:15] (for icinga to ignore it) [13:40:16] !log dropping leftover keyspace v2 and v5 on maps / eqiad - T191655 [13:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:22] T191655: Deploy maps internationalization to production - https://phabricator.wikimedia.org/T191655 [13:42:21] !log sbisson@tin Started deploy [tilerator/deploy@46cc948]: Deploying tilerator@i18n everywhere [13:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:13] !log Deploy schema change on db1101:3318 - T187089 T185128 T153182 [13:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:20] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089 [13:44:20] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182 [13:44:21] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128 [13:44:23] (PS1) Elukey: role::analytics_cluster::coordinator: add the eventlogging whitelist [puppet] - https://gerrit.wikimedia.org/r/425810 (https://phabricator.wikimedia.org/T189691) [13:44:25] !log sbisson@tin Finished deploy [tilerator/deploy@46cc948]: Deploying tilerator@i18n everywhere (duration: 02m 04s) [13:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:47] (PS1) Vgutierrez: install_server: Reimage lvs2006 as stretch [puppet] - https://gerrit.wikimedia.org/r/425811 (https://phabricator.wikimedia.org/T191897) [13:47:17] !log updated puppet-run script to log using syslog and updated rsyslog config to direct puppet-agent logs to /var/log/puppet.log https://gerrit.wikimedia.org/r/425538 [13:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:20] (CR) Ottomata: 
role::analytics_cluster::coordinator: add the eventlogging whitelist (1 comment) [puppet] - https://gerrit.wikimedia.org/r/425810 (https://phabricator.wikimedia.org/T189691) (owner: Elukey) [13:48:40] (CR) Ottomata: "Feel free to move forward though, don't let my comment block." [puppet] - https://gerrit.wikimedia.org/r/425810 (https://phabricator.wikimedia.org/T189691) (owner: Elukey) [13:48:59] (PS1) Muehlenhoff: Enable base::service_auto_restart for prometheus-wmf-elasticsearch-exporter [puppet] - https://gerrit.wikimedia.org/r/425814 (https://phabricator.wikimedia.org/T135991) [13:49:13] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1114 from API" [mediawiki-config] - https://gerrit.wikimedia.org/r/425815 [13:49:17] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1114 from API" [mediawiki-config] - https://gerrit.wikimedia.org/r/425815 [13:49:25] (CR) jerkins-bot: [V: -1] Enable base::service_auto_restart for prometheus-wmf-elasticsearch-exporter [puppet] - https://gerrit.wikimedia.org/r/425814 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [13:50:20] ottomata: thanks :) [13:51:08] (PS2) Muehlenhoff: Enable base::service_auto_restart for prometheus-wmf-elasticsearch-exporter [puppet] - https://gerrit.wikimedia.org/r/425814 (https://phabricator.wikimedia.org/T135991) [13:51:10] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1114 from API" [mediawiki-config] - https://gerrit.wikimedia.org/r/425815 (owner: Marostegui) [13:52:30] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1114 from API" [mediawiki-config] - https://gerrit.wikimedia.org/r/425815 (owner: Marostegui) [13:52:52] (CR) jenkins-bot: mariadb: Depool es1013 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/425800 (owner: Jcrespo) [13:52:59] (CR) jenkins-bot: db-eqiad.php: Depool db1101:3318 [mediawiki-config] - https://gerrit.wikimedia.org/r/425806 
(https://phabricator.wikimedia.org/T187089) (owner: Marostegui) [13:53:01] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1114 from API" [mediawiki-config] - https://gerrit.wikimedia.org/r/425815 (owner: Marostegui) [13:54:06] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1114 into API - T191996 (duration: 01m 17s) [13:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:12] T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996 [13:56:22] (CR) Elukey: role::analytics_cluster::coordinator: add the eventlogging whitelist (1 comment) [puppet] - https://gerrit.wikimedia.org/r/425810 (https://phabricator.wikimedia.org/T189691) (owner: Elukey) [13:59:09] (PS3) Giuseppe Lavagetto: php: add module for basic installation [puppet] - https://gerrit.wikimedia.org/r/425535 [13:59:12] (PS1) Giuseppe Lavagetto: deployment-prep: add deployment-mediawiki-07 to scap targets [puppet] - https://gerrit.wikimedia.org/r/425821 (https://phabricator.wikimedia.org/T192071) [13:59:13] (PS1) Giuseppe Lavagetto: deployment-prep: make the stretch appserver act like a canary [puppet] - https://gerrit.wikimedia.org/r/425822 (https://phabricator.wikimedia.org/T192071) [13:59:39] (CR) Mforns: [C: 1] role::analytics_cluster::coordinator: add the eventlogging whitelist [puppet] - https://gerrit.wikimedia.org/r/425810 (https://phabricator.wikimedia.org/T189691) (owner: Elukey) [14:02:05] (PS2) Giuseppe Lavagetto: deployment-prep: add deployment-mediawiki-07 to scap targets [puppet] - https://gerrit.wikimedia.org/r/425821 (https://phabricator.wikimedia.org/T192071) [14:02:13] Operations, ops-codfw, DBA, Patch-For-Review: Move masters away from codfw C6 - https://phabricator.wikimedia.org/T191193#4126969 (Papaul) p:Triage>Normal [14:02:21] (CR) Giuseppe Lavagetto: [C: 2] deployment-prep: add deployment-mediawiki-07 to scap targets [puppet] - 
https://gerrit.wikimedia.org/r/425821 (https://phabricator.wikimedia.org/T192071) (owner: Giuseppe Lavagetto) [14:03:15] (PS2) Giuseppe Lavagetto: deployment-prep: make the stretch appserver act like a canary [puppet] - https://gerrit.wikimedia.org/r/425822 (https://phabricator.wikimedia.org/T192071) [14:03:20] !log increase change-prop sample rate in dev env to 60% (from 40) -- T186751 [14:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:26] T186751: Reset RESTBase dev environment - https://phabricator.wikimedia.org/T186751 [14:03:35] (PS2) Ppchelko: Enable EventBus for job events for all the wikis. [mediawiki-config] - https://gerrit.wikimedia.org/r/425601 (https://phabricator.wikimedia.org/T191464) [14:03:52] (CR) Giuseppe Lavagetto: [C: 2] deployment-prep: make the stretch appserver act like a canary [puppet] - https://gerrit.wikimedia.org/r/425822 (https://phabricator.wikimedia.org/T192071) (owner: Giuseppe Lavagetto) [14:04:38] (PS3) Mobrovac: Remove wmgDebugJobQueueEventBus config parameter. [mediawiki-config] - https://gerrit.wikimedia.org/r/404888 (owner: Ppchelko) [14:05:10] * mobrovac taking over tin [14:05:22] (PS1) Ottomata: Blacklist mediawiki.job topics from cross DC main <-> main Kafka mirroring [puppet] - https://gerrit.wikimedia.org/r/425824 (https://phabricator.wikimedia.org/T192005) [14:06:34] (CR) Mobrovac: [C: 2] Remove wmgDebugJobQueueEventBus config parameter. [mediawiki-config] - https://gerrit.wikimedia.org/r/404888 (owner: Ppchelko) [14:07:50] (Merged) jenkins-bot: Remove wmgDebugJobQueueEventBus config parameter. [mediawiki-config] - https://gerrit.wikimedia.org/r/404888 (owner: Ppchelko) [14:08:04] (CR) jenkins-bot: Remove wmgDebugJobQueueEventBus config parameter. 
[mediawiki-config] - https://gerrit.wikimedia.org/r/404888 (owner: Ppchelko) [14:11:36] !log pooling mw1265 (app server) temporarily for production traffic [14:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:24] (CR) Vgutierrez: [C: 2] install_server: Reimage lvs2006 as stretch [puppet] - https://gerrit.wikimedia.org/r/425811 (https://phabricator.wikimedia.org/T191897) (owner: Vgutierrez) [14:16:33] (PS2) Vgutierrez: install_server: Reimage lvs2006 as stretch [puppet] - https://gerrit.wikimedia.org/r/425811 (https://phabricator.wikimedia.org/T191897) [14:19:17] (PS1) Marostegui: db-eqiad.php: Depool db1114 from main traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/425826 (https://phabricator.wikimedia.org/T191996) [14:20:03] marostegui: can you wait 10 mins or so with ^ ? [14:20:09] sure! [14:20:14] just ping me when oy are done [14:20:18] *you [14:21:15] (CR) Mobrovac: [C: -1] Blacklist mediawiki.job topics from cross DC main <-> main Kafka mirroring (1 comment) [puppet] - https://gerrit.wikimedia.org/r/425824 (https://phabricator.wikimedia.org/T192005) (owner: Ottomata) [14:21:21] kk thnx [14:21:37] !log Reimage lvs2006 as stretch [14:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:51] ema: hola. want to join call a bit early so we can talk before opera joins? [14:23:51] (PS2) Ppchelko: Switch second bulk of low-traffic jobs for all wikis. [mediawiki-config] - https://gerrit.wikimedia.org/r/425544 (https://phabricator.wikimedia.org/T190327) [14:24:11] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4127030 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs2006.codfw.wmnet ``` The log can be found in `/var/lo... 
[14:25:41] (03PS2) 10Lokal Profil: Support prefixed dump types [puppet] - 10https://gerrit.wikimedia.org/r/424291 (https://phabricator.wikimedia.org/T163328) [14:28:30] (03CR) 10Lokal Profil: "@Hoo man Added the nt config for dumps and "live data" (i.e. can be resolved through /entity/Q..." [puppet] - 10https://gerrit.wikimedia.org/r/424291 (https://phabricator.wikimedia.org/T163328) (owner: 10Lokal Profil) [14:30:16] marostegui: please, go on with your patch, we have some things to work on first [14:30:25] thanks! [14:30:37] marostegui: i merged a patch that i haven't synced yet, so don't worry about that, i'll sync it afterwards [14:30:52] and it's a clean-up/no-op anyway [14:30:55] so no harm no foul [14:31:36] cool - thanks! [14:31:39] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1114 from main traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425826 (https://phabricator.wikimedia.org/T191996) (owner: 10Marostegui) [14:31:45] (03PS3) 10Ppchelko: Switch second bulk of low-traffic jobs for all wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425544 (https://phabricator.wikimedia.org/T190327) [14:31:47] 10Operations, 10ops-codfw, 10DBA: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127041 (10Papaul) @Marostegui is it okay for me to reboot the server? [14:32:12] (03PS4) 10Ppchelko: Switch second bulk of low-traffic jobs for all wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425544 (https://phabricator.wikimedia.org/T190327) [14:32:40] 10Operations, 10ops-codfw, 10DBA: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127043 (10Marostegui) @Papaul let me double check with @jcrespo as he is/was working with esXXXX servers. 
[14:33:02] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1114 from main traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425826 (https://phabricator.wikimedia.org/T191996) (owner: 10Marostegui) [14:34:37] 10Operations, 10ops-codfw, 10DBA: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127055 (10jcrespo) Not now, I will have to depool it. Give me 5 minutes. [14:34:37] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1114 from main traffic - T191996 (duration: 01m 18s) [14:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:43] T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996 [14:34:52] 10Operations, 10Scap, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4127057 (10MoritzMuehlenhoff) >>! In T191921#4126575, @MoritzMuehlenhoff wrote: > Comparing the freshly installed app server (mw1265) with an existing... [14:35:25] marostegui: am i ok to go now? [14:35:54] yep! [14:35:55] I am done [14:35:58] kk thnx [14:36:06] (03CR) 10Mobrovac: [C: 032] Switch second bulk of low-traffic jobs for all wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425544 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [14:36:18] (03PS5) 10Mobrovac: Switch second bulk of low-traffic jobs for all wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/425544 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [14:39:48] (03PS2) 10Muehlenhoff: Update SSH key for qchris [puppet] - 10https://gerrit.wikimedia.org/r/425792 (owner: 10QChris) [14:42:27] 10Operations, 10ops-codfw: lvs2006 Embedded Flash/SD-CARD iLO errors - https://phabricator.wikimedia.org/T192082#4127070 (10Vgutierrez) [14:42:30] !log ppchelko@tin Started deploy [cpjobqueue/deploy@85fbd47]: Enable second bulk of low traffic jobs for all wikis T190327 [14:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:36] T190327: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327 [14:43:06] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@85fbd47]: Enable second bulk of low traffic jobs for all wikis T190327 (duration: 00m 35s) [14:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:59] !log mobrovac@tin Synchronized wmf-config/jobqueue.php: Switch the second bulk of low-traffic jobs for all wikis - T190327 (duration: 01m 16s) [14:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:19] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic: lvs2006 Embedded Flash/SD-CARD iLO errors - https://phabricator.wikimedia.org/T192082#4127086 (10Vgutierrez) [14:45:41] (03CR) 10Muehlenhoff: [C: 032] Update SSH key for qchris [puppet] - 10https://gerrit.wikimedia.org/r/425792 (owner: 10QChris) [14:46:21] !log mobrovac@tin Synchronized wmf-config/InitialiseSettings.php: No-op: Clean up an unused global var for the EventBus-based JobQueue (duration: 01m 17s) [14:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:34] 10Operations, 10DBA: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127091 (10jcrespo) p:05Normal>03Low a:05Papaul>03jcrespo @Papaul @Marostegui 
Please don't do anything until it is clear what is the issue. [14:46:50] * mobrovac is done with tin [14:48:23] 10Operations, 10Android-app-feature-Compilations, 10Reading-Infrastructure-Team-Backlog, 10Traffic, and 2 others: Determine how to upload Zim files to Swift infrastructure - https://phabricator.wikimedia.org/T172123#4127110 (10Mholloway) a:05Mholloway>03None [14:48:25] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1114 from main traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425826 (https://phabricator.wikimedia.org/T191996) (owner: 10Marostegui) [14:48:29] (03CR) 10jenkins-bot: Switch second bulk of low-traffic jobs for all wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425544 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [14:50:03] (03PS1) 10Rush: openstack: labcontrol100[34] as jessie [puppet] - 10https://gerrit.wikimedia.org/r/425833 [14:51:18] (03CR) 10Rush: [C: 032] openstack: labcontrol100[34] as jessie [puppet] - 10https://gerrit.wikimedia.org/r/425833 (owner: 10Rush) [14:54:46] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#3276700 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by rush on neodymium.eqiad.wmnet for hosts: ``` ['labcontrol1003.wikimedia.org'] ``` The log can... [14:55:15] 10Operations, 10DBA: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127152 (10jcrespo) Now that I have a way to test it, we can proceed, depooling: ``` $ ipmitool -I lanplus -H es2013.mgmt.codfw.wmnet -U root -E chassis power status Unable to read password from environment... 
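A minimal sketch of the ipmitool check quoted above, assuming that the `-E` flag reads the password from the `IPMI_PASSWORD` environment variable; the "Unable to read password from environment" message would then mean the variable was not set in that shell. The helper names `power_status_argv` and `password_in_environment` are hypothetical and only illustrate the shape of the call.

```python
import os

# Environment variable that ipmitool's -E flag is assumed to read.
PASSWORD_ENV_VAR = "IPMI_PASSWORD"


def power_status_argv(mgmt_host, user="root"):
    """Build the argv for the remote power-status check quoted in the log."""
    return [
        "ipmitool", "-I", "lanplus",
        "-H", mgmt_host,
        "-U", user,
        "-E",  # take the password from the environment, not from argv
        "chassis", "power", "status",
    ]


def password_in_environment(env=None):
    """Return True if a non-empty password is available for -E to pick up."""
    env = os.environ if env is None else env
    return bool(env.get(PASSWORD_ENV_VAR))


if __name__ == "__main__":
    argv = power_status_argv("es2013.mgmt.codfw.wmnet")
    if not password_in_environment():
        print(f"{PASSWORD_ENV_VAR} is not set; -E has nothing to read")
    print(" ".join(argv))
```

A set variable only rules out the local cause; the value can still differ from the password stored on the management controller.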
[14:56:42] (03PS1) 10Jcrespo: mariadb: Depool es2013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425835 (https://phabricator.wikimedia.org/T191977) [14:59:44] !log shutting down es2013's mariadb [14:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:28] (03PS1) 10Vgutierrez: lvs: Fix lvs2006 interface names [puppet] - 10https://gerrit.wikimedia.org/r/425836 (https://phabricator.wikimedia.org/T191897) [15:01:33] (03Abandoned) 10Jcrespo: mariadb: Depool es2013 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425769 (owner: 10Jcrespo) [15:01:55] (03CR) 10Jcrespo: [C: 032] mariadb: Depool es2013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425835 (https://phabricator.wikimedia.org/T191977) (owner: 10Jcrespo) [15:03:46] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool es2013 (duration: 01m 17s) [15:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:43] (03CR) 10Vgutierrez: [C: 032] lvs: Fix lvs2006 interface names [puppet] - 10https://gerrit.wikimedia.org/r/425836 (https://phabricator.wikimedia.org/T191897) (owner: 10Vgutierrez) [15:06:05] !log installing django/apache security updates on labmon* [15:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:14] 10Operations, 10DBA, 10DC-Ops, 10Patch-For-Review: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127179 (10jcrespo) a:05jcrespo>03Papaul @Papaul you are now free to handle the server- it is up, but with all the service down and depooled. I would try the reset I propos... 
[15:06:29] 10Operations, 10DBA, 10DC-Ops, 10Patch-For-Review: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127182 (10jcrespo) p:05Low>03Normal [15:06:32] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/10913/" [puppet] - 10https://gerrit.wikimedia.org/r/425824 (https://phabricator.wikimedia.org/T192005) (owner: 10Ottomata) [15:06:39] (03PS2) 10Ottomata: Blacklist mediawiki.job topics from cross DC main <-> main Kafka mirroring [puppet] - 10https://gerrit.wikimedia.org/r/425824 (https://phabricator.wikimedia.org/T192005) [15:07:06] (03CR) 10Ottomata: Blacklist mediawiki.job topics from cross DC main <-> main Kafka mirroring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/425824 (https://phabricator.wikimedia.org/T192005) (owner: 10Ottomata) [15:07:22] 10Operations, 10DBA, 10DC-Ops, 10Patch-For-Review: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4123236 (10jcrespo) The reset a previous ticket suggested was T191977#4123270 (`racadm reset`) [15:08:36] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4127185 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs2006.codfw.wmnet'] ``` and were **ALL** successful. [15:08:55] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/10914/" [puppet] - 10https://gerrit.wikimedia.org/r/425824 (https://phabricator.wikimedia.org/T192005) (owner: 10Ottomata) [15:09:22] (03CR) 10Ottomata: "I won't be working tomorrow, so I'd prefer to wait until monday to merge this, ya ok?" 
[puppet] - 10https://gerrit.wikimedia.org/r/425824 (https://phabricator.wikimedia.org/T192005) (owner: 10Ottomata) [15:09:57] papaul: if it is not clear on the ticket, now you can do everything you want with es2013 (except reimage it); no rush, though, I will be in a meeting soon [15:10:49] jynus: thanks [15:10:54] (03Abandoned) 10Jcrespo: Revert "mariadb: Depool es1012 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425767 (owner: 10Jcrespo) [15:11:03] (03PS2) 10Elukey: role::analytics_cluster::coordinator: add the eventlogging whitelist [puppet] - 10https://gerrit.wikimedia.org/r/425810 (https://phabricator.wikimedia.org/T189691) [15:13:05] 10Operations, 10monitoring, 10Patch-For-Review, 10Services (watching): Add Reading Infrastructure engineers to contacts for mobileapps - https://phabricator.wikimedia.org/T189524#4127193 (10bearND) @Mholloway I was able to get to the comment changing interface, too. It just said not authorized after I hit... [15:17:22] jynus: marostegui can you assign the task for es2013 to me? It is not in ops-codfw [15:18:03] oh, I thought I did that [15:18:06] let me see [15:18:25] (03CR) 10Ppchelko: "Sure, we will not be moving on enabling EventBus for private wikis this week anyway, so no rush" [puppet] - 10https://gerrit.wikimedia.org/r/425824 (https://phabricator.wikimedia.org/T192005) (owner: 10Ottomata) [15:18:35] jynus: ok i see it now [15:18:39] thanks [15:19:08] so you are ok with "assigned" meaning "please do something" and not assigned meaning "please not yet"?
[15:19:17] (equals) [15:19:46] oh, I know what it is missing [15:19:57] it is dc-ops, but not codfw one [15:20:12] fixed now [15:20:18] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127209 (10jcrespo) [15:22:50] (03PS3) 10Elukey: role::analytics_cluster::coordinator: add the eventlogging whitelist [puppet] - 10https://gerrit.wikimedia.org/r/425810 (https://phabricator.wikimedia.org/T189691) [15:24:13] * gehel !log rolling restart of elasticsearch cirrus / eqiad for jvm upgrade [15:24:19] * gehel !log rolling restart of elasticsearch cirrus / eqiad for jvm upgrade completed [15:24:45] that was quick :-) [15:25:14] moritzm: actually started yesterday... I just messed up my last !log [15:25:33] that being said, less than 24h for a full cluster restart is not too bad! We're improving! [15:25:41] * gehel thanks volans for cumin! [15:26:06] This one doesn't look like being logged as well, does it? [15:27:03] * volans yw gehel ;) [15:27:10] To me it shows as if you used '/me !log ...'. [15:27:26] !log rolling restart of elasticsearch cirrus / eqiad for jvm upgrade completed [15:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:46] eddiegp: yep, you're right... thanks for keeping your eyes open! 
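The missed log entry above comes down to how `/me` is sent on the wire: an IRC client wraps the text as a CTCP ACTION (`\x01ACTION ...\x01`), so the payload no longer starts with `!log`. A rough sketch of that distinction follows; it assumes a bot that only logs plain messages beginning with `!log `, which is a guess at the behaviour seen here and not the real bot's code. The function names are hypothetical.

```python
CTCP_DELIM = "\x01"
ACTION_PREFIX = CTCP_DELIM + "ACTION "
LOG_PREFIX = "!log "


def parse_privmsg(payload):
    """Split a PRIVMSG payload into (is_action, text)."""
    if payload.startswith(ACTION_PREFIX) and payload.endswith(CTCP_DELIM):
        # Strip the CTCP wrapper that "/me" adds around the text.
        return True, payload[len(ACTION_PREFIX):-len(CTCP_DELIM)]
    return False, payload


def extract_log_entry(payload):
    """Return the text to log, or None if the line should be ignored."""
    is_action, text = parse_privmsg(payload)
    if is_action or not text.startswith(LOG_PREFIX):
        return None
    return text[len(LOG_PREFIX):].strip()


if __name__ == "__main__":
    plain = "!log rolling restart of elasticsearch cirrus / eqiad for jvm upgrade completed"
    action = ACTION_PREFIX + plain + CTCP_DELIM
    print(extract_log_entry(plain))
    print(extract_log_entry(action))
```

Under this assumption the same text is logged when typed as a plain message and silently dropped when typed after `/me`.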
[15:27:57] :) [15:28:45] (03CR) 10jenkins-bot: mariadb: Depool es2013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425835 (https://phabricator.wikimedia.org/T191977) (owner: 10Jcrespo) [15:40:45] (03PS1) 10Vgutierrez: pybal: Reenable bgp in lvs2006 [puppet] - 10https://gerrit.wikimedia.org/r/425839 (https://phabricator.wikimedia.org/T191897) [15:42:19] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4127256 (10Vgutierrez) [15:49:55] 10Operations, 10Scap, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4127269 (10thcipriani) As suggested in IRC, I ran `perf record` for rebuilding only the English language cdb in beta (since perf requires root). Not s... [15:53:59] (03PS1) 10Papaul: DNS: Remove DNS entries for restbase-test200[1-3] [dns] - 10https://gerrit.wikimedia.org/r/425842 (https://phabricator.wikimedia.org/T187447) [15:55:15] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127273 (10Papaul) 1- Power drain 2- Reset IDRAC 3- Update BIOS from 2.1.7 to 2.7.1 4- Update IDRAC from 2.21 to 2.52 [15:55:24] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#4127276 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['labcontrol1003.wikimedia.org'] ``` Of which those **FAILED**: ``` ['labcontrol1003.wikimedia.... [15:56:02] jouncebot: refresh [15:56:02] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127277 (10Papaul) a:05Papaul>03jcrespo [15:56:03] I refreshed my knowledge about deployments. 
[15:56:08] jouncebot: next [15:56:08] In 0 hour(s) and 3 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180412T1600) [15:56:29] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127278 (10Marostegui) It is still not working: ``` root@neodymium:/home/marostegui# ipmitool -I lanplus -H es2013.mgmt.codfw.wmnet -U root -E chassis power sta... [15:58:07] 10Operations, 10ops-codfw, 10DC-Ops, 10hardware-requests, 10Patch-For-Review: Decommission restbase-test200[123] - https://phabricator.wikimedia.org/T187447#4127282 (10Papaul) [16:00:05] godog, moritzm, and _joe_: Dear deployers, time to do the Puppet SWAT(Max 8 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180412T1600). [16:00:05] eddiegp: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:01:10] o/ [16:01:10] 10Operations, 10monitoring, 10Patch-For-Review, 10Services (watching): Add Reading Infrastructure engineers to contacts for mobileapps - https://phabricator.wikimedia.org/T189524#4127313 (10Mholloway) >>! In T189524#4127193, @bearND wrote: > @Mholloway I was able to get to the comment changing interface, t... 
[16:01:13] 10Operations, 10Pybal, 10Traffic: Rename lvs* LLDP port descriptions after upgrading to stretch - https://phabricator.wikimedia.org/T192087#4127315 (10Vgutierrez) p:05Triage>03Low [16:04:26] (03PS2) 10Jdlrobson: Rollout VirtualPageViews (final stage) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422589 (https://phabricator.wikimedia.org/T189906) [16:04:29] (03Abandoned) 10Jdlrobson: Rollout VirtualPageViews (final stage) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422589 (https://phabricator.wikimedia.org/T189906) (owner: 10Jdlrobson) [16:07:21] !log Reboot es2013 - T191977 [16:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:27] T191977: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977 [16:09:09] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#4127338 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by rush on neodymium.eqiad.wmnet for hosts: ``` ['labcontrol1003.wikimedia.org'] ``` The log can... [16:10:12] (03CR) 10Vgutierrez: [C: 032] pybal: Reenable bgp in lvs2006 [puppet] - 10https://gerrit.wikimedia.org/r/425839 (https://phabricator.wikimedia.org/T191897) (owner: 10Vgutierrez) [16:10:44] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127346 (10Marostegui) After the reboot  @papaul suggested, it still doesn't work :-( [16:10:51] Anyone time for puppet swat? 
[16:12:05] (03PS5) 10Nuria: Moving sqooping of mediawiki to the 5th of month [puppet] - 10https://gerrit.wikimedia.org/r/424473 [16:12:17] 10Operations, 10monitoring, 10Patch-For-Review, 10Services (watching): Add Reading Infrastructure engineers to contacts for mobileapps - https://phabricator.wikimedia.org/T189524#4127353 (10Dzahn) Hi, can you both login and then look at the "Logged in as " line showing up in the Icinga web ui and copy/past... [16:12:39] (03CR) 10jerkins-bot: [V: 04-1] Moving sqooping of mediawiki to the 5th of month [puppet] - 10https://gerrit.wikimedia.org/r/424473 (owner: 10Nuria) [16:15:17] (03CR) 10BBlack: [C: 031] wdqs: new wdqs-internal service [dns] - 10https://gerrit.wikimedia.org/r/424587 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [16:15:59] (03PS1) 10Marostegui: db-eqiad.php: Repool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425846 (https://phabricator.wikimedia.org/T191996) [16:16:49] (03CR) 10BBlack: [C: 031] wdqs: LVS and conftool configuration for new wdqs-internal service [puppet] - 10https://gerrit.wikimedia.org/r/424599 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [16:17:29] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425846 (https://phabricator.wikimedia.org/T191996) (owner: 10Marostegui) [16:18:19] 10Operations, 10monitoring, 10Patch-For-Review, 10Services (watching): Add Reading Infrastructure engineers to contacts for mobileapps - https://phabricator.wikimedia.org/T189524#4127378 (10bearND) @Dzahn Mine says "Logged in as BearND" (upper case B). [16:19:18] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425846 (https://phabricator.wikimedia.org/T191996) (owner: 10Marostegui) [16:21:27] godog, moritzm, and _joe_: puppet swat? 
[16:21:43] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1114 to main traffic and depool db1066 for alter table - T191996 (duration: 01m 17s) [16:21:46] !log Deploy schema change on db1066 - T132416 [16:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:49] T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996 [16:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:58] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416 [16:22:49] 10Operations, 10monitoring, 10Patch-For-Review, 10Services (watching): Add Reading Infrastructure engineers to contacts for mobileapps - https://phabricator.wikimedia.org/T189524#4127423 (10Mholloway) > Logged in as mholloway [16:24:24] (03PS6) 10Elukey: Move sqooping of mediawiki to the 5th of month [puppet] - 10https://gerrit.wikimedia.org/r/424473 (https://phabricator.wikimedia.org/T191645) (owner: 10Nuria) [16:25:14] (03CR) 10Elukey: [C: 032] Move sqooping of mediawiki to the 5th of month [puppet] - 10https://gerrit.wikimedia.org/r/424473 (https://phabricator.wikimedia.org/T191645) (owner: 10Nuria) [16:25:54] eddiegp: they might be busy (and Filippo is out today) so puppet swapt might not happen today :( [16:26:30] if any opsen is available will probably pick them up later on [16:26:31] I can do it [16:26:37] there you go :) [16:26:51] :) [16:26:59] I reviewed the 2 changes before [16:27:41] elukey: remember this is not supposed to be a think only for filippo, joe or moritz, but for all ops [16:28:26] (03PS5) 10Jcrespo: beta: Combine commons, deployments, meta and zero vhost [puppet] - 10https://gerrit.wikimedia.org/r/398399 (owner: 10EddieGP) [16:28:49] eddiegp: being beta, I was going to just merge, but need someone to test [16:29:17] jynus: yep I know, and I asked other people to help them (I also help sometimes) but it might happen that all 
people are busy and puppet swat gets delayed by a couple of days [16:29:19] Yeah, I'll test it. [16:29:27] cool then [16:29:44] I was in a meeting [16:29:49] but I already finished [16:29:58] i am on meetings too [16:30:04] :) [16:30:20] but you can do both at the same time, I cannot! [16:30:21] (I'd have tentatively tried to merge them later on) [16:30:34] (03CR) 10Jcrespo: [C: 032] beta: Combine commons, deployments, meta and zero vhost [puppet] - 10https://gerrit.wikimedia.org/r/398399 (owner: 10EddieGP) [16:31:34] eddiegp: can you run puppet on the app servers? [16:31:39] Yes. [16:31:44] on the beta ones, of course [16:32:18] I suppose puppet will reload apache automatically and we can test they still work [16:32:24] 10Operations: setup replacement for terbium (maintenance_server) on stretch - https://phabricator.wikimedia.org/T192092#4127454 (10Dzahn) [16:32:35] 10Operations: setup replacement for terbium (maintenance_server) on stretch - https://phabricator.wikimedia.org/T192092#4127464 (10Dzahn) a:03Dzahn [16:35:04] eddiegp: I get an error, is it reloading or config failed? [16:35:26] Puppet finished right now. [16:35:54] I get an error https://commons.wikimedia.beta.wmflabs.org/ [16:36:07] 503 [16:36:23] I am going to revert, there should be logs on error.log [16:36:25] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#4127472 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by rush on neodymium.eqiad.wmnet for hosts: ``` ['labcontrol1004.wikimedia.org'] ``` The log can... [16:36:47] jynus: I just noticed beta was broken before the deploy. [16:36:54] But feel free to revert anyways. 
[16:37:03] oh [16:37:12] well, let's revert, anyway [16:37:16] :-) [16:37:30] and deploy once it is up, so we can see it beeing up :-) [16:38:16] (03PS1) 10Jcrespo: Revert "beta: Combine commons, deployments, meta and zero vhost" [puppet] - 10https://gerrit.wikimedia.org/r/425850 [16:38:59] joe is working on updating the appservers in beta to stretch currently. Probably that's the reason. [16:39:20] yeah, other hosts also fail [16:39:27] that in theory hasn't touched [16:39:39] so it is unlikely not that patch [16:39:48] but I am going to undeploy anyway [16:39:48] <_joe_> did I do something? [16:39:58] no, don't worry [16:40:06] <_joe_> I'm in a meeting, sorry [16:40:10] beta apaches are missbehaving [16:40:15] <_joe_> been in meetings since 2 hours now :( [16:40:30] (03CR) 10Jcrespo: [C: 032] Revert "beta: Combine commons, deployments, meta and zero vhost" [puppet] - 10https://gerrit.wikimedia.org/r/425850 (owner: 10Jcrespo) [16:40:38] <_joe_> jynus: if the problem is the stretch host, just revert my patches [16:40:46] shinken says the en.wikipedia.beta page is unavailable for 2 hours now. [16:40:55] yeah, joe confirms [16:41:04] will deploy tomorrow, I promise [16:41:20] <_joe_> eddiegp: I think that's wrong, it was working ~ 2 hours ago when I checked [16:41:55] anyway, I am reverting, please run puppet [16:42:11] Fetching on the puppetmaster ... [16:42:57] running puppet ... [16:43:28] done [16:45:47] we will do that once the servers are ok to begin with [16:46:26] I'm not sure what you mean? 
[16:46:37] mmmm [16:46:37] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: remote ipmi doesn't work for es2013 - https://phabricator.wikimedia.org/T191977#4127531 (10Volans) 05Open>03Resolved I've fixed it, it was a case of password misalignment, see one of the cases described in T150160, [16:46:39] actually [16:46:46] I think only the redirect fails [16:46:52] https://deployment.wikimedia.beta.wmflabs.org/ fails [16:47:02] but https://deployment.wikimedia.beta.wmflabs.org/wiki/Main_Page works [16:47:27] not for me [16:48:07] I think something else is wrong [16:48:56] The creation of mediawiki-{08,09} was logged at 14:25. At 14:34 shinken started saying "503 Backend fetch failed" and has kept its state since then. [16:49:30] If nothing else that's relevant to beta happened at that time, I still think it is somehow related to the new stretch servers. [16:49:51] I think the routing is broken [16:50:01] or maybe I am just seeing a cached version [16:50:15] eddiegp: my proposal is to create: [16:50:29] (03PS1) 10Jcrespo: Revert "Revert "beta: Combine commons, deployments, meta and zero vhost"" [puppet] - 10https://gerrit.wikimedia.org/r/425858 [16:50:45] and deploy tomorrow when beta is in a better state [16:51:00] this will not get forgotten, don't worry, nor will you have to wait a week [16:51:03] I will take care of it [16:51:11] 10Operations, 10Stewards-and-global-tools, 10Wikimedia-Site-requests, 10Patch-For-Review, 10User-notice: Apply editing rate limits for all users - https://phabricator.wikimedia.org/T56515#4127556 (10Jdforrester-WMF) Tagging for the interest of stewards, especially as they often have to do all the work of...
[16:51:13] let's see the other patch [16:51:31] which should be easier because it is on production [16:51:41] (03PS4) 10Jcrespo: Run initSiteStats twice a month [puppet] - 10https://gerrit.wikimedia.org/r/415066 (https://phabricator.wikimedia.org/T59788) (owner: 10Chad) [16:52:11] That's fine, I wasn't worrying about it :) [16:52:24] 10Operations, 10monitoring, 10Patch-For-Review, 10Services (watching): Add Reading Infrastructure engineers to contacts for mobileapps - https://phabricator.wikimedia.org/T189524#4127562 (10Mholloway) I believe 'mholloway' is the canonical capitalization for me, but for the record I got the same result whe... [16:53:28] (03CR) 10Jcrespo: [C: 032] Run initSiteStats twice a month [puppet] - 10https://gerrit.wikimedia.org/r/415066 (https://phabricator.wikimedia.org/T59788) (owner: 10Chad) [16:53:51] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425846 (https://phabricator.wikimedia.org/T191996) (owner: 10Marostegui) [16:54:55] first run will be early on my sunday :-/ [16:56:19] "Error: Failed to apply catalog: Parameter monthday failed on Cron[initsitestats]: 1,15 is not a valid monthday at /etc/puppet/modules/mediawiki/manifests/maintenance/initsitestats.pp:2" [16:56:32] (03PS1) 10Jcrespo: Revert "Run initSiteStats twice a month" [puppet] - 10https://gerrit.wikimedia.org/r/425861 [16:56:44] we are not lucky today [16:56:46] huh? [16:56:48] :-) [16:57:05] hey, I am just saying what puppet says [16:57:25] I'm not blaming you. [16:57:28] (03CR) 10Jcrespo: [C: 032] Revert "Run initSiteStats twice a month" [puppet] - 10https://gerrit.wikimedia.org/r/425861 (owner: 10Jcrespo) [16:57:38] Sorry if that was the sound of it. :) [16:57:46] PROBLEM - puppet last run on francium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:57:49] I am just joking :-) [16:57:59] 10Operations, 10monitoring, 10Patch-For-Review, 10Services (watching): Add Reading Infrastructure engineers to contacts for mobileapps - https://phabricator.wikimedia.org/T189524#4127591 (10bearND) Same here. Tried with `bearND` but same result. [16:58:25] I'm comparing it to another cron patch of mine. [16:58:44] let me revert (terbium is not a place to keep puppet broken,) and see what is the issue [16:58:53] I think it should be [1, 15] instead of '1,15', but I'll have to check. [16:58:59] Yeah, revert for now. [16:59:01] I was surprised too [16:59:15] eddiegp: there is a very helpful puppet compiler [16:59:25] !log increase change-prop sample rate in dev env to 80% (from 60) -- T186751 [16:59:30] at https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/ [16:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:33] T186751: Reset RESTBase dev environment - https://phabricator.wikimedia.org/T186751 [16:59:43] jynus: I'd love to have access to that. ;) [16:59:57] Unfortunately only volunteers with NDA have the necessary permission in jenkins. [17:00:05] cscott, arlolra, subbu, halfak, and Amir1: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / Parsoid / Citoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180412T1700). [17:00:05] No GERRIT patches in the queue for this window AFAICS. [17:00:11] It's granted via the ldap group. [17:00:15] (03CR) 10Jcrespo: "This is probably ok, but there are issues on beta that prevent proper testing. Those need to go away first." 
[puppet] - 10https://gerrit.wikimedia.org/r/425858 (owner: 10Jcrespo) [17:00:28] ah, ok [17:00:33] I can run it, then [17:00:57] (03PS1) 10Jcrespo: Revert "Revert "Run initSiteStats twice a month"" [puppet] - 10https://gerrit.wikimedia.org/r/425863 [17:01:42] I'll upload a new patch. [17:01:45] (running the compiler) [17:02:00] In https://gerrit.wikimedia.org/r/#/c/382631/9/modules/mediawiki/manifests/maintenance/purge_expired_userrights.pp I indeed used an array. [17:02:02] you can ammend the one I re-reverted, too ^ [17:02:51] ah, yes [17:03:11] I am using that anyway to check the error [17:03:55] it doesn't really detect the error, because it is probably taken as a constant, and constants are not checked [17:04:17] so puppet itself is ok [17:04:24] only the value has to be different [17:04:34] (03PS2) 10EddieGP: Revert "Revert "Run initSiteStats twice a month"" [puppet] - 10https://gerrit.wikimedia.org/r/425863 (owner: 10Jcrespo) [17:06:22] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))): 15785.835643564355 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId= [17:06:28] I don't know what puppet does, I think I have to go to the source code [17:06:42] it is not documented except for scalar values [17:06:51] https://puppet.com/docs/puppet/5.3/types/cron.html#cron-attribute-monthday [17:07:22] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1002.eqiad.wmnet}[5m]))) within thresholds 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [17:08:34] eddiegp: it is either your format or ['1-15'] [17:08:40] none make sense to me [17:09:50] eddiegp: https://groups.google.com/forum/#!topic/puppet-users/nDDYgZXga9w same error [17:09:51] jynus: Grepping for mine shows we're using it in multiple manifests already. [17:10:00] And I've never seen yours before. [17:10:12] suggested there is ['1', '15'] [17:10:21] which I think is the same? [17:10:53] Both are lists, the one has the numbers as integers, the other as strings. Don't know if that matters. [17:11:06] probably not [17:11:15] but also, it is puppet, who knows [17:11:17] All other occurences in our manifests use ints, so I'd just go with that :) [17:11:23] Operations, Beta-Cluster-Infrastructure: Beta cluster Obama page often responds with 503 - https://phabricator.wikimedia.org/T188913#4127666 (Niedzielski) This is still an issue: ``` Request from 73.252.38.252 via deployment-cache-text04 deployment-cache-text04, Varnish XID 236099863 Error: 503, Backend... [17:11:25] (CR) Jcrespo: [C: 2] Revert "Revert "Run initSiteStats twice a month"" [puppet] - https://gerrit.wikimedia.org/r/425863 (owner: Jcrespo) [17:12:18] Operations, Beta-Cluster-Infrastructure: Beta cluster Obama page often responds with 503 - https://phabricator.wikimedia.org/T188913#4023631 (EddieGP) All of beta is currently down. [17:13:37] eddiegp: it worked now, let me check the output [17:14:25] Operations, Cloud-Services, cloud-services-team (Kanban): rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#4127673 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['labcontrol1003.wikimedia.org'] ``` Of which those **FAILED**: ``` ['labcontrol1003.wikimedia....
[17:15:02] crontab -u $web_user -l | grep -i initsitestats [17:15:14] # Puppet Name: initsitestats [17:15:15] 39 5 1,15 * * /usr/local/bin/foreachwiki initSiteStats.php --update > /dev/null [17:15:22] so there you go [17:15:26] sorry for the problems [17:15:26] Nice :) [17:15:33] Thank you! [17:15:41] someone should have a look at beta [17:15:46] Now we just have to hope it doesn't break on your weekend ;) [17:15:47] if you are around tomorrow [17:15:53] we can merge the other [17:16:02] I'll do that now. [17:16:08] (looking at beta that is) [17:16:21] Yeah, I'll be here most of the day tomorrow. [17:16:44] I will keep an eye on the first run, as I will be the most affected [17:17:02] !log installing patch security updates on trusty [17:17:04] but given that it is using vslow, it shouldn't be an issue in the worst case scenario [17:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:44] We've run that script manually not so far ago. [17:18:04] Operations, Beta-Cluster-Infrastructure: Beta cluster Obama page often responds with 503 - https://phabricator.wikimedia.org/T188913#4127679 (jcrespo) Giuseppe mentioned some test stretch patches on beta, it may be unrelated, but so he is aware of ongoing issues. [17:18:07] yeah, but it is all wikis, etc. [17:18:12] you never know [17:18:25] there is no one monitoring, etc. [17:18:39] there is enwiki running another related to collation [17:19:06] as a SRE/DBA, I have to consider the worse case scenario :-) [17:19:18] and if I was too concerned, I would have blocked [17:19:23] it [17:22:43] RECOVERY - puppet last run on francium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:26:20] "you never know" is of course true and I'd never argue against that, but there are patches more likely to break things and those less likely to do so. This one I'd sort in the latter category.
:) [17:27:14] !log installing apache security updates on krypton [17:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:45] !log temporarily disabling puppet agents for openssl updates and apache restarts on puppet masters [17:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:12] Operations, Commons, MediaWiki-File-management, Multimedia, and 11 others: RFC: Use content hash based image / thumb URLs - https://phabricator.wikimedia.org/T149847#4127719 (aaron) [17:35:28] !log installing apache security updates on hafnium [17:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:26] !log puppet master updates complete — re-enabling puppet agents [17:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:38] (Abandoned) Dduvall: ci: Host helm charts at integration.wikimedia.org/charts [puppet] - https://gerrit.wikimedia.org/r/425105 (https://phabricator.wikimedia.org/T191821) (owner: Dduvall) [17:38:58] (PS1) Dduvall: Host deployment charts at releases.wikimedia.org/charts [puppet] - https://gerrit.wikimedia.org/r/425870 (https://phabricator.wikimedia.org/T191821) [17:41:40] Operations, Cloud-Services, cloud-services-team (Kanban): rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#4127755 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['labcontrol1004.wikimedia.org'] ``` Of which those **FAILED**: ``` ['labcontrol1004.wikimedia.... [17:45:44] any reason not to deploy Parsoid right now? [17:47:32] !log arlolra@tin Started deploy [parsoid/deploy@1807a38]: Updating Parsoid to 322b6e8 [17:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:31] (CR) Dzahn: "yep, +1 for using releases. as recommended in today's Service Operations team meeting. i can take this and deploy it."
[puppet] - https://gerrit.wikimedia.org/r/425870 (https://phabricator.wikimedia.org/T191821) (owner: Dduvall) [17:54:22] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1975 bytes in 0.108 second response time [17:55:22] (PS2) Dzahn: Host deployment charts at releases.wikimedia.org/charts [puppet] - https://gerrit.wikimedia.org/r/425870 (https://phabricator.wikimedia.org/T191821) (owner: Dduvall) [17:56:07] (CR) Dzahn: [C: 2] Host deployment charts at releases.wikimedia.org/charts [puppet] - https://gerrit.wikimedia.org/r/425870 (https://phabricator.wikimedia.org/T191821) (owner: Dduvall) [18:00:05] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Morning SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180412T1800). [18:00:05] No GERRIT patches in the queue for this window AFAICS. [18:01:43] marxarelli: https://releases.wikimedia.org/charts/ [18:02:37] (CR) Dzahn: [C: 2] "https://releases.wikimedia.org/charts/" [puppet] - https://gerrit.wikimedia.org/r/425870 (https://phabricator.wikimedia.org/T191821) (owner: Dduvall) [18:02:41] !log arlolra@tin Finished deploy [parsoid/deploy@1807a38]: Updating Parsoid to 322b6e8 (duration: 15m 09s) [18:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:13] Operations, LDAP-Access-Requests, WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T191523#4127819 (RStallman-legalteam) @Matthias_Geisler_WMDE - I didn't see your email address on the WMDE contact page, so I wanted to double check - is it matthias.geis...
[18:15:16] Operations, Cloud-Services, cloud-services-team (Kanban): rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#4127827 (ops-monitoring-bot) Script wmf-auto-reimage was launched by rush on neodymium.eqiad.wmnet for hosts: ``` ['labcontrol1003.wikimedia.org'] ``` The log can... [18:17:23] (PS41) Aaron Schulz: Add mcrouter module and mcrouter_wancache profile and enable on beta [puppet] - https://gerrit.wikimedia.org/r/392221 [18:18:04] (CR) jerkins-bot: [V: -1] Add mcrouter module and mcrouter_wancache profile and enable on beta [puppet] - https://gerrit.wikimedia.org/r/392221 (owner: Aaron Schulz) [18:19:03] Operations, Beta-Cluster-Infrastructure: Beta cluster Obama page often responds with 503 - https://phabricator.wikimedia.org/T188913#4127842 (thcipriani) p:Triage>High hrm, everything from load.php is failing. Don't know if this is necessarily deployment-cache-text-04's problem since IIRC that's... [18:20:32] mutante: can I get your sanity check on something pleaes? I made this change https://gerrit.wikimedia.org/r/c/425833/ tryinig to reimage labcontrol1003 as jessie and it booted acting like it was installing and kicked me to an initramfs console. trying again but maybe something simple I'm missing [18:23:30] looking [18:24:16] (PS42) Aaron Schulz: Add mcrouter module and mcrouter_wancache profile and enable on beta [puppet] - https://gerrit.wikimedia.org/r/392221 [18:24:51] (CR) jerkins-bot: [V: -1] Add mcrouter module and mcrouter_wancache profile and enable on beta [puppet] - https://gerrit.wikimedia.org/r/392221 (owner: Aaron Schulz) [18:25:22] chasemp: the change itself seems good, i dont see an issue there. if you already saw the installer starting and then it kicked you initramfs i rather suspect the partman recipe [18:25:43] chasemp: can you get to installer log files from that console [18:28:26] mutante: I expected it was the same from before.
mutante yeah maybe so, it's 'reinstalling' again now. seems to be pointed at raid10-gpt-srv-lvm-ext4.cfg [18:28:33] which is not the same as labcontrol100[12] [18:28:47] so that's interesting [18:28:48] "The output will appear in /var/log/syslog, which can most easily be viewed by starting the internal webserver from the “Save debug logs” menu option (after the network has been configured). " [18:29:15] so you can try reading them from busybox or do the webserver thing [18:29:20] to see what is failing [18:33:59] mutante: ok let me see if I can make sense of that and I'll hit you back [18:43:13] mutante: https://phabricator.wikimedia.org/P6987 [18:47:21] (PS43) Aaron Schulz: Add mcrouter module and mcrouter_wancache profile and enable on beta [puppet] - https://gerrit.wikimedia.org/r/392221 [18:49:13] (Abandoned) Andrew Bogott: uwsgi: mangle .ini template to put plugin settings on the top [puppet] - https://gerrit.wikimedia.org/r/424638 (https://phabricator.wikimedia.org/T191648) (owner: Andrew Bogott) [18:50:46] Operations, Puppet: deprecate and remove --autoload in uwsgi puppet class - https://phabricator.wikimedia.org/T192102#4127914 (Andrew) [18:51:37] Operations, hardware-requests: Reclaim/Decommission (specify) hostname[S] - https://phabricator.wikimedia.org/T192103#4127930 (Ottomata) [18:51:52] Operations, hardware-requests: Decommission notebook100[12] - https://phabricator.wikimedia.org/T192103#4127930 (Ottomata) p:Triage>Normal [18:52:05] Operations, hardware-requests: Decommission notebook100[12] - https://phabricator.wikimedia.org/T192103#4127930 (Ottomata) [18:52:32] Operations, hardware-requests: Decommission notebook1001 - https://phabricator.wikimedia.org/T192103#4127930 (Ottomata) [18:53:00] (PS44) Aaron Schulz: Add mcrouter module and mcrouter_wancache profile and enable on beta [puppet] - https://gerrit.wikimedia.org/r/392221 [18:55:44] chasemp: eh..
" [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330)" doesn't sound good to start with [18:55:51] but maybe that is normal [18:55:56] (PS1) Ottomata: Mark notebook1001 as spare and remove unused paws_internal classes [puppet] - https://gerrit.wikimedia.org/r/425878 (https://phabricator.wikimedia.org/T183145) [18:56:25] the actual issue is the "couldn't detect harddisks" though [18:57:32] that's the thing where it needs more rootdelay. installer bug [18:58:05] (CR) Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/10917/notebook1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/425878 (https://phabricator.wikimedia.org/T183145) (owner: Ottomata) [18:58:08] (CR) Ottomata: [C: 2] Mark notebook1001 as spare and remove unused paws_internal classes [puppet] - https://gerrit.wikimedia.org/r/425878 (https://phabricator.wikimedia.org/T183145) (owner: Ottomata) [18:58:29] robh: ^ do you have advice how we usually handled that [18:59:05] firmware bug doesnt sound good [18:59:25] doesnt sound normal to me [18:59:34] chasemp: i was thinking it's related to https://phabricator.wikimedia.org/rOPUP7cb3ffa9571cf0876dcfae2e0b4b42f52c99f819 [18:59:39] like that used to be the fix for it [18:59:50] notebook1001 is an hwraid [18:59:51] robh: oh! ok [18:59:56] those typically dont have the rootdelay issue [19:00:02] so it should not even try software raid? [19:00:04] i see [19:00:05] thcipriani: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180412T1900). [19:00:05] No GERRIT patches in the queue for this window AFAICS. [19:00:06] its an r720xd [19:00:17] with perc 7XX controller [19:00:17] * thcipriani does train things [19:00:32] mutante: well, it seems silly to use software raid when there is a very good hardwar raid [19:00:36] plus its something like 12 disks no? [19:00:39] agree!
[19:00:50] luckily its dell [19:00:55] we can setup the raid easily in bios [19:00:59] so what should Chase pick as partman recipe then? [19:01:02] i can do it or i can walk you though it. [19:01:15] well, this just needsa flat filesystem right? [19:01:36] chasemp: ^ [19:01:43] so yeah, i'd drop to the bios [19:01:51] and then check the raid settings in tere and see what is already setup [19:01:57] chances are its already in a raid10 of all the disks [19:02:04] then it matters if it has flexbay disks or not [19:02:20] if it does, then its going to be similar to the systems where the flexbay raid1 has the os [19:02:34] and the raid10 12 disks in front loading hot swap bays are the data partition [19:02:36] default of /srv [19:02:43] Operations, hardware-requests: Decommission notebook1001 - https://phabricator.wikimedia.org/T192103#4127986 (Ottomata) [19:02:50] (I may be giving too much info, sorry ;) [19:03:08] im not sure what recipe that is, but it wouldn't be anything with raid in the title. [19:03:17] since all the 'raid' means software raid in partman recipes in our repo [19:03:37] he said "seems to be pointed at raid10-gpt-srv-lvm-ext4.cfg" [19:03:47] that is wrong if its a hardwar raid [19:03:47] so that would be wrong then [19:03:50] ack [19:04:00] thanks for the info, just passing it on to Chase [19:05:55] Operations, Beta-Cluster-Infrastructure: Beta cluster Obama page often responds with 503 - https://phabricator.wikimedia.org/T188913#4127994 (thcipriani) p:High>Normal Well the deployment-mediawiki-07 backend was the cause of 503s today. I changed the appserver backend in hiera to deployment-medi...
[19:06:53] (PS1) BryanDavis: Only warn about empty SWAT deploys [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/425881 [19:07:59] Q/win 29 [19:08:37] (CR) Niharika29: [C: 2] Only warn about empty SWAT deploys [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/425881 (owner: BryanDavis) [19:09:44] (Merged) jenkins-bot: Only warn about empty SWAT deploys [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/425881 (owner: BryanDavis) [19:12:07] Operations, Beta-Cluster-Infrastructure: Beta cluster Obama page often responds with 503 - https://phabricator.wikimedia.org/T188913#4128004 (thcipriani) This is hard to explain. So when deployment-cache-text-04 used deployment-mediawiki-07 as a backend this page was coming back with a 503: https://en.m... [19:15:04] (PS9) Urbanecm: Initial configuration for lfnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/400234 (https://phabricator.wikimedia.org/T183561) [19:15:16] (PS10) Urbanecm: Initial configuration for inhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/402658 (https://phabricator.wikimedia.org/T184374) [19:20:41] Operations, Cloud-Services, cloud-services-team (Kanban): rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#4128025 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['labcontrol1003.wikimedia.org'] ``` Of which those **FAILED**: ``` ['labcontrol1003.wikimedia....
[19:29:20] (PS1) Thcipriani: All wikis to 1.31.0-wmf.29 [mediawiki-config] - https://gerrit.wikimedia.org/r/425885 [19:32:49] (CR) Thcipriani: [C: 2] All wikis to 1.31.0-wmf.29 [mediawiki-config] - https://gerrit.wikimedia.org/r/425885 (owner: Thcipriani) [19:33:12] Operations, LDAP-Access-Requests, WMF-NDA-Requests: Request to add Tarrow to the ldap/wmde group - https://phabricator.wikimedia.org/T192060#4128044 (Dzahn) [19:33:35] Operations, LDAP-Access-Requests, WMF-NDA-Requests: Request to add Matthias Geisler to the ldap/wmde group - https://phabricator.wikimedia.org/T191523#4128045 (Dzahn) [19:34:12] (Merged) jenkins-bot: All wikis to 1.31.0-wmf.29 [mediawiki-config] - https://gerrit.wikimedia.org/r/425885 (owner: Thcipriani) [19:34:45] Operations, LDAP-Access-Requests, WMF-NDA-Requests: Request to add Tarrow to the ldap/wmde group - https://phabricator.wikimedia.org/T192060#4126243 (Dzahn) @RStallman-legalteam And.. here's another request for NDA for a WMDE developer, renamed ticket to keep them apart. [19:35:33] Operations, LDAP-Access-Requests, WMF-NDA-Requests: Request to add Tarrow to the ldap/wmde group - https://phabricator.wikimedia.org/T192060#4128063 (Dzahn) Afaict here also T191523#4125233 applies so L2 is not actually correct and whereever that template is that these tickets are created from, it s...
[19:37:46] !log routing ns0 to baham [19:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:41] !log reboot radon for kernel upgrade T188092 [19:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:39] !log thcipriani@tin rebuilt and synchronized wikiversions files: All wikis to 1.31.0-wmf.29 [19:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:09] (CR) jenkins-bot: All wikis to 1.31.0-wmf.29 [mediawiki-config] - https://gerrit.wikimedia.org/r/425885 (owner: Thcipriani) [19:44:53] Operations, LDAP-Access-Requests, WMF-NDA-Requests: Request to add Tarrow to the ldap/wmde group - https://phabricator.wikimedia.org/T192060#4128080 (Tarrow) The template is just coded into a link in an internal WMDE onboarding document. I'll let them know it should be changed. Should the first chec... [19:46:59] !log all good, revert routing ns0 to baham [19:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:54] Operations, LDAP-Access-Requests, WMF-NDA-Requests: Request to add Tarrow to the ldap/wmde group - https://phabricator.wikimedia.org/T192060#4128084 (RStallman-legalteam) @Tarrow - could you give me your full name and email address, either here or to rstallman@wikimedia.org? [19:50:50] Operations, LDAP-Access-Requests, WMF-NDA-Requests: Request to add Tarrow to the ldap/wmde group - https://phabricator.wikimedia.org/T192060#4128085 (Tarrow) I've sent an email, thanks! [19:51:22] !log routing ns1 to radon [19:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:23] !log reboot baham for kernel upgrade T188092 [19:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:44] PROBLEM - Host ns1-v6 is DOWN: PING CRITICAL - Packet loss = 100% [19:58:45] XioNoX: mmh, v6 not re-routed perhaps?
^ [19:59:00] baham just came back online anyhow [19:59:13] RECOVERY - Host ns1-v6 is UP: PING OK - Packet loss = 0%, RTA = 36.13 ms [19:59:57] ema: since when there are v6 vips? [20:00:21] there have always been v6 IPs for authdns, we just don't publish them for the world to reference [20:00:37] (our upstream NS records e.g. in the .org servers only know our IPv4 addresses) [20:01:17] good, yeah dig AAAA doesn't return anything for ns1 while A does [20:01:28] right, there was no incoming dns traffic at all when I rebooted [20:01:41] Operations, LDAP-Access-Requests, WMF-NDA-Requests: Request to add Tarrow to the ldap/wmde group - https://phabricator.wikimedia.org/T192060#4128112 (Dzahn) @Tarrow yea, the first line can be removed. The second line can be amended to say that RStallman-legalteam should be added to do that. The thi... [20:02:04] XioNoX: feel free to route ns1 traffic back to baham [20:02:08] we've discussed turning on v6 authdns before, I don't remember exactly what the past reasons were for stalling on it, much less whether they're still valid, offhand [20:02:23] ema: ok!
[20:02:33] (probably lots of reasonably-valid handwaving about relative reliability and latency of v6 in general) [20:03:04] (hrm, and probably also concerns about the possible decrease in geodns accuracy with ipv6 all things considered (including maxmind)) [20:03:14] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0 [20:03:34] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [20:04:22] !log all good, revert routing ns1 to radon [20:04:24] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1946 bytes in 0.114 second response time [20:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:53] (PS1) Ppchelko: Revert switching the TranslateUpdateJob to kafka. [mediawiki-config] - https://gerrit.wikimedia.org/r/425889 (https://phabricator.wikimedia.org/T192107) [20:20:35] (CR) Nemo bis: "Thanks for the changeset. Please follow " [mediawiki-config] - https://gerrit.wikimedia.org/r/425889 (https://phabricator.wikimedia.org/T192107) (owner: Ppchelko) [20:21:01] (PS2) Nemo bis: Revert switching the TranslateUpdateJob to kafka. [mediawiki-config] - https://gerrit.wikimedia.org/r/425889 (https://phabricator.wikimedia.org/T192107) (owner: Ppchelko) [20:22:46] (PS3) Ppchelko: Revert switching the TranslateUpdateJob to kafka. [mediawiki-config] - https://gerrit.wikimedia.org/r/425889 (https://phabricator.wikimedia.org/T192107) [20:23:35] (PS4) Legoktm: Revert switching the TranslateUpdateJob to kafka. [mediawiki-config] - https://gerrit.wikimedia.org/r/425889 (https://phabricator.wikimedia.org/T192107) (owner: Ppchelko) [20:27:45] * mobrovac taking over tin for 10 mins [20:27:49] (CR) Mobrovac: [C: 2] Revert switching the TranslateUpdateJob to kafka.
[mediawiki-config] - https://gerrit.wikimedia.org/r/425889 (https://phabricator.wikimedia.org/T192107) (owner: Ppchelko) [20:29:00] (Merged) jenkins-bot: Revert switching the TranslateUpdateJob to kafka. [mediawiki-config] - https://gerrit.wikimedia.org/r/425889 (https://phabricator.wikimedia.org/T192107) (owner: Ppchelko) [20:29:40] (CR) jenkins-bot: Revert switching the TranslateUpdateJob to kafka. [mediawiki-config] - https://gerrit.wikimedia.org/r/425889 (https://phabricator.wikimedia.org/T192107) (owner: Ppchelko) [20:32:32] !log mobrovac@tin Synchronized wmf-config/jobqueue.php: Switch TranslateUpdateJob back to the Redis-based queue as it is using PHP serialisation - T192107 (duration: 01m 00s) [20:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:38] T192107: Unable to mark pages for translation in Meta - https://phabricator.wikimedia.org/T192107 [20:33:49] !log ppchelko@tin Started deploy [cpjobqueue/deploy@bd772eb]: Revert switching TranslationUpdateJob T192107 [20:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:29] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@bd772eb]: Revert switching TranslationUpdateJob T192107 (duration: 00m 39s) [20:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:55] (PS3) Ottomata: dumps: Add rsync fetch jobs for datasets in stat1005 [puppet] - https://gerrit.wikimedia.org/r/423539 (https://phabricator.wikimedia.org/T189283) (owner: Madhuvishy) [20:36:04] (CR) Ottomata: [V: 2 C: 2] dumps: Add rsync fetch jobs for datasets in stat1005 [puppet] - https://gerrit.wikimedia.org/r/423539 (https://phabricator.wikimedia.org/T189283) (owner: Madhuvishy) [20:37:57] !log increase change-prop sample rate in dev env to 100% (from 80) -- T186751 [20:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:04] T186751: Reset RESTBase dev environment
- https://phabricator.wikimedia.org/T186751 [20:39:02] (PS1) Ottomata: Fix typo in stat_dumps jobs [puppet] - https://gerrit.wikimedia.org/r/425897 (https://phabricator.wikimedia.org/T189283) [20:39:06] madhuvishy: ya?^ [20:39:21] (CR) Ottomata: [V: 2 C: 2] Fix typo in stat_dumps jobs [puppet] - https://gerrit.wikimedia.org/r/425897 (https://phabricator.wikimedia.org/T189283) (owner: Ottomata) [20:39:53] PROBLEM - puppet last run on labstore1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:39:54] PROBLEM - puppet last run on labstore1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:40:01] +1 [20:40:17] Sorry I am walking outside. [20:40:38] (PS1) Ottomata: Use valid minute for stat_dumps fetch job [puppet] - https://gerrit.wikimedia.org/r/425898 (https://phabricator.wikimedia.org/T189283) [20:41:04] (CR) Ottomata: [V: 2 C: 2] Use valid minute for stat_dumps fetch job [puppet] - https://gerrit.wikimedia.org/r/425898 (https://phabricator.wikimedia.org/T189283) (owner: Ottomata) [20:44:53] RECOVERY - puppet last run on labstore1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [20:44:54] RECOVERY - puppet last run on labstore1007 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [21:03:21] !log temporarily disabling puppet to make (ephemeral) change to GC settings, restbase1010 -- T192112 [21:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:28] T192112: Consider using default JVM G1GC settings in the RESTBase Cassandra cluster - https://phabricator.wikimedia.org/T192112 [21:04:07] (CR) Dzahn: [C: 2] DNS: Remove DNS entries for restbase-test200[1-3] [dns] - https://gerrit.wikimedia.org/r/425842 (https://phabricator.wikimedia.org/T187447) (owner: Papaul) [21:06:59] !log restarting Cassandra, restbase1010 -- T192112
[21:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:09] (CR) Dzahn: [C: 1] "how do you get the actual numbers where it says "*" above?" [puppet] - https://gerrit.wikimedia.org/r/425202 (https://phabricator.wikimedia.org/T191727) (owner: BryanDavis) [21:08:42] (CR) Dzahn: [C: -1] "it has been said by Giuseppe that php5 should not be used anywhere anymore" [puppet] - https://gerrit.wikimedia.org/r/391045 (owner: Hoo man) [21:10:03] PROBLEM - cassandra-a CQL 10.64.0.114:9042 on restbase1010 is CRITICAL: connect to address 10.64.0.114 and port 9042: Connection refused [21:10:04] PROBLEM - cassandra-a SSL 10.64.0.114:7001 on restbase1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [21:10:26] ^^^ that's me; just restarts [21:17:04] RECOVERY - cassandra-a CQL 10.64.0.114:9042 on restbase1010 is OK: TCP OK - 0.000 second response time on 10.64.0.114 port 9042 [21:17:04] RECOVERY - cassandra-a SSL 10.64.0.114:7001 on restbase1010 is OK: SSL OK - Certificate restbase1010-a valid until 2018-08-17 16:11:05 +0000 (expires in 126 days) [21:21:24] PROBLEM - cassandra-b SSL 10.64.0.115:7001 on restbase1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [21:21:24] PROBLEM - cassandra-b CQL 10.64.0.115:9042 on restbase1010 is CRITICAL: connect to address 10.64.0.115 and port 9042: Connection refused [21:22:49] Operations, ops-codfw, DC-Ops, hardware-requests, Patch-For-Review: Decommission restbase-test200[123] - https://phabricator.wikimedia.org/T187447#4128357 (Papaul) [21:23:36] Operations, ops-codfw, DC-Ops, hardware-requests, Patch-For-Review: Decommission restbase-test200[123] - https://phabricator.wikimedia.org/T187447#3975465 (Papaul) [21:24:30] Operations, ops-eqiad, DC-Ops, hardware-requests, and 3 others: Decommission restbase-test environment - https://phabricator.wikimedia.org/T186755#4128360 (Papaul)
[21:24:35] Operations, ops-codfw, DC-Ops, hardware-requests, Patch-For-Review: Decommission restbase-test200[123] - https://phabricator.wikimedia.org/T187447#3975465 (Papaul) Open>Resolved [21:28:34] RECOVERY - cassandra-b SSL 10.64.0.115:7001 on restbase1010 is OK: SSL OK - Certificate restbase1010-b valid until 2018-08-17 16:11:06 +0000 (expires in 126 days) [21:29:24] RECOVERY - cassandra-b CQL 10.64.0.115:9042 on restbase1010 is OK: TCP OK - 0.000 second response time on 10.64.0.115 port 9042 [21:37:13] PROBLEM - cassandra-c SSL 10.64.0.116:7001 on restbase1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [21:37:44] PROBLEM - cassandra-c CQL 10.64.0.116:9042 on restbase1010 is CRITICAL: connect to address 10.64.0.116 and port 9042: Connection refused [21:43:13] RECOVERY - cassandra-c SSL 10.64.0.116:7001 on restbase1010 is OK: SSL OK - Certificate restbase1010-c valid until 2018-08-17 16:11:07 +0000 (expires in 126 days) [21:43:43] RECOVERY - cassandra-c CQL 10.64.0.116:9042 on restbase1010 is OK: TCP OK - 0.000 second response time on 10.64.0.116 port 9042 [21:44:05] Operations, netops: ulsfo<->eqord BGP down - https://phabricator.wikimedia.org/T192114#4128411 (ayounsi) [21:52:14] PROBLEM - cassandra-c SSL 10.64.0.116:7001 on restbase1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [21:52:44] PROBLEM - cassandra-c CQL 10.64.0.116:9042 on restbase1010 is CRITICAL: connect to address 10.64.0.116 and port 9042: Connection refused [21:58:14] RECOVERY - cassandra-c SSL 10.64.0.116:7001 on restbase1010 is OK: SSL OK - Certificate restbase1010-c valid until 2018-08-17 16:11:07 +0000 (expires in 126 days) [21:58:44] RECOVERY - cassandra-c CQL 10.64.0.116:9042 on restbase1010 is OK: TCP OK - 0.000 second response time on 10.64.0.116 port 9042 [22:01:37] (PS1) Ottomata: Fix rsync module refernce for stats_dumps [puppet] - https://gerrit.wikimedia.org/r/425915
(https://phabricator.wikimedia.org/T189283) [22:01:53] (CR) Ottomata: [V: 2 C: 2] Fix rsync module refernce for stats_dumps [puppet] - https://gerrit.wikimedia.org/r/425915 (https://phabricator.wikimedia.org/T189283) (owner: Ottomata) [22:03:04] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))): 74548.59339525281 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:04:04] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:07:52] !log restarting Cassandra, restbase2003 -- T192112 [22:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:00] T192112: Consider using default JVM G1GC settings in the RESTBase Cassandra cluster - https://phabricator.wikimedia.org/T192112 [22:11:13] PROBLEM - cassandra-a SSL 10.192.32.134:7001 on restbase2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [22:11:14] PROBLEM - cassandra-a CQL 10.192.32.134:9042 on restbase2003 is CRITICAL: connect to address 10.192.32.134 and port 9042: Connection refused [22:16:56] (PS3) Dzahn: toolforge: add mr (Marathi) language pack and locale [puppet] - https://gerrit.wikimedia.org/r/425202 (https://phabricator.wikimedia.org/T191727) (owner: BryanDavis) [22:17:03] (CR) Dzahn: [C: 2] toolforge: add mr (Marathi) language pack and locale [puppet]
- https://gerrit.wikimedia.org/r/425202 (https://phabricator.wikimedia.org/T191727) (owner: BryanDavis) [22:17:13] RECOVERY - cassandra-a SSL 10.192.32.134:7001 on restbase2003 is OK: SSL OK - Certificate restbase2003-a valid until 2018-08-17 16:11:49 +0000 (expires in 126 days) [22:17:14] RECOVERY - cassandra-a CQL 10.192.32.134:9042 on restbase2003 is OK: TCP OK - 0.036 second response time on 10.192.32.134 port 9042 [22:20:53] PROBLEM - cassandra-b CQL 10.192.32.135:9042 on restbase2003 is CRITICAL: connect to address 10.192.32.135 and port 9042: Connection refused [22:20:54] PROBLEM - cassandra-b SSL 10.192.32.135:7001 on restbase2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [22:24:15] (PS2) Ayounsi: Puppet: add ping_offload role and profile [puppet] - https://gerrit.wikimedia.org/r/424151 (https://phabricator.wikimedia.org/T190090) [22:30:15] (CR) Dzahn: [C: 2] "applied on tools-bastion-02 and tools-exec-1401. no issues. it installed the package, changed the config and then ran locale-gen" [puppet] - https://gerrit.wikimedia.org/r/425202 (https://phabricator.wikimedia.org/T191727) (owner: BryanDavis) [22:30:32] PROBLEM - cassandra-c CQL 10.192.32.136:9042 on restbase2003 is CRITICAL: connect to address 10.192.32.136 and port 9042: Connection refused [22:31:22] PROBLEM - cassandra-c SSL 10.192.32.136:7001 on restbase2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [22:31:51] PROBLEM - dhclient process on labcontrol1004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [22:31:51] PROBLEM - Check size of conntrack table on labcontrol1004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[22:32:29] (CR) Ayounsi: "Puppet compiler fails with the following error:" [puppet] - https://gerrit.wikimedia.org/r/424151 (https://phabricator.wikimedia.org/T190090) (owner: Ayounsi) [22:35:52] RECOVERY - Check size of conntrack table on labcontrol1004 is OK: OK: nf_conntrack is 0 % full [22:35:52] RECOVERY - dhclient process on labcontrol1004 is OK: PROCS OK: 0 processes with command name dhclient [22:36:42] RECOVERY - cassandra-b CQL 10.192.32.135:9042 on restbase2003 is OK: TCP OK - 0.036 second response time on 10.192.32.135 port 9042 [22:36:52] Operations, netops: Enabling graceful-switchover causes core dumps on cr1-codfw - https://phabricator.wikimedia.org/T191371#4128548 (ayounsi) Juniper's reply: > During the cleanup process, ksyncd will check for public nexthops to make sure that there are no public next hops remaining. If ksyncd finds a... [22:36:55] Operations, Beta-Cluster-Infrastructure: Beta cluster Obama page often responds with 503 - https://phabricator.wikimedia.org/T188913#4128549 (thcipriani) restored `deployment-mediawiki-07` as appserver backend. It seems the ferm service is having trouble starting on that machine, so the previous varnish... [22:37:31] RECOVERY - cassandra-c SSL 10.192.32.136:7001 on restbase2003 is OK: SSL OK - Certificate restbase2003-c valid until 2018-08-17 16:11:51 +0000 (expires in 126 days) [22:37:41] RECOVERY - cassandra-c CQL 10.192.32.136:9042 on restbase2003 is OK: TCP OK - 0.036 second response time on 10.192.32.136 port 9042 [22:38:22] RECOVERY - cassandra-b SSL 10.192.32.135:7001 on restbase2003 is OK: SSL OK - Certificate restbase2003-b valid until 2018-08-17 16:11:50 +0000 (expires in 126 days) [22:55:37] Operations, Analytics-Kanban, Patch-For-Review: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4128588 (Nuria) Open>Resolved [22:59:54] Nothing to SWAT this evening it seems. 
[23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Evening SWAT (Max 8 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180412T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:00:38] I'm going to merge no-op tests/ patches from Umherirrender [23:01:02] (PS2) Dereckson: Use namespaced PHPUnit\Framework\TestCase [mediawiki-config] - https://gerrit.wikimedia.org/r/421588 (https://phabricator.wikimedia.org/T188166) (owner: Umherirrender) [23:01:15] (CR) Dereckson: [C: 2] Use namespaced PHPUnit\Framework\TestCase [mediawiki-config] - https://gerrit.wikimedia.org/r/421588 (https://phabricator.wikimedia.org/T188166) (owner: Umherirrender) [23:02:02] there is only one it seems [23:02:40] (Merged) jenkins-bot: Use namespaced PHPUnit\Framework\TestCase [mediawiki-config] - https://gerrit.wikimedia.org/r/421588 (https://phabricator.wikimedia.org/T188166) (owner: Umherirrender) [23:02:55] (CR) jenkins-bot: Use namespaced PHPUnit\Framework\TestCase [mediawiki-config] - https://gerrit.wikimedia.org/r/421588 (https://phabricator.wikimedia.org/T188166) (owner: Umherirrender) [23:09:07] AaronSchulz: ping? [23:09:34] !log dereckson@tin Synchronized tests/: Update PHPUnit tests to use PHPUnit\Framework\TestCase (no-op) (duration: 01m 01s) [23:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:12] Operations, hardware-requests: eqiad: (2) systems for labvirt expansion (labvirt1023 & labvirt1024) - https://phabricator.wikimedia.org/T192119#4128624 (bd808) [23:20:57] SWATters, heads-up that I’m doing a limited deployment to one of the ORES nodes, and rolling back when finished. Shouldn’t expect the boat to rock. 
[23:22:33] !log awight@tin Started deploy [ores/deploy@a5cec53]: Canary ores1001 only: Limited test of git-lfs for ORES [23:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:03] !log awight@tin Finished deploy [ores/deploy@a5cec53]: Canary ores1001 only: Limited test of git-lfs for ORES (duration: 02m 31s) [23:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:12] !log awight@tin Started deploy [ores/deploy@543901a]: Restore ores1001 canary to master branch [23:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:56] mutante: So, I created one last /srv/deployment/ores/venv. [23:31:06] Only on ores1001.eqiad.wmnet, happily! [23:33:36] !log awight@tin Finished deploy [ores/deploy@543901a]: Restore ores1001 canary to master branch (duration: 03m 24s) [23:33:41] (PS1) Hoo man: [WIP] Wikidata JSON dump: Only dump batches of ~400,000 pages at once [puppet] - https://gerrit.wikimedia.org/r/425926 (https://phabricator.wikimedia.org/T190513) [23:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:33] (CR) Hoo man: [C: -1] "Just a WIP anyway" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/425926 (https://phabricator.wikimedia.org/T190513) (owner: Hoo man) [23:42:34] awight: oh? temporarily? [23:45:03] It was a bad rebase, basically. I’ve reverted the server code to the master branch and it probably shouldn’t happen again :) [23:54:27] Operations, monitoring, Patch-For-Review, Services (watching): Add Reading Infrastructure engineers to contacts for mobileapps - https://phabricator.wikimedia.org/T189524#4128736 (Dzahn) I debugged this by looking at the generated Icinga config directly on the server. I found that i gave you wro... [23:56:16] awight: ok :) [23:56:28] let me know if you need any root help [23:56:39] thanks for your time!