[00:01:41] PROBLEM - HTTP availability for Nginx -SSL terminators- on einsteinium is CRITICAL: cluster=cache_upload site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[00:04:00] RECOVERY - HTTP availability for Nginx -SSL terminators- on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[03:06:58] !log l10nupdate@tin scap sync-l10n completed (1.32.0-wmf.2) (duration: 11m 22s)
[03:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:27:20] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 886.81 seconds
[04:04:31] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 270.33 seconds
[04:32:50] RECOVERY - Maps - OSM synchronization lag - codfw on einsteinium is OK: (C)1.728e+05 ge (W)9e+04 ge 1.637e+04 https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1
[05:15:33] !log Deploy schema change on s6 primary master db1061 - T191519 T188299 T190148
[05:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:39] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519
[05:15:39] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148
[05:15:39] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299
[05:19:29] (PS1) Marostegui: db-eqiad.php: Depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/431507 (https://phabricator.wikimedia.org/T190148)
[05:21:14] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/431507 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[05:22:11] Operations, Patch-For-Review, Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#4185481 (Dzahn) No, it can't be closed since it's not done and tin is still the deployment server. Issue is that checkbox(es)...
[05:22:25] (Merged) jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/431507 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[05:24:04] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1081 for alter table (duration: 01m 01s)
[05:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:05] !log Deploy schema change on db1081 - T191519 T188299 T190148
[05:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:11] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519
[05:27:11] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148
[05:27:11] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299
[05:37:21] !log Unused databases devwikiinternal and rel13testwiki from s3 - T118764
[05:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:37:25] T118764: Drop old devwikiinternal and rel13testwiki from s3 - https://phabricator.wikimedia.org/T118764
[05:55:05] (CR) jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/431507 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[06:48:41] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[06:48:51] PROBLEM - apertium apy on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:49:41] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy
[06:49:50] RECOVERY - apertium apy on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5996 bytes in 0.003 second response time
[06:55:52] (CR) Muehlenhoff: [C: 1] network: add mwmaint1001 to network constants [puppet] - https://gerrit.wikimedia.org/r/430522 (https://phabricator.wikimedia.org/T192092) (owner: Dzahn)
[07:02:12] !log reimaging mw1327, mw1328, mw1329 (app servers) to stretch
[07:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:11] !log reimaging mw1233, mw1234, mw1235 (API servers) to stretch
[07:06:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:10] PROBLEM - HTTP availability for Nginx -SSL terminators- on einsteinium is CRITICAL: cluster=cache_upload site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:13:58] (PS1) Marostegui: db-eqiad.php: Depool db1074 [mediawiki-config] - https://gerrit.wikimedia.org/r/431516 (https://phabricator.wikimedia.org/T193732)
[07:15:19] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1074 [mediawiki-config] - https://gerrit.wikimedia.org/r/431516 (https://phabricator.wikimedia.org/T193732) (owner: Marostegui)
[07:15:21] RECOVERY - HTTP availability for Nginx -SSL terminators- on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:16:30] (Merged) jenkins-bot: db-eqiad.php: Depool db1074 [mediawiki-config] - https://gerrit.wikimedia.org/r/431516 (https://phabricator.wikimedia.org/T193732) (owner: Marostegui)
[07:18:32] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1074 - T193732 (duration: 00m 59s)
[07:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:37] T193732: Decommission db1060 - https://phabricator.wikimedia.org/T193732
[07:19:14] !log reimaging mw1303, mw1304 (job runners) to stretch
[07:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:35] (CR) jenkins-bot: db-eqiad.php: Depool db1074 [mediawiki-config] - https://gerrit.wikimedia.org/r/431516 (https://phabricator.wikimedia.org/T193732) (owner: Marostegui)
[07:19:51] !log Stop replication in sync on db1060 and db1074 - T193732
[07:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:00] !log Change master from db1102:s2 from db1060 to db1074
[07:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:20] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/431519
[07:29:40] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/431519 (owner: Marostegui)
[07:30:48] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/431519 (owner: Marostegui)
[07:32:18] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/431519 (owner: Marostegui)
[07:32:22] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1074 - T193732 (duration: 00m 59s)
[07:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:26] T193732: Decommission db1060 - https://phabricator.wikimedia.org/T193732
[07:33:13] (PS1) Marostegui: db-eqiad.php: db1074 is now master for sanitarium [mediawiki-config] - https://gerrit.wikimedia.org/r/431520 (https://phabricator.wikimedia.org/T193732)
[07:34:53] (CR) Marostegui: [C: 2] db-eqiad.php: db1074 is now master for sanitarium [mediawiki-config] - https://gerrit.wikimedia.org/r/431520 (https://phabricator.wikimedia.org/T193732) (owner: Marostegui)
[07:36:07] (Merged) jenkins-bot: db-eqiad.php: db1074 is now master for sanitarium [mediawiki-config] - https://gerrit.wikimedia.org/r/431520 (https://phabricator.wikimedia.org/T193732) (owner: Marostegui)
[07:37:37] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: db1074 is now db1102's master - T193732 (duration: 00m 59s)
[07:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:41] T193732: Decommission db1060 - https://phabricator.wikimedia.org/T193732
[07:38:09] (CR) jenkins-bot: db-eqiad.php: db1074 is now master for sanitarium [mediawiki-config] - https://gerrit.wikimedia.org/r/431520 (https://phabricator.wikimedia.org/T193732) (owner: Marostegui)
[07:38:44] !log restart elasticsearch on relforge for JVM upgrade
[07:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:33] (PS2) Gehel: Recommendation API: Migrate to the new WDQS internal cluster [puppet] - https://gerrit.wikimedia.org/r/430052 (https://phabricator.wikimedia.org/T190266) (owner: Mobrovac)
[07:53:24] !log installing libdatetime-timezone-perl stable update for jessie/stretch
[07:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:38] (CR) Addshore: [C: -1] "To make sure there are not any undesired things happening here I'll move everything that is currently in InitialiseSettings.php into the l" [mediawiki-config] - https://gerrit.wikimedia.org/r/431306 (https://phabricator.wikimedia.org/T184745) (owner: Addshore)
[07:59:32] !log restart cassandra on maps-test* for JVM upgrade
[07:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:53] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:01:33] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:01:48] (CR) Jcrespo: "None of the mysql-related icinga checks for sanitarium_multiinstance are paging: https://phabricator.wikimedia.org/source/operations-puppe" [puppet] - https://gerrit.wikimedia.org/r/430919 (https://phabricator.wikimedia.org/T192979) (owner: Jcrespo)
[08:01:53] PROBLEM - HHVM rendering on mw1234 is CRITICAL: connect to address 10.64.48.69 and port 80: Connection refused
[08:02:02] PROBLEM - nutcracker process on mw1233 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:02:02] PROBLEM - nutcracker port on mw1234 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:02:02] PROBLEM - mediawiki-installation DSH group on mw1235 is CRITICAL: Host mw1235 is not in mediawiki-installation dsh group
[08:02:02] PROBLEM - HHVM processes on mw1235 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[08:02:02] ^ reimages, silenced
[08:02:12] PROBLEM - Nginx local proxy to apache on mw1304 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:04:15] (PS1) Elukey: profile::druid::common: set java 8 as default [puppet] - https://gerrit.wikimedia.org/r/431522 (https://phabricator.wikimedia.org/T193712)
[08:04:17] (PS2) Jcrespo: mariadb: Reenable notifications on several core hosts [puppet] - https://gerrit.wikimedia.org/r/430919 (https://phabricator.wikimedia.org/T192979)
[08:04:34] !log restart cassandra on maps* for JVM upgrade
[08:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:50] (CR) Marostegui: [C: 1] mariadb: Reenable notifications on several core hosts [puppet] - https://gerrit.wikimedia.org/r/430919 (https://phabricator.wikimedia.org/T192979) (owner: Jcrespo)
[08:09:14] (CR) Jcrespo: [C: 1] vim: don't use Stretch's default, infuriating mouse mode [puppet] - https://gerrit.wikimedia.org/r/430937 (owner: Andrew Bogott)
[08:10:48] (CR) Marostegui: [C: 1] vim: don't use Stretch's default, infuriating mouse mode [puppet] - https://gerrit.wikimedia.org/r/430937 (owner: Andrew Bogott)
[08:10:52] (CR) Jcrespo: [C: 2] mariadb: Reenable notifications on several core hosts [puppet] - https://gerrit.wikimedia.org/r/430919 (https://phabricator.wikimedia.org/T192979) (owner: Jcrespo)
[08:14:30] (PS2) Marostegui: wiki replicas: depool labsdb1010 for MCR changes [puppet] - https://gerrit.wikimedia.org/r/430942 (https://phabricator.wikimedia.org/T174047) (owner: Bstorm)
[08:17:26] PROBLEM - Kartotherian LVS codfw on kartotherian.svc.codfw.wmnet is CRITICAL: /osm-intl/9/207/163@1.5x.png (default scaled tile) timed out before a response was received
[08:17:34] <_joe_> uh
[08:18:12] gehel: not sure if related to your restarts ^ ?
[08:18:16] gehel: around?
[08:18:21] Ah, godog was faster :)
[08:18:27] I'm here, looking
[08:18:41] the codfw one paged
[08:18:45] (CR) Marostegui: [C: 2] wiki replicas: depool labsdb1010 for MCR changes [puppet] - https://gerrit.wikimedia.org/r/430942 (https://phabricator.wikimedia.org/T174047) (owner: Bstorm)
[08:18:52] (PS1) KartikMistry: WIP: Beta: Use Restbase provided public API instead of CXServer [mediawiki-config] - https://gerrit.wikimedia.org/r/431526 (https://phabricator.wikimedia.org/T163203)
[08:19:27] RECOVERY - Kartotherian LVS codfw on kartotherian.svc.codfw.wmnet is OK: All endpoints are healthy
[08:19:41] !log Reload haproxy on dbproxy1010 to depool labsdb1010 - https://phabricator.wikimedia.org/T174047
[08:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:57] !log rolling restart of logstash to pick up Java security updates
[08:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:21] strange... the codfw restart was completed ~10' ago...
[08:20:54] and losing a single cassandra node should not prevent kartotherian from working...
[08:21:04] <_joe_> 82 active problems? eek
[08:21:10] <_joe_> oh it's appservers reimaging
[08:21:12] RECOVERY - Nginx local proxy to apache on mw1304 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.011 second response time
[08:22:03] (CR) Elukey: [C: 2] profile::druid::common: set java 8 as default [puppet] - https://gerrit.wikimedia.org/r/431522 (https://phabricator.wikimedia.org/T193712) (owner: Elukey)
[08:22:03] (PS2) Elukey: profile::druid::common: set java 8 as default [puppet] - https://gerrit.wikimedia.org/r/431522 (https://phabricator.wikimedia.org/T193712)
[08:23:58] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[08:24:52] 99%-ile response time on maps codfw climbed significantly after the cassandra restart was completed... I'll dig into it a bit more...
[08:25:06] ^ memcached is local errors caused by job runner reimages
[08:27:19] Operations: provide proxysql for stretch, add package to puppet - https://phabricator.wikimedia.org/T193919#4185794 (jcrespo) I am not sure we should keep the proxies on terbium. I will remove its reference so it doesn't get installed on new hosts until we decide which is the best host to hold it.
[08:29:49] (PS1) Ema: prometheus: varnish_thumbnails aggregation rule [puppet] - https://gerrit.wikimedia.org/r/431528 (https://phabricator.wikimedia.org/T184942)
[08:29:58] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[08:30:04] !log eqiad-prod: more weight to ms-be104[0-3] - T190081
[08:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:08] T190081: rack/setup/install ms-be104[0-3].eqiad.wmnet - https://phabricator.wikimedia.org/T190081
[08:32:15] Operations, Availability (MediaWiki-MultiDC), Performance-Team (Radar): mcrouter production architecture - https://phabricator.wikimedia.org/T192771#4185799 (Joe) After some tests: - Option B wouldn't work. Using a different `host:port` combination changes the host_id in consistent hashing so the ke...
[08:33:28] PROBLEM - HTTP availability for Nginx -SSL terminators- on einsteinium is CRITICAL: cluster=cache_upload site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:35:13] <_joe_> uhm
[08:35:24] <_joe_> now when such an alarm happens, what should I check?
[08:35:42] Operations, SRE-Access-Requests: Access to Google Search Console for Go Fish Digital - https://phabricator.wikimedia.org/T192893#4185804 (faidon) I'm not sure if this needs my approval, but if it does, it has it, as long as: - The console data contain PII, so an NDA would be absolutely required with whom...
[08:35:53] <_joe_> what are those graphs telling me
[08:36:09] the linked dashboard in there, the main dashboard contains a description of the graphs
[08:36:15] it is essentially 5xx
[08:36:28] RECOVERY - HTTP availability for Nginx -SSL terminators- on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:36:54] <_joe_> godog: right, the main dashboard has the explanation, thanks
[08:37:17] <_joe_> interesting that the nginx availability was lower than the varnish one
[08:37:24] <_joe_> which was already below 100%
[08:37:55] <_joe_> this makes me wonder if we have some timeouts that have an offset between nginx and varnish
[08:42:26] (PS1) Jcrespo: tendril: Move cron jobs to dbmonitor, remove proxysql from terbium [puppet] - https://gerrit.wikimedia.org/r/431529 (https://phabricator.wikimedia.org/T193919)
[08:42:55] heh the calculation might be a little skewed for varnish because 204s in response to purges are counted too iirc, this is evident from the main dashboard where varnish requests have spikes but nginx doesn't
[08:42:57] (CR) jerkins-bot: [V: -1] tendril: Move cron jobs to dbmonitor, remove proxysql from terbium [puppet] - https://gerrit.wikimedia.org/r/431529 (https://phabricator.wikimedia.org/T193919) (owner: Jcrespo)
[08:44:42] (CR) Filippo Giunchedi: prometheus: varnish_thumbnails aggregation rule (1 comment) [puppet] - https://gerrit.wikimedia.org/r/431528 (https://phabricator.wikimedia.org/T184942) (owner: Ema)
[08:46:09] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 848.68 seconds
[08:50:05] (CR) Gehel: [C: 2] Recommendation API: Migrate to the new WDQS internal cluster [puppet] - https://gerrit.wikimedia.org/r/430052 (https://phabricator.wikimedia.org/T190266) (owner: Mobrovac)
[08:50:10] (PS3) Gehel: Recommendation API: Migrate to the new WDQS internal cluster [puppet] - https://gerrit.wikimedia.org/r/430052 (https://phabricator.wikimedia.org/T190266) (owner: Mobrovac)
[08:53:13] (PS2) Jcrespo: tendril: Move cron jobs to dbmonitor, remove proxysql from terbium [puppet] - https://gerrit.wikimedia.org/r/431529 (https://phabricator.wikimedia.org/T193919)
[08:55:00] (CR) Jcrespo: "This should almost close (as invalid, resolved or rejected) the given tickets." [puppet] - https://gerrit.wikimedia.org/r/431529 (https://phabricator.wikimedia.org/T193919) (owner: Jcrespo)
[08:56:07] !log drain + reimage analytics103[7,8] to Debian Stretch
[08:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:41] (CR) Jcrespo: [C: -1] "https://gerrit.wikimedia.org/r/431529 should make this obsolete" [puppet] - https://gerrit.wikimedia.org/r/403978 (https://phabricator.wikimedia.org/T184797) (owner: Dzahn)
[08:58:05] !log mobrovac@tin Started restart [recommendation-api/deploy@ac66089]: Use the internal WDQS cluster LVS - T190266
[08:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:09] T190266: Switch the Recommendation API to use the internal WDQS cluster - https://phabricator.wikimedia.org/T190266
[08:59:35] (PS1) ArielGlenn: dump stubs with ascending rev ids per page for certain wikis [puppet] - https://gerrit.wikimedia.org/r/431533 (https://phabricator.wikimedia.org/T29112)
[09:02:14] !log reimaging mw1275 (T192902)
[09:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:18] T192902: Broken memory/CPU on mw1275 - https://phabricator.wikimedia.org/T192902
[09:02:27] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 41.13 seconds
[09:05:49] ACKNOWLEDGEMENT - Device not healthy -SMART- on labsdb1004 is CRITICAL: cluster=mysql device=megaraid,6 instance=labsdb1004:9100 job=node site=eqiad Marostegui https://phabricator.wikimedia.org/T194012 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labsdb1004&var-datasource=eqiad%2520prometheus%252Fops
[09:05:49] ACKNOWLEDGEMENT - Device not healthy -SMART- on labsdb1005 is CRITICAL: cluster=mysql device=megaraid,8 instance=labsdb1005:9100 job=node site=eqiad Marostegui https://phabricator.wikimedia.org/T194012 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labsdb1005&var-datasource=eqiad%2520prometheus%252Fops
[09:10:37] PROBLEM - HTTP availability for Nginx -SSL terminators- on einsteinium is CRITICAL: cluster=cache_upload site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:13:47] RECOVERY - HTTP availability for Nginx -SSL terminators- on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:15:28] eqsin has been depooled since Friday because of router issues ^
[09:15:45] T193897
[09:15:45] T193897: cr1-eqsin 4 onboard interfaces down - https://phabricator.wikimedia.org/T193897
[09:19:35] <_joe_> ema: oh I see
[09:19:39] <_joe_> sigh
[09:20:43] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - https://gerrit.wikimedia.org/r/431537
[09:21:47] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - https://gerrit.wikimedia.org/r/431537
[09:23:11] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - https://gerrit.wikimedia.org/r/431537 (owner: Marostegui)
[09:24:24] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - https://gerrit.wikimedia.org/r/431537 (owner: Marostegui)
[09:25:43] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1081 after alter table (duration: 01m 02s)
[09:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:05] (PS1) Marostegui: db-eqiad.php: Depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/431539 (https://phabricator.wikimedia.org/T190148)
[09:27:44] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/431539 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[09:28:56] (Merged) jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/431539 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui)
[09:29:43] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - https://gerrit.wikimedia.org/r/431537 (owner: Marostegui)
[09:30:14] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1084 for alter table (duration: 01m 03s)
[09:30:16] !log Deploy schema change on db1084 - T191519 T188299 T190148
[09:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:23] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519
[09:30:23] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148
[09:30:23] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299
[09:30:35] !log cp-text/upload: start varnish upgrades to 5.1.3-1wm8 T192368
[09:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:40] T192368: Unconditional return(deliver) in vcl_hit - https://phabricator.wikimedia.org/T192368
[09:33:54] (PS1) Marostegui: mariadb: Enable innodb_strict on a few roles [puppet] - https://gerrit.wikimedia.org/r/431540 (https://phabricator.wikimedia.org/T150949)
[09:35:03] PROBLEM - HTTP availability for Nginx -SSL terminators- on einsteinium is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:40:39] heh so all of these for eqsin are now false positives since eqsin is depooled, we'll have to do the alert per-site instead
[09:41:04] RECOVERY - HTTP availability for Nginx -SSL terminators- on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:41:11] I'll send the change shortly
[09:41:42] !log Manually enable innodb_strict_mode on db1084 - T150949
[09:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:46] T150949: Set barracuda InnoDB file format as the default configuration everywhere - https://phabricator.wikimedia.org/T150949
[09:42:50] godog: thanks. In an ideal world, we should even be able to downtime a whole DC :)
[09:43:13] PROBLEM - HTTP availability for Nginx -SSL terminators- on einsteinium is CRITICAL: cluster=cache_upload site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:43:22] indeed, that'd be very cool
[09:46:29] !log rolling restart of Kibana logstash nodes to pick up Java security updates
[09:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:44] RECOVERY - HTTP availability for Nginx -SSL terminators- on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:52:05] !log stop graphite cassandra-metrics-collector on aqs* (touch /etc/cassandra-metrics-collector/disable)
[09:52:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:29] !log reimaging mw1305, mw1306 (job runners) to stretch
[09:52:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:44] (CR) Jcrespo: [C: 1] mariadb: Enable innodb_strict on a few roles [puppet] - https://gerrit.wikimedia.org/r/431540 (https://phabricator.wikimedia.org/T150949) (owner: Marostegui)
[09:56:00] (CR) Marostegui: [C: 2] mariadb: Enable innodb_strict on a few roles [puppet] - https://gerrit.wikimedia.org/r/431540 (https://phabricator.wikimedia.org/T150949) (owner: Marostegui)
[09:59:01] RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 4.02, 4.74, 3.54
[09:59:11] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 3.97, 4.70, 3.51
[10:00:20] RECOVERY - HHVM processes on mw1235 is OK: PROCS OK: 6 processes with command name hhvm
[10:00:49] (PS1) Filippo Giunchedi: prometheus: alert on per-site HTTP availability [puppet] - https://gerrit.wikimedia.org/r/431542 (https://phabricator.wikimedia.org/T186069)
[10:01:27] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/431543 (https://phabricator.wikimedia.org/T128546)
[10:01:50] PROBLEM - Apache HTTP on mw1233 is CRITICAL: connect to address 10.64.48.68 and port 80: Connection refused
[10:01:51] PROBLEM - HHVM rendering on mw1233 is CRITICAL: connect to address 10.64.48.68 and port 80: Connection refused
[10:01:51] PROBLEM - Check systemd state on mw1234 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:01:51] PROBLEM - nutcracker port on mw1235 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused
[10:02:00] PROBLEM - Check systemd state on mw1233 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:02:00] PROBLEM - Check systemd state on mw1235 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:02:01] (CR) jerkins-bot: [V: -1] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/431543 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:02:10] PROBLEM - Apache HTTP on mw1234 is CRITICAL: connect to address 10.64.48.69 and port 80: Connection refused
[10:02:11] PROBLEM - nutcracker port on mw1233 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused
[10:02:11] PROBLEM - HHVM rendering on mw1234 is CRITICAL: connect to address 10.64.48.69 and port 80: Connection refused
[10:02:20] PROBLEM - nutcracker port on mw1234 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused
[10:02:20] PROBLEM - HHVM rendering on mw1235 is CRITICAL: connect to address 10.64.48.70 and port 80: Connection refused
[10:02:20] PROBLEM - Nginx local proxy to apache on mw1235 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.008 second response time
[10:02:21] PROBLEM - nutcracker process on mw1235 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nutcracker), command name nutcracker
[10:02:30] PROBLEM - nutcracker process on mw1234 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nutcracker), command name nutcracker
[10:02:40] PROBLEM - Nginx local proxy to apache on mw1234 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.008 second response time
[10:02:41] PROBLEM - Apache HTTP on mw1235 is CRITICAL: connect to address 10.64.48.70 and port 80: Connection refused
[10:03:02] (CR) Jdrewniak: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/431543 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:03:45] those are reimages --^ (see SAL)
[10:04:20] (CR) Jdrewniak: [C: 2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/431543 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:05:00] RECOVERY - Check systemd state on mw1233 is OK: OK - running: The system is fully operational
[10:05:20] RECOVERY - nutcracker port on mw1233 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[10:05:21] RECOVERY - nutcracker process on mw1233 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker
[10:05:27] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/431543 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:05:30] RECOVERY - nutcracker process on mw1235 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker
[10:05:31] RECOVERY - Nginx local proxy to apache on mw1235 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 9.243 second response time
[10:05:50] RECOVERY - Apache HTTP on mw1235 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.042 second response time
[10:05:51] RECOVERY - Apache HTTP on mw1233 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.079 second response time
[10:06:00] RECOVERY - nutcracker port on mw1235 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[10:06:01] RECOVERY - Check systemd state on mw1235 is OK: OK - running: The system is fully operational
[10:06:17] (CR) Filippo Giunchedi: [C: 1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/430082 (owner: Alexandros Kosiaris)
[10:06:28] massive subscription spamming to wikimedia mailing lists going on from aol.com addresses
[10:07:10] RECOVERY - HHVM rendering on mw1233 is OK: HTTP OK: HTTP/1.1 200 OK - 76777 bytes in 8.843 second response time
[10:07:21] RECOVERY - HHVM rendering on mw1235 is OK: HTTP OK: HTTP/1.1 200 OK - 76775 bytes in 0.637 second response time
[10:07:59] Operations, Patch-For-Review: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#4185940 (MoritzMuehlenhoff)
[10:08:02] Operations, Patch-For-Review: spare/unused disks on application servers - https://phabricator.wikimedia.org/T106381#4185937 (MoritzMuehlenhoff) Open>Resolved a:MoritzMuehlenhoff All our current mediawiki servers have two disks and all partman recipes were switched to use mw-raid1-lvm.cfg (and...
[10:09:20] RECOVERY - Apache HTTP on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 5.130 second response time
[10:09:28] !log jdrewniak@tin Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:431543|Bumping portals to master (T128546)]] (duration: 01m 00s)
[10:09:30] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 76777 bytes in 8.650 second response time
[10:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:35] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[10:09:50] RECOVERY - Nginx local proxy to apache on mw1234 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 2.738 second response time
[10:10:28] !log jdrewniak@tin Synchronized portals: Wikimedia Portals Update: [[gerrit:431543|Bumping portals to master (T128546)]] (duration: 00m 59s)
[10:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:19] (CR) Alexandros Kosiaris: [C: 2] prometheus: Add buddyinfo collector to node exporter [puppet] - https://gerrit.wikimedia.org/r/430082 (owner: Alexandros Kosiaris)
[10:12:23] (PS2) Alexandros Kosiaris: prometheus: Add buddyinfo collector to node exporter [puppet] - https://gerrit.wikimedia.org/r/430082
[10:12:26] (CR) Alexandros Kosiaris: [V: 2 C: 2] prometheus: Add buddyinfo collector to node exporter [puppet] - https://gerrit.wikimedia.org/r/430082 (owner: Alexandros Kosiaris)
[10:12:51] (PS2) Filippo Giunchedi: prometheus: alert on per-site HTTP availability [puppet] - https://gerrit.wikimedia.org/r/431542 (https://phabricator.wikimedia.org/T186069)
[10:13:20] akosiaris: \o/ thanks for looking at the number of metrics too
[10:13:31] RECOVERY - nutcracker port on mw1234 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[10:13:41] RECOVERY - nutcracker process on mw1234 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker
[10:13:50] (PS2) ArielGlenn: dump stubs with ascending rev ids per page for certain wikis [puppet] - https://gerrit.wikimedia.org/r/431533 (https://phabricator.wikimedia.org/T29112)
[10:14:10] RECOVERY - Check systemd state on mw1234 is OK: OK - running: The system is fully operational
[10:14:23] (CR) ArielGlenn: [C: 2] dump stubs with ascending rev ids per page for certain wikis [puppet] - https://gerrit.wikimedia.org/r/431533 (https://phabricator.wikimedia.org/T29112) (owner: ArielGlenn)
[10:14:52] godog: :-)
[10:14:55] merged
[10:15:26] (PS2) ArielGlenn: remove production roles from ms1001, dataset1001 [puppet] - https://gerrit.wikimedia.org/r/429728 (https://phabricator.wikimedia.org/T182540)
[10:16:30] (CR) ArielGlenn: [C: 2] remove production roles from ms1001, dataset1001 [puppet] - https://gerrit.wikimedia.org/r/429728 (https://phabricator.wikimedia.org/T182540) (owner: ArielGlenn)
[10:16:31] RECOVERY - nutcracker process on mw1275 is OK: PROCS OK: 1 process with UID =
113 (nutcracker), command name nutcracker [10:16:51] RECOVERY - Check systemd state on mw1275 is OK: OK - running: The system is fully operational [10:17:00] PROBLEM - Check correctness of the icinga configuration on einsteinium is CRITICAL: Icinga configuration contains errors [10:17:10] RECOVERY - nutcracker port on mw1275 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [10:18:30] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/11127/" [puppet] - 10https://gerrit.wikimedia.org/r/431542 (https://phabricator.wikimedia.org/T186069) (owner: 10Filippo Giunchedi) [10:34:25] (03PS1) 10Elukey: role::aqs: deprecate cassandra-metrics-collector [puppet] - 10https://gerrit.wikimedia.org/r/431546 (https://phabricator.wikimedia.org/T186567) [10:35:07] PROBLEM - HHVM jobrunner on mw1305 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:35:43] ^reimages, silencing [10:35:45] (03CR) 10Ema: prometheus: alert on per-site HTTP availability (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/431542 (https://phabricator.wikimedia.org/T186069) (owner: 10Filippo Giunchedi) [10:37:06] RECOVERY - Check correctness of the icinga configuration on einsteinium is OK: Icinga configuration is correct [10:37:28] (03PS2) 10Elukey: role::aqs: deprecate cassandra-metrics-collector [puppet] - 10https://gerrit.wikimedia.org/r/431546 (https://phabricator.wikimedia.org/T186567) [10:50:49] !log depooled service nginx for mw1221-mw1231 (API servers) [10:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:06] RECOVERY - HHVM jobrunner on mw1305 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [10:54:50] (03PS1) 10Sbisson: Bump kartotherian_storage_id to v4 [puppet] - 10https://gerrit.wikimedia.org/r/431550 (https://phabricator.wikimedia.org/T191655) [10:55:13] (03PS1) 10ArielGlenn: remove a bunch of manifests only used on dataset1001, ms1001 [puppet] - 
10https://gerrit.wikimedia.org/r/431551 (https://phabricator.wikimedia.org/T182540) [10:58:27] 10Operations, 10ops-eqiad: Broken memory/CPU on mw1275 - https://phabricator.wikimedia.org/T192902#4186188 (10MoritzMuehlenhoff) 05Open>03Resolved The server has been reimaged and is currently serving production traffic without any issues, closing. [10:59:40] !log reimaging mw1330, mw1331, mw1332 (app servers) to stretch [10:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:06] jan_drewniak: I, the Bot under the Fountain, allow thee, The Deployer, to do Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180507T1100). [11:01:37] (03CR) 10ArielGlenn: [C: 032] remove a bunch of manifests only used on dataset1001, ms1001 [puppet] - 10https://gerrit.wikimedia.org/r/431551 (https://phabricator.wikimedia.org/T182540) (owner: 10ArielGlenn) [11:04:37] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for clamav-freshclam [puppet] - 10https://gerrit.wikimedia.org/r/427916 (https://phabricator.wikimedia.org/T135991) [11:05:07] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for smartd [puppet] - 10https://gerrit.wikimedia.org/r/419769 (https://phabricator.wikimedia.org/T135991) [11:07:09] RECOVERY - mediawiki-installation DSH group on mw1275 is OK: OK [11:07:10] PROBLEM - HTTP availability for Nginx -SSL terminators- on einsteinium is CRITICAL: cluster=cache_upload site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:08:47] "There have been multiple failed attempts to log in to your account from a new device" [11:08:51] Hmm [11:08:55] Started again [11:09:35] Actually about 2 hours ago according to grafana [11:09:47] yeah, same to me on MABot [11:10:29] RECOVERY - HTTP availability for Nginx -SSL terminators- on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:18:43] <_joe_> one should guess Bsadowski1, Hauskatze ack [11:19:26] _joe_: I've filed T194025 for the current issue [11:19:26] T194025: Thousands of logging throttled into Wikimedia accounts going on (ongoing bruteforce attack?) - https://phabricator.wikimedia.org/T194025 [11:20:07] <_joe_> Hauskatze: there is another ticket about this [11:20:26] _joe_: I just saw after andre__ mentioned it on the task [11:21:01] if I could subscribe to T193762 I could setup some blocks [11:21:29] <_joe_> Hauskatze: I think security is handling it. [11:21:41] perfect [11:28:23] !log installing Java security updates on elastic* hosts [11:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:21] (03PS1) 10KartikMistry: apertium-streamparser: Initial Debian packaging [debs/contenttranslation/apertium-streamparser] - 10https://gerrit.wikimedia.org/r/431553 (https://phabricator.wikimedia.org/T192987) [11:47:30] 10Operations, 10Traffic: Setup a new PKI software as an alternative to the puppet CA for managing services certificates - https://phabricator.wikimedia.org/T194031#4186323 (10Joe) p:05Triage>03Normal [11:56:45] (03CR) 10Filippo Giunchedi: prometheus: alert on per-site HTTP availability (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/431542 (https://phabricator.wikimedia.org/T186069) (owner: 10Filippo Giunchedi) [12:00:29] Hauskatze: T194025 might be a dup of T193769? [12:00:30] T193769: Thousands of failed login attempts (wrong password) - https://phabricator.wikimedia.org/T193769 [12:00:30] T194025: Thousands of logging throttled into Wikimedia accounts going on (ongoing bruteforce attack?) 
- https://phabricator.wikimedia.org/T194025 [12:00:32] Ah meh, gone [12:00:51] PROBLEM - MegaRAID on analytics1032 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough [12:02:02] RECOVERY - mediawiki-installation DSH group on mw1235 is OK: OK [12:02:13] andre__: I think there's already a public task to dupe it against [12:02:54] yeah [12:03:53] there is. just wasn't sure it's the same thing. Thanks! [12:05:54] (03CR) 10Filippo Giunchedi: [C: 031] Enable base::service_auto_restart for clamav-freshclam [puppet] - 10https://gerrit.wikimedia.org/r/427916 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:07:18] 10Operations, 10Phabricator, 10SRE-Access-Requests: Phabricator admin request for Ladsgroup - https://phabricator.wikimedia.org/T194030#4186382 (10Peachey88) [12:14:49] (03CR) 10Gehel: [C: 031] "LGTM, puppet compiler agrees: https://puppet-compiler.wmflabs.org/compiler02/11131/" [puppet] - 10https://gerrit.wikimedia.org/r/431550 (https://phabricator.wikimedia.org/T191655) (owner: 10Sbisson) [12:15:08] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Encrypt syslog traffic - https://phabricator.wikimedia.org/T136312#4186414 (10fgiunchedi) Upstream has fixed the issue, should be included in the next rsyslog release. When that happens we'll try it out on the central syslog servers. 
[12:16:01] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Monitor and alarm on SMART attributes - https://phabricator.wikimedia.org/T86552#4186415 (10fgiunchedi) [12:27:33] (03PS2) 10Gehel: Bump kartotherian_storage_id to v4 [puppet] - 10https://gerrit.wikimedia.org/r/431550 (https://phabricator.wikimedia.org/T191655) (owner: 10Sbisson) [12:27:39] !log reimaging mw1333 to stretch (last app server in eqiad) [12:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:30] (03CR) 10Gehel: [C: 032] Bump kartotherian_storage_id to v4 [puppet] - 10https://gerrit.wikimedia.org/r/431550 (https://phabricator.wikimedia.org/T191655) (owner: 10Sbisson) [12:30:51] RECOVERY - MegaRAID on analytics1032 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy [12:42:03] (03PS3) 10Elukey: role::aqs: deprecate cassandra-metrics-collector [puppet] - 10https://gerrit.wikimedia.org/r/431546 (https://phabricator.wikimedia.org/T186567) [12:42:31] 10Operations, 10User-fgiunchedi: mw1230 sdb "Raw_Read_Error_Rate" SMART - https://phabricator.wikimedia.org/T194036#4186489 (10fgiunchedi) [12:43:51] * elukey stares at analytics1032 [12:46:44] (03CR) 10Elukey: "PCC: https://puppet-compiler.wmflabs.org/compiler02/11132/" [puppet] - 10https://gerrit.wikimedia.org/r/431546 (https://phabricator.wikimedia.org/T186567) (owner: 10Elukey) [12:47:55] (03PS4) 10Elukey: role::aqs: deprecate cassandra-metrics-collector [puppet] - 10https://gerrit.wikimedia.org/r/431546 (https://phabricator.wikimedia.org/T186567) [12:50:16] !log sbisson@tin Started deploy [kartotherian/deploy@425f279]: Kartotherian: stop using tilerator_storage_id; Add babel upstream of osm-intl [12:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:27] (03CR) 10Filippo Giunchedi: "I think it'd be better to use ensure => for cassandra::metrics to stop the systemd service and do the necessary cleanups, as opposed to do" [puppet] - 
10https://gerrit.wikimedia.org/r/431546 (https://phabricator.wikimedia.org/T186567) (owner: 10Elukey) [12:55:57] !log sbisson@tin Finished deploy [kartotherian/deploy@425f279]: Kartotherian: stop using tilerator_storage_id; Add babel upstream of osm-intl (duration: 05m 41s) [12:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:27] (03PS1) 10Sbisson: Make test and test2 using maps i18n correctly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431561 (https://phabricator.wikimedia.org/T191655) [12:59:40] (03PS5) 10Elukey: role::aqs: deprecate cassandra-metrics-collector [puppet] - 10https://gerrit.wikimedia.org/r/431546 (https://phabricator.wikimedia.org/T186567) [13:00:05] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European Mid-day SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180507T1300). [13:00:05] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:01:30] I can SWAT today [13:01:38] Urbanecm: around for swat? [13:02:58] zeljkof: I added my patch literally 1s after the bot announcement, and I'm here :) [13:03:22] stephanebisson: want to deploy it yourself, or should I? [13:04:03] I cannot do it. [13:04:20] in, that case, I'm on it :) [13:04:25] I'm here [13:04:27] zeljkof, :) [13:04:29] Sorry for lateness [13:04:43] stephanebisson: I'll ping you in a few minutes when it's at mwdebug [13:04:59] (03CR) 10Elukey: [C: 04-1] "We apologize for the delay, relax and enjoy your day while Luca tries to make a decent code change." 
[puppet] - 10https://gerrit.wikimedia.org/r/431546 (https://phabricator.wikimedia.org/T186567) (owner: 10Elukey) [13:05:14] Urbanecm: no problem, you are next then, after stephanebisson, but your patch says "do not deploy" :/ [13:05:21] forgot to remove the comment? [13:05:42] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431561 (https://phabricator.wikimedia.org/T191655) (owner: 10Sbisson) [13:05:58] 10Operations: Ship host syslogs to ELK - https://phabricator.wikimedia.org/T193766#4186575 (10fgiunchedi) >>! In T193766#4183305, @herron wrote: >>>! In T193766#4181267, @fgiunchedi wrote: >> * Capacity - I chatted with @gehel at the last ops friday hangout about ELK and friends, it would be nice to get our feet... [13:06:00] zeljkof, gerrit doesn't support remove comments. It was a CR-1, I removed the code review. [13:06:06] If I should do anything else, please let me know [13:06:29] Urbanecm: all good, feel free to rebase the patch and leave a new comment saying it's ready :) [13:06:35] Ok, will do. [13:06:40] (03PS2) 10Urbanecm: Enable on Marathi Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429763 (https://phabricator.wikimedia.org/T193371) [13:06:48] (03Merged) 10jenkins-bot: Make test and test2 using maps i18n correctly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431561 (https://phabricator.wikimedia.org/T191655) (owner: 10Sbisson) [13:06:50] (03CR) 10Urbanecm: "The patch is ready for deployment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429763 (https://phabricator.wikimedia.org/T193371) (owner: 10Urbanecm) [13:06:56] (03CR) 10Ema: [C: 031] prometheus: alert on per-site HTTP availability (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/431542 (https://phabricator.wikimedia.org/T186069) (owner: 10Filippo Giunchedi) [13:12:03] stephanebisson: mwdebug1002 is taking forever to sync... 
[13:12:26] zeljkof: no worries [13:12:45] 10Operations, 10Traffic, 10Goal: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555#4186612 (10Vgutierrez) [13:12:45] stephanebisson: ok, the patch is finally at mwdebug1002, please test and let me know if I can deploy it [13:12:49] 10Operations, 10Traffic: Gather 24h data cluster wide of AES128-SHA usage - https://phabricator.wikimedia.org/T193376#4186609 (10Vgutierrez) 05Open>03Resolved a:03Vgutierrez [13:12:51] on it [13:13:17] (03PS3) 10Zfilipin: Enable on Marathi Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429763 (https://phabricator.wikimedia.org/T193371) (owner: 10Urbanecm) [13:14:40] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for clamav-freshclam [puppet] - 10https://gerrit.wikimedia.org/r/427916 (https://phabricator.wikimedia.org/T135991) [13:15:10] (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for clamav-freshclam [puppet] - 10https://gerrit.wikimedia.org/r/427916 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:15:39] zeljkof: looks good [13:15:49] stephanebisson: ok, deploying [13:16:17] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429763 (https://phabricator.wikimedia.org/T193371) (owner: 10Urbanecm) [13:16:55] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:431561|Make test and test2 using maps i18n correctly (T191655)]] (duration: 01m 00s) [13:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:59] T191655: Deploy maps internationalization to production - https://phabricator.wikimedia.org/T191655 [13:17:12] stephanebisson: it's deployed, please test and thanks for deploying with #releng! 
;) [13:17:12] (03PS2) 10Addshore: BETA ONLY - WikibaseLexeme config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431306 (https://phabricator.wikimedia.org/T184745) [13:17:27] (03Merged) 10jenkins-bot: Enable on Marathi Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429763 (https://phabricator.wikimedia.org/T193371) (owner: 10Urbanecm) [13:17:28] zeljkof: thank you! [13:18:06] Urbanecm: the patch is at mwdebug [13:18:11] zeljkof, ack, testing [13:18:18] (03PS1) 10Addshore: BETA ONLY - Enable WikibaseLexeme on BETA wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431563 (https://phabricator.wikimedia.org/T191459) [13:18:20] (03CR) 10jerkins-bot: [V: 04-1] BETA ONLY - WikibaseLexeme config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431306 (https://phabricator.wikimedia.org/T184745) (owner: 10Addshore) [13:18:38] (03CR) 10Addshore: [C: 04-2] "Not Yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431563 (https://phabricator.wikimedia.org/T191459) (owner: 10Addshore) [13:19:20] (03PS6) 10Elukey: role::aqs: deprecate cassandra-metrics-collector [puppet] - 10https://gerrit.wikimedia.org/r/431546 (https://phabricator.wikimedia.org/T186567) [13:19:23] (03CR) 10jerkins-bot: [V: 04-1] BETA ONLY - Enable WikibaseLexeme on BETA wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431563 (https://phabricator.wikimedia.org/T191459) (owner: 10Addshore) [13:20:12] (03PS3) 10Addshore: BETA ONLY - WikibaseLexeme config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431306 (https://phabricator.wikimedia.org/T184745) [13:20:35] (03PS2) 10Addshore: BETA ONLY - Enable WikibaseLexeme on BETA wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431563 (https://phabricator.wikimedia.org/T191459) [13:20:39] zeljkof, follow-up needed. Will upload. 
[13:20:47] Urbanecm: ok [13:21:19] (03CR) 10jerkins-bot: [V: 04-1] BETA ONLY - WikibaseLexeme config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431306 (https://phabricator.wikimedia.org/T184745) (owner: 10Addshore) [13:21:41] (03CR) 10jerkins-bot: [V: 04-1] BETA ONLY - Enable WikibaseLexeme on BETA wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431563 (https://phabricator.wikimedia.org/T191459) (owner: 10Addshore) [13:22:12] (03PS1) 10Urbanecm: Follow-up for I6af1fc2f [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431564 (https://phabricator.wikimedia.org/T193371) [13:22:35] zeljkof, can you merge&push to mwdebug 431564 as well? [13:22:40] going to add it to the calendar as well [13:22:42] (03PS4) 10Addshore: BETA ONLY - WikibaseLexeme config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431306 (https://phabricator.wikimedia.org/T184745) [13:22:53] Urbanecm: reviewing [13:23:00] (03PS3) 10Addshore: BETA ONLY - Enable WikibaseLexeme on BETA wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431563 (https://phabricator.wikimedia.org/T191459) [13:23:07] * addshore is sorry for the spam [13:23:50] zeljkof, ack [13:24:19] Urbanecm: could you please amend the commit message saying what it does? [13:24:57] (03PS2) 10Urbanecm: Follow-up for I6af1fc2f [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431564 (https://phabricator.wikimedia.org/T193371) [13:24:58] zeljkof, better? [13:25:02] something like "Enable MapFrame for (project)" and "Follow-up for (sha) as description [13:25:07] Urbanecm: looking [13:25:18] Oh, you probably want it in reverse order [13:25:25] Urbanecm: better, but reverse :) [13:25:44] (03PS3) 10Urbanecm: Set wgKartographerEnableMapFrame to true for mrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431564 (https://phabricator.wikimedia.org/T193371) [13:25:48] zeljkof, now? 
[13:25:58] yes, since the first line is the subject, the rest is the description, it does not make much sense to put "follow up" as the subject :) [13:26:17] Urbanecm: looks good! :) [13:26:18] (y) [13:26:56] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431564 (https://phabricator.wikimedia.org/T193371) (owner: 10Urbanecm) [13:27:59] (03Merged) 10jenkins-bot: Set wgKartographerEnableMapFrame to true for mrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431564 (https://phabricator.wikimedia.org/T193371) (owner: 10Urbanecm) [13:28:18] Looking into zuul... Postmerge jobs lasting for ~4 hours are normal? [13:28:47] Urbanecm: it's at mwdebug, and no, it's a problem, will talk with my team soon about it [13:28:56] Great! [13:29:55] Urbanecm: to make it more clear, 431564 is at mwdebug; postmerge jobs in queue for hours is a problem :) [13:30:03] Understood [13:32:27] zeljkof, are you sure that it is at mwdebug? [13:32:41] It either isn't at mwdebug or there's something strange happening [13:32:48] Urbanecm: oops, let me double check, [13:33:36] Urbanecm: try again [13:33:49] (03PS7) 10Elukey: role::aqs: deprecate cassandra-metrics-collector [puppet] - 10https://gerrit.wikimedia.org/r/431546 (https://phabricator.wikimedia.org/T186567) [13:33:53] Now it's working as expected. Please push things into prod! [13:33:55] Thanks! 
[13:34:21] Urbanecm: oops, my mistake then the first time, did not push to mwdebug, deploying [13:34:38] (03PS3) 10Filippo Giunchedi: prometheus: alert on per-site HTTP availability [puppet] - 10https://gerrit.wikimedia.org/r/431542 (https://phabricator.wikimedia.org/T186069) [13:35:05] Thank you [13:35:38] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:431564|Set wgKartographerEnableMapFrame to true for mrwiki (T193371)]] (duration: 01m 00s) [13:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:43] T193371: Enable on Marathi Wikipedia - https://phabricator.wikimedia.org/T193371 [13:35:47] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: alert on per-site HTTP availability [puppet] - 10https://gerrit.wikimedia.org/r/431542 (https://phabricator.wikimedia.org/T186069) (owner: 10Filippo Giunchedi) [13:36:00] Urbanecm: deployed, please check and thanks for deploying with #releng :D [13:36:23] !log EU SWAT finished [13:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:37] Working, thanks! 
As you know, I have no opinion than deploy with releng :D [13:36:56] :D [13:37:28] (03PS1) 10Marostegui: db-eqiad.php: Promote db1069 to be x1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431566 (https://phabricator.wikimedia.org/T186320) [13:37:44] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover to happen" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431566 (https://phabricator.wikimedia.org/T186320) (owner: 10Marostegui) [13:39:06] (03PS1) 10Marostegui: x1.hosts: db1069 is the new x1 master [software] - 10https://gerrit.wikimedia.org/r/431567 (https://phabricator.wikimedia.org/T186320) [13:39:19] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover to happen" [software] - 10https://gerrit.wikimedia.org/r/431567 (https://phabricator.wikimedia.org/T186320) (owner: 10Marostegui) [13:41:41] (03PS1) 10Marostegui: mariadb: db1069 is now x1 master [puppet] - 10https://gerrit.wikimedia.org/r/431568 (https://phabricator.wikimedia.org/T186320) [13:41:56] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover to happen" [puppet] - 10https://gerrit.wikimedia.org/r/431568 (https://phabricator.wikimedia.org/T186320) (owner: 10Marostegui) [13:44:54] (03CR) 10Elukey: "New PCC: https://puppet-compiler.wmflabs.org/compiler02/11135/" [puppet] - 10https://gerrit.wikimedia.org/r/431546 (https://phabricator.wikimedia.org/T186567) (owner: 10Elukey) [13:51:20] (03CR) 10Filippo Giunchedi: [C: 031] "Yep LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/431546 (https://phabricator.wikimedia.org/T186567) (owner: 10Elukey) [13:56:42] (03PS1) 10Marostegui: wikitech.my.cnf: Enable innodb_strict_mode [puppet] - 10https://gerrit.wikimedia.org/r/431572 (https://phabricator.wikimedia.org/T150949) [13:57:34] (03CR) 10Marostegui: [C: 032] wikitech.my.cnf: Enable innodb_strict_mode [puppet] - 10https://gerrit.wikimedia.org/r/431572 (https://phabricator.wikimedia.org/T150949) (owner: 10Marostegui) [14:01:06] !log temporarily disabling puppet agents for puppetdb security 
update [14:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:38] (03PS2) 10Addshore: apertium-streamparser: Initial Debian packaging [debs/contenttranslation/apertium-streamparser] - 10https://gerrit.wikimedia.org/r/431553 (owner: 10KartikMistry) [14:11:51] (03CR) 10Addshore: "Removed the Bug: comment as the wrong number has been used" [debs/contenttranslation/apertium-streamparser] - 10https://gerrit.wikimedia.org/r/431553 (owner: 10KartikMistry) [14:12:24] !log puppetdb updates complete — re-enabling puppet agents [14:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:10] !log upgraded prometheus-jmx-exporter to 0.3.0-1 on puppetdb servers [14:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:18] (03PS3) 10KartikMistry: apertium-streamparser: Initial Debian packaging [debs/contenttranslation/apertium-streamparser] - 10https://gerrit.wikimedia.org/r/431553 (https://phabricator.wikimedia.org/T192978) [14:14:56] 10Operations, 10Puppet, 10Analytics, 10Cassandra, and 4 others: Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948#4154846 (10herron) The puppetdb servers have been upgraded to prometheus-jmx-exporter 0.3.0-1 [14:15:15] 10Operations, 10Puppet, 10Analytics, 10Cassandra, and 4 others: Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948#4186804 (10herron) [14:15:31] 10Operations, 10Patch-For-Review: reinstall rdb100[56] with RAID - https://phabricator.wikimedia.org/T140442#4186806 (10Dzahn) This is now the only subtask that keeps T136562 open. @elukey Did anything change about the status of it since the last comment? [14:15:41] PROBLEM - puppet last run on labtestcontrol2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/prometheus/rabbitmq-exporter.yaml] [14:15:47] 10Operations, 10Patch-For-Review: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2338953 (10Dzahn) [14:24:07] (03CR) 10Ladsgroup: BETA ONLY - WikibaseLexeme config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431306 (https://phabricator.wikimedia.org/T184745) (owner: 10Addshore) [14:26:55] (03CR) 10Addshore: BETA ONLY - WikibaseLexeme config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431306 (https://phabricator.wikimedia.org/T184745) (owner: 10Addshore) [14:27:00] (03PS2) 10Bstorm: maintain_kubeusers.pp: use require_package and add python3-yaml [puppet] - 10https://gerrit.wikimedia.org/r/430539 (https://phabricator.wikimedia.org/T190893) (owner: 10Zhuyifei1999) [14:27:13] Amir1: ^^ I'm interegued to know your thoughts :) [14:27:16] *further thoughts [14:28:06] (03CR) 10Bstorm: [C: 032] maintain_kubeusers.pp: use require_package and add python3-yaml [puppet] - 10https://gerrit.wikimedia.org/r/430539 (https://phabricator.wikimedia.org/T190893) (owner: 10Zhuyifei1999) [14:29:00] addshore: on which one :D [14:29:12] the Wikibaselexeme config patch [14:29:18] :D [14:30:15] okay, I will write more :D [14:30:39] !log imarlier@tin Started deploy [performance/coal@50fe0dd]: Coal version that uses the graphite API to fetch data, instead of reading directly from whisper files [14:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:52] !log imarlier@tin Finished deploy [performance/coal@50fe0dd]: Coal version that uses the graphite API to fetch data, instead of reading directly from whisper files (duration: 00m 13s) [14:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:17] (03CR) 10Filippo Giunchedi: "> Patch Set 2: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/429225 (https://phabricator.wikimedia.org/T183454) 
(owner: 10Filippo Giunchedi) [14:33:36] Amir1: yes, lets discuss, as I'm going to end up doing the config this week :) [14:34:20] sure [14:40:41] 10Operations, 10ops-codfw, 10Traffic, 10netops: Interface errors on asw-d-codfw:xe-2/0/47 - https://phabricator.wikimedia.org/T193677#4186910 (10Papaul) @BBlack please let me know when you have time to work on this. Thanks. [14:56:11] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431581 [14:56:15] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431581 [14:56:53] (03PS1) 10Bstorm: Revert "wiki replicas: depool labsdb1010 for MCR changes" [puppet] - 10https://gerrit.wikimedia.org/r/431582 [14:58:01] (03PS1) 10Imarlier: coal: require requests module; deploy to webperf [puppet] - 10https://gerrit.wikimedia.org/r/431583 (https://phabricator.wikimedia.org/T159354) [14:58:18] (03PS2) 10Marostegui: Revert "wiki replicas: depool labsdb1010 for MCR changes" [puppet] - 10https://gerrit.wikimedia.org/r/431582 (owner: 10Bstorm) [14:58:54] (03CR) 10Marostegui: [C: 032] Revert "wiki replicas: depool labsdb1010 for MCR changes" [puppet] - 10https://gerrit.wikimedia.org/r/431582 (owner: 10Bstorm) [14:59:02] (03PS1) 10Jcrespo: proxysql: require proxysql package installation for module proxysql [puppet] - 10https://gerrit.wikimedia.org/r/431584 (https://phabricator.wikimedia.org/T193919) [14:59:45] (03PS3) 10Filippo Giunchedi: Deprecate Diamond pdns collectors [puppet] - 10https://gerrit.wikimedia.org/r/429224 (https://phabricator.wikimedia.org/T183454) [14:59:45] !log Reload haproxy on dbproxy1010 to repool labsdb1010 - https://phabricator.wikimedia.org/T174047 [14:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:36] (03CR) 10Filippo Giunchedi: [C: 032] Deprecate Diamond pdns collectors [puppet] - 10https://gerrit.wikimedia.org/r/429224 
(https://phabricator.wikimedia.org/T183454) (owner: 10Filippo Giunchedi) [15:06:21] 10Operations, 10Traffic, 10Goal: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555#4186976 (10Vgutierrez) After completing T193376 and analyzing the gathered data, we've got the following results for 24h of traffic data beginning at 2018-05-03 16:57: * 46%... [15:11:57] 10Operations, 10Traffic, 10Goal: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555#4186984 (10Vgutierrez) [15:12:42] (03PS5) 10Andrew Bogott: vim: don't use Stretch's default, infuriating mouse mode [puppet] - 10https://gerrit.wikimedia.org/r/430937 [15:14:48] (03CR) 10Filippo Giunchedi: [C: 04-1] "> Patch Set 2: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/429221 (https://phabricator.wikimedia.org/T183454) (owner: 10Filippo Giunchedi) [15:15:01] (03PS1) 10Ottomata: no-op: Add $enabled parameter to profile::kafka::mirror [puppet] - 10https://gerrit.wikimedia.org/r/431587 (https://phabricator.wikimedia.org/T167039) [15:15:03] (03PS1) 10Ottomata: Stop all main MirrorMaker during Kafka main upgrade [puppet] - 10https://gerrit.wikimedia.org/r/431588 (https://phabricator.wikimedia.org/T167039) [15:15:32] (03CR) 10jerkins-bot: [V: 04-1] no-op: Add $enabled parameter to profile::kafka::mirror [puppet] - 10https://gerrit.wikimedia.org/r/431587 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [15:18:21] (03PS2) 10Ottomata: no-op: Add $enabled parameter to profile::kafka::mirror [puppet] - 10https://gerrit.wikimedia.org/r/431587 (https://phabricator.wikimedia.org/T167039) [15:20:40] 10Operations: Ship host syslogs to ELK - https://phabricator.wikimedia.org/T193766#4187026 (10herron) >>! In T193766#4186575, @fgiunchedi wrote: >Given our requirements (i.e. missed syslogs are not a critical event) and the additional dependencies/complexity on both server and clients I believe syslog-tls is a b... 
[15:22:06] (03PS2) 10Imarlier: coal: require requests module; deploy to webperf [puppet] - 10https://gerrit.wikimedia.org/r/431583 (https://phabricator.wikimedia.org/T159354) [15:25:02] 10Operations, 10ops-codfw: Degraded RAID on wasat - https://phabricator.wikimedia.org/T193394#4187041 (10Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below. Your request is bein... [15:26:40] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431581 (owner: 10Marostegui) [15:27:47] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431581 (owner: 10Marostegui) [15:28:57] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1084 after alter table (duration: 00m 57s) [15:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:36] (03PS1) 10Marostegui: db-eqiad.php: Depool db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431592 (https://phabricator.wikimedia.org/T190148) [15:29:38] (03PS3) 10Imarlier: coal: require requests module; deploy to webperf [puppet] - 10https://gerrit.wikimedia.org/r/431583 (https://phabricator.wikimedia.org/T159354) [15:29:59] (03PS3) 10Filippo Giunchedi: memcached: deprecate Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/429221 (https://phabricator.wikimedia.org/T183454) [15:30:01] (03PS3) 10Filippo Giunchedi: Deprecate Diamond tcpconnstate and nfconntrackcount [puppet] - 10https://gerrit.wikimedia.org/r/429225 (https://phabricator.wikimedia.org/T183454) [15:30:03] (03PS1) 10Filippo Giunchedi: swift: add prometheus-memcached-exporter [puppet] - 10https://gerrit.wikimedia.org/r/431593 (https://phabricator.wikimedia.org/T147326) [15:30:05] (03PS1) 10Filippo Giunchedi: thumbor: add prometheus-memcached-exporter [puppet] 
- 10https://gerrit.wikimedia.org/r/431594 (https://phabricator.wikimedia.org/T147326) [15:30:07] (03PS1) 10Filippo Giunchedi: striker: add prometheus-memcached-exporter [puppet] - 10https://gerrit.wikimedia.org/r/431595 (https://phabricator.wikimedia.org/T147326) [15:31:19] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431592 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [15:32:29] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431592 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [15:34:03] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1097:3314 for alter table (duration: 01m 00s) [15:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:08] !log Deploy schema change on db1097:3314 - T191519 T188299 T190148 [15:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:14] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [15:34:14] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [15:34:14] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [15:37:30] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/11137/kafka1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/431587 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [15:37:35] (03PS3) 10Ottomata: no-op: Add $enabled parameter to profile::kafka::mirror [puppet] - 10https://gerrit.wikimedia.org/r/431587 (https://phabricator.wikimedia.org/T167039) [15:37:49] (03CR) 10Ottomata: [V: 032 C: 032] no-op: Add $enabled parameter to profile::kafka::mirror [puppet] - 10https://gerrit.wikimedia.org/r/431587 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) 
[15:38:20] 10Operations, 10Mail, 10Patch-For-Review: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#4187136 (10herron) Would love some feedback on the patches above. In particular, are there any reservations about using `standard::mail::sender` to configure an exim listener on `localhost:... [15:38:27] 10Puppet, 10Analytics-Kanban, 10Patch-For-Review: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#4187139 (10mforns) [15:38:50] (03PS4) 10Filippo Giunchedi: Deprecate Diamond tcpconnstate and nfconntrackcount [puppet] - 10https://gerrit.wikimedia.org/r/429225 (https://phabricator.wikimedia.org/T183454) [15:38:52] (03PS2) 10Filippo Giunchedi: swift: add prometheus-memcached-exporter [puppet] - 10https://gerrit.wikimedia.org/r/431593 (https://phabricator.wikimedia.org/T147326) [15:38:54] (03PS2) 10Filippo Giunchedi: thumbor: add prometheus-memcached-exporter [puppet] - 10https://gerrit.wikimedia.org/r/431594 (https://phabricator.wikimedia.org/T147326) [15:38:56] (03PS2) 10Filippo Giunchedi: striker: add prometheus-memcached-exporter [puppet] - 10https://gerrit.wikimedia.org/r/431595 (https://phabricator.wikimedia.org/T147326) [15:38:58] (03PS4) 10Filippo Giunchedi: memcached: deprecate Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/429221 (https://phabricator.wikimedia.org/T183454) [15:42:35] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/11141/" [puppet] - 10https://gerrit.wikimedia.org/r/431594 (https://phabricator.wikimedia.org/T147326) (owner: 10Filippo Giunchedi) [15:42:45] (03PS4) 10Imarlier: coal: require requests module; deploy to webperf [puppet] - 10https://gerrit.wikimedia.org/r/431583 (https://phabricator.wikimedia.org/T159354) [15:42:49] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/11139/" [puppet] - 10https://gerrit.wikimedia.org/r/431593 (https://phabricator.wikimedia.org/T147326) 
(owner: 10Filippo Giunchedi) [15:53:08] 10Operations, 10Performance-Team, 10Patch-For-Review: Move coal from graphite#001 nodes to webperf#001 - https://phabricator.wikimedia.org/T159354#4187197 (10Imarlier) Ops - would appreciate a merge on the patch above ^ [15:53:15] (03CR) 10Imarlier: "https://puppet-compiler.wmflabs.org/compiler03/11140/ - puppet compiler run looks good." [puppet] - 10https://gerrit.wikimedia.org/r/431583 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [15:54:09] (03PS1) 10Ottomata: Ensure confluent package systemd units are disabled [puppet] - 10https://gerrit.wikimedia.org/r/431599 (https://phabricator.wikimedia.org/T167039) [15:54:47] (03CR) 10Anomie: Raise Scribunto maxLangCacheSize to 200 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430068 (https://phabricator.wikimedia.org/T85461) (owner: 10Anomie) [15:54:56] (03PS2) 10Anomie: Raise Scribunto maxLangCacheSize to 200 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430068 (https://phabricator.wikimedia.org/T85461) [15:59:45] marlier: sure thing, looking at https://gerrit.wikimedia.org/r/431583 now [15:59:59] (03CR) 10Alexandros Kosiaris: [C: 032] coal: require requests module; deploy to webperf [puppet] - 10https://gerrit.wikimedia.org/r/431583 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [16:00:06] (03CR) 10Alexandros Kosiaris: [C: 032] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/431583 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [16:00:10] (03CR) 10Krinkle: "LGTM, but site on webperf might not work right away, seems non-fatal though, so seems fine to ignore even if so." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/431583 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [16:00:33] herron: thanks! [16:01:47] haha akosiaris got there first! 
[16:02:16] (03PS1) 10Alexandros Kosiaris: Revert "coal: require requests module; deploy to webperf" [puppet] - 10https://gerrit.wikimedia.org/r/431601 [16:02:23] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Revert "coal: require requests module; deploy to webperf" [puppet] - 10https://gerrit.wikimedia.org/r/431601 (owner: 10Alexandros Kosiaris) [16:02:49] marlier: I had to revert :-( [16:02:59] Error 500 on SERVER: Server Error: Invalid relationship: Git::Clone[performance/docroot] { notify => Service[apache2] }, because Service[apache2] doesn't seem to be in the catalog [16:03:15] Fun, I just commented about that, thinking it's non-fatal. [16:03:19] Noticed it as warning in the puppet compiler output [16:03:30] 10Operations, 10Analytics, 10Patch-For-Review: Puppet admin module should support adding system users to managed groups - https://phabricator.wikimedia.org/T174465#4187253 (10mforns) [16:03:33] 10Operations, 10Analytics, 10Patch-For-Review: Puppet admin module should support adding system users to managed groups - https://phabricator.wikimedia.org/T174465#3562875 (10mforns) p:05Normal>03Low [16:03:39] 10Operations, 10Analytics, 10Patch-For-Review: Puppet admin module should support adding system users to managed groups - https://phabricator.wikimedia.org/T174465#3562875 (10mforns) p:05Low>03Normal [16:03:42] But it breaks the puppet master's catalog apparently? [16:04:10] some things that break on master or the client pass in the compiler; it's not a complete validation. [16:04:18] (in general) [16:04:28] Ugh, sorry [16:04:52] herron: thanks, sorry about that [16:05:06] I'll track that down [16:05:08] marlier: I guess it works currently because something else is ensuring it for graphite1001 [16:05:19] Krinkle exactly [16:05:22] Nice that we found it though :) [16:06:11] PROBLEM - puppet last run on webperf2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:06:12] PROBLEM - puppet last run on webperf1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:06:25] Would have preferred to find it before it broke things, all things considered. But yeah, better to know about implied deps [16:07:11] Those icinga alerts will fix themselves on next run, right? [16:08:15] marlier: Yeah, it didn't do anything on a server besides the puppet master. [16:08:31] puppet does a dry run validation (a better one than the puppet compiler job) before doing anything [16:08:39] (I think) [16:09:30] Oh, those icinga ones, right, servers pulling puppet master will fail early with a no-op, but should recover given the revert. [16:11:11] RECOVERY - puppet last run on webperf2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:11:21] RECOVERY - puppet last run on webperf1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:12:14] 10Operations, 10ops-codfw, 10DC-Ops: rigel.frack.codfw.wmnet (fundraising codfw bastion) will not boot after a power cycle - https://phabricator.wikimedia.org/T193891#4187271 (10Papaul) a:05Papaul>03Jgreen @Jgreen Power drained on the server and update all the firmwares as well. Let me know if you need... 
[16:12:24] (03CR) 10BryanDavis: [C: 031] toolforge k8s: allow /etc/wmcs-project to be mounted [puppet] - 10https://gerrit.wikimedia.org/r/431285 (https://phabricator.wikimedia.org/T190893) (owner: 10Zhuyifei1999) [16:21:16] (03PS3) 10Andrew Bogott: toolforge k8s: allow /etc/wmcs-project to be mounted [puppet] - 10https://gerrit.wikimedia.org/r/431285 (https://phabricator.wikimedia.org/T190893) (owner: 10Zhuyifei1999) [16:21:55] (03CR) 10Andrew Bogott: [C: 032] toolforge k8s: allow /etc/wmcs-project to be mounted [puppet] - 10https://gerrit.wikimedia.org/r/431285 (https://phabricator.wikimedia.org/T190893) (owner: 10Zhuyifei1999) [16:23:02] !log sbisson@tin Started deploy [kartotherian/deploy@9935fdb]: Kartotherian: remove temporary and unused osm-intl-i18n source [16:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:50] (03PS3) 10Andrew Bogott: maintain-kubeusers.systemd: get the project name from @labsproject [puppet] - 10https://gerrit.wikimedia.org/r/431110 (https://phabricator.wikimedia.org/T190893) (owner: 10Zhuyifei1999) [16:25:43] (03CR) 10Andrew Bogott: [C: 032] maintain-kubeusers.systemd: get the project name from @labsproject [puppet] - 10https://gerrit.wikimedia.org/r/431110 (https://phabricator.wikimedia.org/T190893) (owner: 10Zhuyifei1999) [16:26:30] !log sbisson@tin Finished deploy [kartotherian/deploy@9935fdb]: Kartotherian: remove temporary and unused osm-intl-i18n source (duration: 03m 28s) [16:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:22] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431592 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [16:32:12] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431539 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [16:32:31] PROBLEM - MegaRAID on analytics1032 is CRITICAL: 
CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough [16:33:29] this is the second time today, the BBU might not be healthy [16:33:56] (03CR) 10jenkins-bot: Enable on Marathi Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429763 (https://phabricator.wikimedia.org/T193371) (owner: 10Urbanecm) [16:34:26] James_F: so I'm being told you cannot join -dev? Do you get any error message? [16:34:41] (03CR) 10jenkins-bot: Make test and test2 using maps i18n correctly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431561 (https://phabricator.wikimedia.org/T191655) (owner: 10Sbisson) [16:35:07] Hauskatze: Just the usual "Cannot join channel (+b) - you are banned". [16:35:07] elukey: try a learning cycle [16:35:24] James_F: even now? [16:35:27] :-/ [16:35:31] (03PS1) 10Bstorm: wiki replicas: Depool labsdb1011 for MCR table changes [puppet] - 10https://gerrit.wikimedia.org/r/431605 (https://phabricator.wikimedia.org/T174047) [16:35:40] (03CR) 10jenkins-bot: Set wgKartographerEnableMapFrame to true for mrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431564 (https://phabricator.wikimedia.org/T193371) (owner: 10Urbanecm) [16:35:56] Hauskatze: Yeah. Don't worry about it, the bots will just have to shout into the void without me for a bit.
:-) [16:36:09] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431543 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:36:27] (03CR) 10Marostegui: [C: 032] wiki replicas: Depool labsdb1011 for MCR table changes [puppet] - 10https://gerrit.wikimedia.org/r/431605 (https://phabricator.wikimedia.org/T174047) (owner: 10Bstorm) [16:36:44] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431581 (owner: 10Marostegui) [16:37:09] (03CR) 10Dzahn: "aha! yea, i didn't reboot after applying the role and didn't realize that. gotcha that it's not needed then. though... wouldn't it still b" [puppet] - 10https://gerrit.wikimedia.org/r/431057 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [16:37:48] !log Reload haproxy on dbproxy1010 to depool labsdb1011 - https://phabricator.wikimedia.org/T174047 [16:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:25] (03CR) 10Eevans: [C: 031] role::aqs: deprecate cassandra-metrics-collector [puppet] - 10https://gerrit.wikimedia.org/r/431546 (https://phabricator.wikimedia.org/T186567) (owner: 10Elukey) [16:43:18] jynus: ack thanks! [16:44:16] elukey: that is like putting batteries on the fridge- not really a solution, but worth a try? [16:44:48] yes definitely [16:52:04] (03PS2) 10Dzahn: network: add mwmaint1001 to network constants [puppet] - 10https://gerrit.wikimedia.org/r/430522 (https://phabricator.wikimedia.org/T192092) [16:52:42] RECOVERY - MegaRAID on analytics1032 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy [16:54:06] (03CR) 10Zhuyifei1999: "I think something broke here:" [puppet] - 10https://gerrit.wikimedia.org/r/430539 (https://phabricator.wikimedia.org/T190893) (owner: 10Zhuyifei1999) [16:56:54] (03CR) 10Dzahn: [C: 031] "thanks for adding this! 
i'll second what Volans said, lgtm but please test" [puppet] - 10https://gerrit.wikimedia.org/r/430079 (owner: 10Ottomata) [16:57:35] !log executed sudo megacli -AdpBbuCmd -BbuLearn -aALL -NoLog on analytics1032 - BBU alerts flapping [16:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:01] jynus: and then i had to check on the battery/fridge thing :) Energizer and Duracell apparently say not to do it :) [17:00:04] gehel: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikidata Query Service weekly deploy . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180507T1700). [17:00:14] jouncebot: o/ [17:01:02] (03CR) 10Dzahn: [C: 032] network: add mwmaint1001 to network constants [puppet] - 10https://gerrit.wikimedia.org/r/430522 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [17:01:31] mutante: Energizer and Duracell are part of the Big Battery conspiracy, clearly [17:01:48] hahaa ;) [17:03:29] !log gehel@tin Started deploy [wdqs/wdqs@bd4b3ed]: new wdqs GUI, updater and blazegraph [17:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:40] (03PS1) 10Krinkle: mtail: Add test case from current varnishncsa sample [puppet] - 10https://gerrit.wikimedia.org/r/431608 (https://phabricator.wikimedia.org/T184942) [17:04:09] (03CR) 10jerkins-bot: [V: 04-1] mtail: Add test case from current varnishncsa sample [puppet] - 10https://gerrit.wikimedia.org/r/431608 (https://phabricator.wikimedia.org/T184942) (owner: 10Krinkle) [17:05:07] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4187502 (10Krinkle) @Vgutierrez @ema I'm working on using the Prometheus metrics for the ResourceLoader dashboards but running into an issue with the `va... 
[17:05:11] PROBLEM - BGP status on cr1-eqsin is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active, AS2914/IPv6: Active, AS2914/IPv4: Active, AS1299/IPv4: Active [17:06:51] (03PS2) 10Krinkle: mtail: Update a /w/load.php test case from a current varnishncsa sample [puppet] - 10https://gerrit.wikimedia.org/r/431608 (https://phabricator.wikimedia.org/T184942) [17:07:40] (03CR) 10jerkins-bot: [V: 04-1] mtail: Update a /w/load.php test case from a current varnishncsa sample [puppet] - 10https://gerrit.wikimedia.org/r/431608 (https://phabricator.wikimedia.org/T184942) (owner: 10Krinkle) [17:08:21] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 313 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/11645085/#!map [17:08:47] !log gehel@tin Finished deploy [wdqs/wdqs@bd4b3ed]: new wdqs GUI, updater and blazegraph (duration: 05m 18s) [17:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:11] PROBLEM - PyBal backends health check on lvs5002 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec6_53: Servers dns5002.wikimedia.org are marked down but pooled [17:09:26] (03PS3) 10Krinkle: mtail: Update a /w/load.php test case from a current varnishncsa sample [puppet] - 10https://gerrit.wikimedia.org/r/431608 (https://phabricator.wikimedia.org/T184942) [17:09:38] SMalyshev: wdqs deployment completed, tests are green [17:09:52] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 243 probes of 301 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [17:09:52] gehel: cool, thanks! [17:10:15] (03CR) 10jerkins-bot: [V: 04-1] mtail: Update a /w/load.php test case from a current varnishncsa sample [puppet] - 10https://gerrit.wikimedia.org/r/431608 (https://phabricator.wikimedia.org/T184942) (owner: 10Krinkle) [17:10:57] Cool if I deploy a security related thing? 
[17:11:14] (03CR) 10Krinkle: "Updated the asserted expectation of the unrelated H2 test case, but the INM test is still failing. I guess the inm .* regex is matching to" [puppet] - 10https://gerrit.wikimedia.org/r/431608 (https://phabricator.wikimedia.org/T184942) (owner: 10Krinkle) [17:11:36] (03PS1) 10Catrope: Remove unused wgKartographerDfltStyle setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431609 [17:12:01] 10Operations, 10Patch-For-Review: reinstall rdb100[56] with RAID - https://phabricator.wikimedia.org/T140442#4187506 (10elukey) A ton of job have been moved to Kafka, but there's still work to be done. In theory when T190327 is finished we could easily do the work :) [17:12:21] PROBLEM - PyBal backends health check on lvs5003 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec_53_udp: Servers dns5001.wikimedia.org are marked down but pooled [17:12:22] RECOVERY - PyBal backends health check on lvs5002 is OK: PYBAL OK - All pools are healthy [17:13:12] 10Operations, 10ops-eqiad, 10DBA, 10decommission, 10Patch-For-Review: Decommission db1039 - https://phabricator.wikimedia.org/T184262#4187512 (10Cmjohnson) [17:13:29] 10Operations, 10DBA, 10decommission, 10Goal: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4187514 (10Cmjohnson) [17:13:36] 10Operations, 10ops-eqiad, 10DBA, 10decommission, 10Patch-For-Review: Decommission db1039 - https://phabricator.wikimedia.org/T184262#3877626 (10Cmjohnson) 05Open>03Resolved [17:14:38] 10Operations, 10DBA, 10decommission, 10Goal: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3297517 (10Cmjohnson) [17:14:41] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1034 - https://phabricator.wikimedia.org/T182556#4187527 (10Cmjohnson) 05Open>03Resolved [17:15:00] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1034 - https://phabricator.wikimedia.org/T182556#3827026 (10Cmjohnson) [17:19:01] PROBLEM - PyBal 
backends health check on lvs5002 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec6_53: Servers dns5001.wikimedia.org are marked down but pooled [17:20:41] RECOVERY - BGP status on cr1-eqsin is OK: BGP OK - up: 211, down: 6, shutdown: 0 [17:22:00] (03CR) 10Dzahn: [C: 031] "thanks! confirmed on mwmaint1001 it did not try to install the package" [puppet] - 10https://gerrit.wikimedia.org/r/431584 (https://phabricator.wikimedia.org/T193919) (owner: 10Jcrespo) [17:22:11] RECOVERY - PyBal backends health check on lvs5003 is OK: PYBAL OK - All pools are healthy [17:22:31] (03PS1) 10Catrope: Enable $wgKartographerUsePageLanguage everywhere in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431612 (https://phabricator.wikimedia.org/T112948) [17:22:50] (03CR) 10Catrope: [C: 032] Enable $wgKartographerUsePageLanguage everywhere in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431612 (https://phabricator.wikimedia.org/T112948) (owner: 10Catrope) [17:23:39] jynus: btw, another thing related to terbium. 
it is the webserver backend for dbtree [17:23:53] i have a patch to switch that over to the new host as well but haven't tested the websites yet [17:24:02] just applied puppet role so far [17:24:09] (03Merged) 10jenkins-bot: Enable $wgKartographerUsePageLanguage everywhere in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431612 (https://phabricator.wikimedia.org/T112948) (owner: 10Catrope) [17:24:14] backend behind misc-web that is [17:24:26] (03CR) 10jenkins-bot: Enable $wgKartographerUsePageLanguage everywhere in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431612 (https://phabricator.wikimedia.org/T112948) (owner: 10Catrope) [17:25:14] 10Operations, 10User-fgiunchedi: mw1230 sdb "Raw_Read_Error_Rate" SMART - https://phabricator.wikimedia.org/T194036#4187562 (10Dzahn) p:05Triage>03Normal [17:25:32] PROBLEM - PyBal backends health check on lvs5003 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec6_53: Servers dns5002.wikimedia.org are marked down but pooled [17:25:41] RECOVERY - PyBal backends health check on lvs5002 is OK: PYBAL OK - All pools are healthy [17:26:41] RECOVERY - PyBal backends health check on lvs5003 is OK: PYBAL OK - All pools are healthy [17:27:45] (03CR) 10Ottomata: [V: 032 C: 032] Ensure confluent package systemd units are disabled [puppet] - 10https://gerrit.wikimedia.org/r/431599 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [17:27:48] (03PS2) 10Ottomata: Ensure confluent package systemd units are disabled [puppet] - 10https://gerrit.wikimedia.org/r/431599 (https://phabricator.wikimedia.org/T167039) [17:27:50] (03CR) 10Ottomata: [V: 032 C: 032] Ensure confluent package systemd units are disabled [puppet] - 10https://gerrit.wikimedia.org/r/431599 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [17:27:53] (03PS1) 10Imarlier: coal: require requests module; deploy to webperf [puppet] - 10https://gerrit.wikimedia.org/r/431615 (https://phabricator.wikimedia.org/T159354) [17:28:30] (03CR) 
10jerkins-bot: [V: 04-1] coal: require requests module; deploy to webperf [puppet] - 10https://gerrit.wikimedia.org/r/431615 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [17:28:52] PROBLEM - PyBal backends health check on lvs5002 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec6_53_udp: Servers dns5002.wikimedia.org are marked down but pooled [17:29:19] ^ FYI can ignore these lvs5* problems (or other eqsin problems that may pop up in general) [17:29:35] eqsin has been depooled from service since Friday over router issues, still being worked on [17:29:44] !log updating f/w lvs1016 [17:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:01] RECOVERY - PyBal backends health check on lvs5002 is OK: PYBAL OK - All pools are healthy [17:30:06] i figured that was what it was about in eqsin. thanks for confirming [17:34:13] (03CR) 10Daniel Kinzler: "So, what is this blocked on? Agreement on the ticket?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421122 (https://phabricator.wikimedia.org/T190015) (owner: 10Gergő Tisza) [17:34:22] (03CR) 10Giuseppe Lavagetto: [C: 032] vim: don't use Stretch's default, infuriating mouse mode [puppet] - 10https://gerrit.wikimedia.org/r/430937 (owner: 10Andrew Bogott) [17:34:40] !log bawolff@tin Synchronized php-1.32.0-wmf.2/extensions/LoginNotify/includes/Hooks.php: https://gerrit.wikimedia.org/r/#/c/431611/ Do not send loginnotify emails for throttled logins (duration: 01m 08s) [17:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:41] PROBLEM - puppet last run on kafka1013 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 4 minutes ago with 3 failures. Failed resources (up to 3 shown): Service[confluent-kafka],Service[confluent-kafka-connect],Service[confluent-zookeeper] [17:36:35] ^ should be fixed.... oh kafka1013!? [17:36:38] that shouldn't happen, checking....
[17:37:32] PROBLEM - puppet last run on kafka1022 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 4 minutes ago with 3 failures. Failed resources (up to 3 shown): Service[confluent-kafka],Service[confluent-kafka-connect],Service[confluent-zookeeper] [17:40:22] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 11 probes of 301 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [17:40:45] (03PS2) 10Imarlier: coal: require requests module; deploy to webperf [puppet] - 10https://gerrit.wikimedia.org/r/431615 (https://phabricator.wikimedia.org/T159354) [17:41:16] (03CR) 10jerkins-bot: [V: 04-1] coal: require requests module; deploy to webperf [puppet] - 10https://gerrit.wikimedia.org/r/431615 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [17:41:25] (03PS1) 10Ottomata: Ensure scala and kafka version on kafka analytics-eqiad cluster [puppet] - 10https://gerrit.wikimedia.org/r/431617 [17:42:06] (03PS1) 10Zhuyifei1999: Unbreak maintain_kubeusers.pp from dependency cycle [puppet] - 10https://gerrit.wikimedia.org/r/431618 (https://phabricator.wikimedia.org/T190893) [17:43:00] (03PS3) 10Imarlier: coal: require requests module; deploy to webperf [puppet] - 10https://gerrit.wikimedia.org/r/431615 (https://phabricator.wikimedia.org/T159354) [17:43:52] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 0 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/11645085/#!map [17:44:13] anyone else having trouble sshing in this morning? 
[17:45:04] (03CR) 10Ottomata: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11148/kafka1013.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/431617 (owner: 10Ottomata) [17:46:46] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1029 and db1031 - https://phabricator.wikimedia.org/T184054#4187651 (10Cmjohnson) [17:46:52] (03PS2) 10Andrew Bogott: Unbreak maintain_kubeusers.pp from dependency cycle [puppet] - 10https://gerrit.wikimedia.org/r/431618 (https://phabricator.wikimedia.org/T190893) (owner: 10Zhuyifei1999) [17:47:43] (03CR) 10Andrew Bogott: [C: 032] Unbreak maintain_kubeusers.pp from dependency cycle [puppet] - 10https://gerrit.wikimedia.org/r/431618 (https://phabricator.wikimedia.org/T190893) (owner: 10Zhuyifei1999) [17:47:51] PROBLEM - puppet last run on kafka1020 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 5 minutes ago with 3 failures. Failed resources (up to 3 shown): Service[confluent-kafka],Service[confluent-kafka-connect],Service[confluent-zookeeper] [17:49:33] (03CR) 10Zhuyifei1999: "Yep, puppet is now working" [puppet] - 10https://gerrit.wikimedia.org/r/431618 (https://phabricator.wikimedia.org/T190893) (owner: 10Zhuyifei1999) [17:50:38] (03PS4) 10Imarlier: coal: require requests module; deploy to webperf [puppet] - 10https://gerrit.wikimedia.org/r/431615 (https://phabricator.wikimedia.org/T159354) [17:51:01] RECOVERY - puppet last run on kafka1013 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:53:57] 10Operations, 10Traffic, 10netops: cr1-eqsin 4 onboard interfaces down - https://phabricator.wikimedia.org/T193897#4187668 (10ayounsi) Current troubleshooting actions based on JTAC suggested next step: ```lang=diff [edit system] - commit synchronize; [edit chassis redundancy] - graceful-switchover; [ed... 
[17:58:26] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4187673 (10Cmjohnson) @vguiterrez I updated the firmware on lvs1016 [18:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy Morning SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180507T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:00:36] * Niharika pets jouncebot [18:02:28] (03Abandoned) 10Dzahn: add mgmt DNS for nihonium, new eqiad maintenance server [dns] - 10https://gerrit.wikimedia.org/r/426295 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [18:03:11] Niharika: if still allowed, got a patch for swat but I have not added it due to laziness :| [18:03:25] Hauskatze: Sure! [18:03:37] otoh, got to go in ~10 minutes so maybe another time [18:03:45] No worries. [18:04:02] Niharika: all good with the renames? [18:04:07] (03CR) 10Dzahn: [C: 032] "better/more explanation in https://gerrit.wikimedia.org/r/#/c/416981/ - i did the exact same thing twice :p" [puppet] - 10https://gerrit.wikimedia.org/r/431047 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [18:04:15] Hauskatze: Yep, thank you for your help.
[18:04:20] (03Abandoned) 10Dzahn: mediawiki_maintenance: activate crons based on fqdn, not mw_primary [puppet] - 10https://gerrit.wikimedia.org/r/416981 (owner: 10Dzahn) [18:04:30] I'll reply to you on meta but certainly having UserMerge fixed to work on WMF would be of great help [18:06:18] (03PS2) 10Dzahn: cache::misc: add apache-fast-test script [puppet] - 10https://gerrit.wikimedia.org/r/423557 [18:08:02] RECOVERY - puppet last run on kafka1022 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:09:44] (03PS3) 10Dzahn: superset: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/416742 [18:09:44] robh papaul ^ for some reason I can't ssh into the cluster this morning [18:10:20] maybe my ip is blacklisted? I'm at a café [18:11:14] (03PS5) 10Imarlier: coal: require requests module; deploy to webperf [puppet] - 10https://gerrit.wikimedia.org/r/431615 (https://phabricator.wikimedia.org/T159354) [18:11:31] AndyRussG: what hostname are you trying to connect to? [18:11:43] (03CR) 10jerkins-bot: [V: 04-1] coal: require requests module; deploy to webperf [puppet] - 10https://gerrit.wikimedia.org/r/431615 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [18:12:24] AndyRussG: and which bastion host do you use in your ssh config please [18:12:29] mutante: I've tried several... Wishing to go to stat1004 or stat1002, but can't get to tin or bastion [18:12:49] mutante: it used to be bast1001 but I just switched it to bast2001, to no avail [18:12:52] what is the error you are getting? [18:13:00] bast1001 is outdated [18:13:06] ssh_exchange_identification: Connection closed by remote host [18:13:12] first it hangs for a while, tho [18:13:47] (03PS6) 10Imarlier: coal: require requests module; deploy to webperf [puppet] - 10https://gerrit.wikimedia.org/r/431615 (https://phabricator.wikimedia.org/T159354) [18:14:06] AndyRussG: bast1001 does not exist anymore, replaced by bast1002.
bast2001 is correct and should work but i don't see a failed attempt from your user to login at all [18:15:03] mutante: hmmm... was just trying with 2001 [18:15:06] now trying 1002 [18:15:08] AndyRussG: use the one closest to you on the map. https://wikitech.wikimedia.org/wiki/Bastion [18:16:30] AndyRussG: can you PM me your IP address [18:16:36] mutante: right... wait, hmmm it's still trying on bast1001 for some reason [18:16:38] one sec [18:16:51] is it the same as your IRC user? [18:17:24] ah, ok, yep [18:18:07] mutante: oops yeah I didn't update the bastion in the right line in my ssh config.... apologies!!! [18:18:21] RECOVERY - puppet last run on kafka1020 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:18:46] AndyRussG: no worries, glad it works. btw here are the fingerprints for the new one https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/bast1002.wikimedia.org [18:19:03] (03CR) 10Imarlier: "Compiler run shows that apache2 package and service are now being installed as expected on webperf1001: https://puppet-compiler.wmflabs.or" [puppet] - 10https://gerrit.wikimedia.org/r/431615 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [18:19:12] mutante: ah yeah I was gonna ask... thanks so much!!!! [18:19:16] herron: If you have a second, would appreciate a look at https://gerrit.wikimedia.org/r/#/c/431615/ -- same change as before, but now with moar apache2 [18:19:16] yw [18:20:37] thanks for using "httpd" class and not "apache" class :) [18:22:08] (03CR) 10Dzahn: [C: 031] "apache part looks good. thanks for using 'httpd' class vs. 
the old 'apache' module" [puppet] - 10https://gerrit.wikimedia.org/r/431615 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [18:23:17] (03CR) 10Dzahn: [C: 031] "(and it needs to go into the role class, ACK)" [puppet] - 10https://gerrit.wikimedia.org/r/431615 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [18:25:06] (03CR) 10Imarlier: "> (and it needs to go into the role class, ACK)" [puppet] - 10https://gerrit.wikimedia.org/r/431615 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [18:33:32] PROBLEM - MegaRAID on analytics1032 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough [18:39:23] 10Operations, 10ops-codfw, 10DC-Ops: rigel.frack.codfw.wmnet (fundraising codfw bastion) will not boot after a power cycle - https://phabricator.wikimedia.org/T193891#4187745 (10cwdent) @Papaul I tried power cycling just now and the same thing happened again, nothing on vsp and I can't reach it on the normal... [18:39:28] (03PS1) 10Sbisson: Enable maps i18n everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431628 [18:40:15] o/ Hello! I'm trying to enable the role::labs::mediawiki_vagrant in Horizon but if I understand correctly it'll apply to _every_ instance instead of the single instance I want it on. What's the best way to apply this to one machine without mucking up the others? I'm trying to follow the instructions here: https://wikitech.wikimedia.org/wiki/Help:MediaWiki-Vagrant_in_Cloud_VPS. [18:41:22] I've also tried just manually setting up a Vagrant instance but I keep hitting issues with getting a proper LXC config. 
[18:41:51] (03CR) 10Catrope: [C: 031] Enable maps i18n everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431628 (owner: 10Sbisson) [18:42:47] (It'll fail for odd reasons like "There was an error executing lxc-info" but when I run with debug enabled and execute the same command, it seems ok to me. I've tried with and without NFS shares.) [18:48:49] 10Operations, 10ops-codfw, 10DC-Ops: rigel.frack.codfw.wmnet (fundraising codfw bastion) will not boot after a power cycle - https://phabricator.wikimedia.org/T193891#4187787 (10Papaul) @cwdent if power drain and the firmware update didn't work it might be a hardware issue ( ILO interface or main baord) in t... [18:49:43] When I go to openstack-browser, I can see that _one_ of our instances _is_ configured for the mediawiki_vagrant role but I'm not sure how that was ever applied to just the single instance. https://tools.wmflabs.org/openstack-browser/puppetclass/role::labs::mediawiki_vagrant [18:52:33] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4187802 (10Vgutierrez) @Cmjohnson I still see the same FW version from ethtool and same MSI-X: ```name=FW version root@lvs1016:~# ethtool -i enp4s0f0 |grep firmware firmware... [18:54:08] 10Operations, 10ops-codfw, 10DC-Ops: rigel.frack.codfw.wmnet (fundraising codfw bastion) will not boot after a power cycle - https://phabricator.wikimedia.org/T193891#4187815 (10Jgreen) >>! In T193891#4187787, @Papaul wrote: > @cwdent if power drain and the firmware update didn't work it might be a hardware... 
[18:55:31] niedzielski: in the Horizon web UI there are multiple ways to add a role to an instance, one is "project-based" (what you mentioned, applies to all) one is "prefix-based" (you can apply wildcard on the start of a host name) and then you can also click on a single instance and configure it to use the role [18:55:43] niedzielski: more details probably better on the #wikimedia-cloud channel [18:56:14] but yea, click on the instance name and then configure and select the role for just that instance [18:57:30] mutante: perfect. you're a lifesaver! [18:57:47] you're welcome [18:59:40] 10Operations, 10ops-codfw, 10DC-Ops: rigel.frack.codfw.wmnet (fundraising codfw bastion) will not boot after a power cycle - https://phabricator.wikimedia.org/T193891#4187837 (10Papaul) @Jgreen if the ILO card is on the main board, the whole main board will have to be replaced if not only the ILO card will... [19:07:17] mutante: Would you be able to +2 my change from above? [19:24:13] marlier: just affects webperf* right? [19:24:17] doing that now [19:24:24] Exactly, thanks! [19:24:41] i do see the compiler link, all good [19:25:02] (03CR) 10Dzahn: [C: 032] coal: require requests module; deploy to webperf [puppet] - 10https://gerrit.wikimedia.org/r/431615 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [19:26:40] (03CR) 10Krinkle: [C: 031] Raise Scribunto maxLangCacheSize to 200 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430068 (https://phabricator.wikimedia.org/T85461) (owner: 10Anomie) [19:27:25] marlier: merged and ran puppet on the hosts and it looks mostly good but one issue remains maybe [19:27:28] just a sec [19:29:34] mutante: i've updated the docs and the server works! thanks again! 
http://readers-web-master.wmflabs.org/wiki/Main_Page https://wikitech.wikimedia.org/w/index.php?title=Help:MediaWiki-Vagrant_in_Cloud_VPS&oldid=1790649 [19:30:32] (03PS1) 10Dzahn: performance::site: require libapache2-mod-uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/431636 (https://phabricator.wikimedia.org/T159354) [19:30:56] niedzielski: great:) [19:31:33] PROBLEM - puppet last run on webperf2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[ensure_present_mod_uwsgi] [19:31:44] PROBLEM - puppet last run on webperf1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[ensure_present_mod_uwsgi] [19:31:50] mutante: interesting, I would have expected that mod to be automatically required due to the specification of uwsgi as a module... [19:32:24] (03PS2) 10Dzahn: performance::site: require libapache2-mod-uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/431636 (https://phabricator.wikimedia.org/T159354) [19:33:16] marlier: yes, *nod* it currently doesn't do that [19:33:26] another thing we should add to some README, you are right [19:34:53] (03PS3) 10Dzahn: performance::site: require libapache2-mod-uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/431636 (https://phabricator.wikimedia.org/T159354) [19:35:31] (03CR) 10Dzahn: [C: 032] performance::site: require libapache2-mod-uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/431636 (https://phabricator.wikimedia.org/T159354) (owner: 10Dzahn) [19:36:34] RECOVERY - puppet last run on webperf2001 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [19:36:43] RECOVERY - puppet last run on webperf1001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [19:36:52] 10Operations, 10ops-codfw, 10DC-Ops: rigel.frack.codfw.wmnet (fundraising codfw bastion) will not boot after a power cycle - 
https://phabricator.wikimedia.org/T193891#4188027 (10Jgreen) [19:36:52] marlier: ^ so far so good [19:37:01] there is something different now [19:37:06] mutante: sweet! [19:37:41] marlier: Invalid command 'ProxyPass', in Apache config [19:37:51] need to load the proxy module too [19:37:52] Ah, needs another modules [19:38:01] proxy and proxy_html (maybe) [19:38:31] let me try, i'll run the same command puppet would run [19:39:22] yea, just 'proxy' [19:41:22] (03PS1) 10Dzahn: performance::site: load apache mod proxy [puppet] - 10https://gerrit.wikimedia.org/r/431638 (https://phabricator.wikimedia.org/T159354) [19:41:55] (03PS2) 10Dzahn: performance::site: load apache mod proxy [puppet] - 10https://gerrit.wikimedia.org/r/431638 (https://phabricator.wikimedia.org/T159354) [19:42:27] (03CR) 10Dzahn: [C: 032] performance::site: load apache mod proxy [puppet] - 10https://gerrit.wikimedia.org/r/431638 (https://phabricator.wikimedia.org/T159354) (owner: 10Dzahn) [19:42:46] (03CR) 10Imarlier: [C: 031] performance::site: load apache mod proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/431638 (https://phabricator.wikimedia.org/T159354) (owner: 10Dzahn) [19:45:27] (03CR) 10Dzahn: [C: 032] performance::site: load apache mod proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/431638 (https://phabricator.wikimedia.org/T159354) (owner: 10Dzahn) [19:46:07] marlier: so the puppet runs are happy on webperf* now [19:46:15] Awesome! [19:46:18] I'll check it in a bit. [19:46:20] In a meeting [19:46:23] mod_proxy doesnt need a require_package , it should come with apache2 [19:46:34] but had to be enabled [19:47:17] that being said there is this separate package, but i dont think that's the one needed ? [19:47:20] libapache2-mod-proxy-uwsgi - uwsgi proxy module for Apache2 (mod_uwsgi) [19:48:47] " It is a proxy module, so it provides all of the futures exported by mod_proxy. 
[19:50:22] (03PS4) 10Ottomata: Small improvements to the geoip archive script [puppet] - 10https://gerrit.wikimedia.org/r/430067 (https://phabricator.wikimedia.org/T136732) (owner: 10Fdans) [19:50:26] (03CR) 10Ottomata: [V: 032 C: 032] Small improvements to the geoip archive script [puppet] - 10https://gerrit.wikimedia.org/r/430067 (https://phabricator.wikimedia.org/T136732) (owner: 10Fdans) [19:50:33] since you said "per graphite" and when i look at graphite, it loads "proxy" and "uwsgi" (among others that we don't load ) but i think we are good now [19:54:55] 10Operations, 10ops-codfw, 10DC-Ops: rigel.frack.codfw.wmnet (fundraising codfw bastion) will not boot after a power cycle - https://phabricator.wikimedia.org/T193891#4182673 (10RobH) If the mainboard dies on an out of warranty system, we typically decommission the host. We're looking to order more misc sys... [19:58:01] !log removing onboard ports license from cr1-eqsin config - T193897 [19:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:06] T193897: cr1-eqsin 4 onboard interfaces down - https://phabricator.wikimedia.org/T193897 [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Services – Parsoid / Citoid / Mobileapps / ORES / …. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180507T2000). [20:00:40] (03PS1) 10Imarlier: coal-web: needs proxy_http module as well [puppet] - 10https://gerrit.wikimedia.org/r/431644 (https://phabricator.wikimedia.org/T159354) [20:01:44] (03CR) 10Imarlier: "No rush - already did this enable by hand." 
[puppet] - 10https://gerrit.wikimedia.org/r/431644 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [20:03:13] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 104 probes of 301 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [20:06:40] (03CR) 10Dzahn: [C: 032] coal-web: needs proxy_http module as well [puppet] - 10https://gerrit.wikimedia.org/r/431644 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [20:08:03] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 73 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/11645085/#!map [20:08:34] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 8 probes of 301 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [20:13:04] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 1 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/11645085/#!map [20:16:43] Nothing for ORES. [20:17:40] 10Operations, 10Multimedia, 10Traffic: Update Media dashboard in Grafana to use Prometheus metrics - https://phabricator.wikimedia.org/T193445#4188187 (10Imarlier) Hey, Multimedia team -- probably makes the most sense for you to handle this. 
[20:19:11] !log bsitzmann@tin Started deploy [mobileapps/deploy@e20f23d]: Update mobileapps to c1f4de6 (T191538) [20:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:16] T191538: Create a feedback form for PRODUCTION users of the synced reading lists feature - https://phabricator.wikimedia.org/T191538 [20:24:08] (03PS1) 10Ottomata: Set kafka api_version on statsv instance if provided [puppet] - 10https://gerrit.wikimedia.org/r/431651 (https://phabricator.wikimedia.org/T167039) [20:24:21] (03PS2) 10Ottomata: Set kafka api_version on statsv instance if provided [puppet] - 10https://gerrit.wikimedia.org/r/431651 (https://phabricator.wikimedia.org/T167039) [20:24:53] (03CR) 10jerkins-bot: [V: 04-1] Set kafka api_version on statsv instance if provided [puppet] - 10https://gerrit.wikimedia.org/r/431651 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [20:24:55] (03PS3) 10Ottomata: Set kafka api_version on statsv instance if provided [puppet] - 10https://gerrit.wikimedia.org/r/431651 (https://phabricator.wikimedia.org/T167039) [20:25:02] 10Operations, 10Traffic, 10netops: cr1-eqsin 4 onboard interfaces down - https://phabricator.wikimedia.org/T193897#4188239 (10ayounsi) > So you can use either the configuration statement and as long as the configuration active on both REs no affectation should be seeing on license status or use the request s... 
[20:25:21] !log bsitzmann@tin Finished deploy [mobileapps/deploy@e20f23d]: Update mobileapps to c1f4de6 (T191538) (duration: 06m 09s) [20:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:25] T191538: Create a feedback form for PRODUCTION users of the synced reading lists feature - https://phabricator.wikimedia.org/T191538 [20:25:31] (03CR) 10jerkins-bot: [V: 04-1] Set kafka api_version on statsv instance if provided [puppet] - 10https://gerrit.wikimedia.org/r/431651 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [20:26:09] (03PS4) 10Ottomata: Set kafka api_version on statsv instance if provided [puppet] - 10https://gerrit.wikimedia.org/r/431651 (https://phabricator.wikimedia.org/T167039) [20:29:20] !log arlolra@tin Started deploy [parsoid/deploy@cd5e875]: Updating Parsoid to 6e38948 [20:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:28] !log otto@tin Started deploy [statsv/statsv@c186340]: Configure api.version via CLI opt -- prep for Kafka main upgrade T167039 [20:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:33] T167039: Upgrade Kafka on main cluster with security features - https://phabricator.wikimedia.org/T167039 [20:30:33] !log otto@tin Finished deploy [statsv/statsv@c186340]: Configure api.version via CLI opt -- prep for Kafka main upgrade T167039 (duration: 00m 05s) [20:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:02] (03PS5) 10Ottomata: Set kafka api_version on statsv instance if provided [puppet] - 10https://gerrit.wikimedia.org/r/431651 (https://phabricator.wikimedia.org/T167039) [20:33:03] 10Operations, 10Analytics, 10Analytics-Kanban, 10EventBus, and 2 others: Kafka API negotiation errors on kafka main brokers - https://phabricator.wikimedia.org/T193238#4188269 (10Ottomata) FYI, had to do https://gerrit.wikimedia.org/r/#/c/431646/ to do Kafka upgrade. 
[20:33:29] (03CR) 10Ottomata: [C: 032] Set kafka api_version on statsv instance if provided [puppet] - 10https://gerrit.wikimedia.org/r/431651 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [20:33:33] (03CR) 10Ottomata: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11152/webperf1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/431651 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [20:34:23] RECOVERY - MegaRAID on analytics1032 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy [20:40:22] (03CR) 10Zhuyifei1999: [C: 032] Mount & load project name dynamically from /etc/wmcs-project [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/430647 (https://phabricator.wikimedia.org/T190893) (owner: 10Zhuyifei1999) [20:41:04] (03Merged) 10jenkins-bot: Mount & load project name dynamically from /etc/wmcs-project [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/430647 (https://phabricator.wikimedia.org/T190893) (owner: 10Zhuyifei1999) [20:41:45] !log arlolra@tin Finished deploy [parsoid/deploy@cd5e875]: Updating Parsoid to 6e38948 (duration: 12m 25s) [20:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:00] !log Updated Parsoid to 6e38948 (T192909) [20:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:08] T192909: Dirty diff and other weird corruption in table edit - https://phabricator.wikimedia.org/T192909 [20:48:39] !log imarlier@tin Started restart [performance/coal@50fe0dd]: Restart coal-web service everywhere, hopefully [20:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:14] (03PS4) 10Catrope: Enable ORES on lvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431038 (https://phabricator.wikimedia.org/T192499) [20:49:23] (03PS3) 10Catrope: Enable ORES on huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/431040 
(https://phabricator.wikimedia.org/T192496) [21:00:04] bawolff and Reedy: (Dis)respected human, time to deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180507T2100). Please do the needful. [21:06:12] (03CR) 10Dzahn: [C: 032] "i see this only influences mw_maintenance, going ahead" [puppet] - 10https://gerrit.wikimedia.org/r/431584 (https://phabricator.wikimedia.org/T193919) (owner: 10Jcrespo) [21:14:01] (03PS1) 10Andrew Bogott: Horizon: add a few more config settings for the upcoming wikimediamemberdashboard [puppet] - 10https://gerrit.wikimedia.org/r/431658 [21:14:29] (03CR) 10jerkins-bot: [V: 04-1] Horizon: add a few more config settings for the upcoming wikimediamemberdashboard [puppet] - 10https://gerrit.wikimedia.org/r/431658 (owner: 10Andrew Bogott) [21:14:43] PROBLEM - MegaRAID on analytics1032 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough [21:20:12] (03PS1) 10Imarlier: performance.wikimedia.org: serve from webperfX001 [puppet] - 10https://gerrit.wikimedia.org/r/431659 (https://phabricator.wikimedia.org/T159354) [21:20:36] (03CR) 10jerkins-bot: [V: 04-1] performance.wikimedia.org: serve from webperfX001 [puppet] - 10https://gerrit.wikimedia.org/r/431659 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [21:22:23] (03PS1) 10Ayounsi: Revert "Depool eqsin because of router issue" [dns] - 10https://gerrit.wikimedia.org/r/431660 [21:22:31] (03PS2) 10Imarlier: performance.wikimedia.org: serve from webperfX001 [puppet] - 10https://gerrit.wikimedia.org/r/431659 (https://phabricator.wikimedia.org/T159354) [21:22:53] (03CR) 10Ayounsi: [C: 032] Revert "Depool eqsin because of router issue" [dns] - 10https://gerrit.wikimedia.org/r/431660 (owner: 10Ayounsi) [21:22:59] (03PS2) 
10Ayounsi: Revert "Depool eqsin because of router issue" [dns] - 10https://gerrit.wikimedia.org/r/431660 [21:25:18] !log re-pool eqsin - T193897 [21:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:22] T193897: cr1-eqsin 4 onboard interfaces down - https://phabricator.wikimedia.org/T193897 [21:28:04] !log mw2216,mw2217,mw2218 - wmf-auto-reimage --conftool , reinstall with stretch [21:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:23] (03PS2) 10Andrew Bogott: Horizon: add a few config settings for the upcoming wikimediamemberdashboard [puppet] - 10https://gerrit.wikimedia.org/r/431658 [21:38:32] (03PS3) 10Ottomata: Kafka main-codfw patch 3 - remove api.version [puppet] - 10https://gerrit.wikimedia.org/r/430640 (https://phabricator.wikimedia.org/T167039) [21:49:20] (03PS1) 10Chad: Gerrit: Preemptively set Gerrit elasticsearch config [puppet] - 10https://gerrit.wikimedia.org/r/431664 [21:58:39] (03CR) 10Paladox: "LGTM to me but one minor erb syntax fix." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/431664 (owner: 10Chad) [21:58:44] (03CR) 10Paladox: [C: 031] Gerrit: Preemptively set Gerrit elasticsearch config [puppet] - 10https://gerrit.wikimedia.org/r/431664 (owner: 10Chad) [22:05:36] (03CR) 10Dzahn: [C: 031] "http://puppet-compiler.wmflabs.org/11154/" [puppet] - 10https://gerrit.wikimedia.org/r/431529 (https://phabricator.wikimedia.org/T193919) (owner: 10Jcrespo) [22:07:22] (03CR) 10Dzahn: [C: 031] "should i deploy it or did you want to do that yourself?" 
[puppet] - 10https://gerrit.wikimedia.org/r/431529 (https://phabricator.wikimedia.org/T193919) (owner: 10Jcrespo) [22:14:24] (03PS3) 10Krinkle: performance.wikimedia.org: serve from webperfX001 [puppet] - 10https://gerrit.wikimedia.org/r/431659 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [22:14:28] (03CR) 10Krinkle: [C: 031] performance.wikimedia.org: serve from webperfX001 [puppet] - 10https://gerrit.wikimedia.org/r/431659 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [22:18:22] (03CR) 10Krinkle: [C: 031] "Confirmed that the following all respond the same way from graphite1001, webperf1001 and webperf2001" [puppet] - 10https://gerrit.wikimedia.org/r/431659 (https://phabricator.wikimedia.org/T159354) (owner: 10Imarlier) [22:23:20] 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#4188556 (10Krinkle) [22:23:54] !log ppchelko@tin Started restart [changeprop/deploy@7e86531]: Restart changeprop to try forcing it rebalancing topics [22:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:24] !log labsdb1009,labsdb1010,labsdb1011 are now on up-to-date views per T174047 [22:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:28] T174047: Provide backwards compatibility views for toolforge replica [MCR] - https://phabricator.wikimedia.org/T174047 [22:31:51] (03PS1) 10Bstorm: Revert "wiki replicas: Depool labsdb1011 for MCR table changes" [puppet] - 10https://gerrit.wikimedia.org/r/431672 [22:53:09] 10Operations, 10Fundraising-Backlog, 10Traffic, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4188625 (10CCogdill_WMF) Thanks for the meeting on Thursday, everyone! I'm following up with IBM about potentially: * getting them to obtain a DV cert *... 
[23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180507T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:14:18] (03PS1) 10Chad: Update non-core plugins to their respective stable-2.14 tips [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/431675 [23:37:58] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2216.codfw.wmnet [23:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:11] !log mw2219,mw2220,mw2221 - reinstall with stretch [23:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:19] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2217.codfw.wmnet [23:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:27] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2218.codfw.wmnet [23:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log