[00:48:02] (PS1) Aklapper: Clarify string in weekly Phabricator Project email [puppet] - https://gerrit.wikimedia.org/r/303500 (https://phabricator.wikimedia.org/T142347)
[00:55:35] (CR) Danny B.: [C: 1] Clarify string in weekly Phabricator Project email [puppet] - https://gerrit.wikimedia.org/r/303500 (https://phabricator.wikimedia.org/T142347) (owner: Aklapper)
[02:20:35] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.13) (duration: 09m 21s)
[02:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:26:21] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Aug 8 02:26:21 UTC 2016 (duration 5m 46s)
[02:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:46:27] PROBLEM - puppet last run on mw2109 is CRITICAL: CRITICAL: puppet fail
[03:16:05] RECOVERY - puppet last run on mw2109 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:12:56] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 288 bytes in 0.050 second response time
[05:12:56] Operations, Editing-Analysis: Connection time out to stat1003 - https://phabricator.wikimedia.org/T142126#2531945 (HJiang-WMF) I logged in to stat1002 previously, but the login process at stat1003 has been so slow that I'm not sure if it has ever logged me in at all. So looks like that it is not ever try...
[06:30:23] Blocked-on-Operations, Operations, Wikidata, Wikimedia-Language-setup, and 2 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2531976 (Dereckson)
[06:32:27] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Not Available - 531 bytes in 0.050 second response time
[06:43:36] PROBLEM - puppet last run on analytics1049 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:52:05] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3677 bytes in 0.083 second response time
[07:02:46] (Abandoned) Giuseppe Lavagetto: [WiP] Add ipvs-related FSM [debs/pybal] - https://gerrit.wikimedia.org/r/272679 (owner: Giuseppe Lavagetto)
[07:09:17] RECOVERY - puppet last run on analytics1049 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:12:25] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy
[07:13:15] RECOVERY - AQS root url on aqs1004 is OK: HTTP OK: HTTP/1.1 200 - 727 bytes in 0.026 second response time
[07:13:33] this is me --^
[07:13:40] (non live cluster)
[07:16:44] !log installing php5 security updates
[07:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:39:56] PROBLEM - puppet last run on mw2128 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:40:15] PROBLEM - puppet last run on mw2171 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:40:16] PROBLEM - puppet last run on mw2066 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:40:55] PROBLEM - puppet last run on mw2104 is CRITICAL: CRITICAL: Puppet has 4 failures
[07:41:26] PROBLEM - puppet last run on mw2156 is CRITICAL: CRITICAL: Puppet has 4 failures
[07:42:14] Operations, Ops-Access-Requests: Access for p858snake to chanops in #wikimedia-operations - https://phabricator.wikimedia.org/T142270#2531984 (ema) p:Triage>Normal
[07:59:05] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[07:59:37] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[08:01:45] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:03:06] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:05:57] RECOVERY - puppet last run on mw2128 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[08:06:16] RECOVERY - puppet last run on mw2171 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[08:06:25] RECOVERY - puppet last run on mw2066 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[08:06:56] RECOVERY - puppet last run on mw2104 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:07:26] RECOVERY - puppet last run on mw2156 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:26:35] Operations, Discovery, Traffic, Discovery-Search-Sprint: Setup load balancing for elasticsearch service on relforge servers - https://phabricator.wikimedia.org/T142098#2532030 (Gehel)
[08:27:26] PROBLEM - NTP on ganeti1002 is CRITICAL: NTP CRITICAL: Offset unknown
[08:44:52] Operations, Discovery, Traffic, Discovery-Search-Sprint: Setup load balancing for elasticsearch service on relforge servers - https://phabricator.wikimedia.org/T142098#2532106 (Gehel) The relforge cluster is not H/A, with 2 nodes we can only have a single master if we want to protect against a spli...
[08:46:00] (Abandoned) Giuseppe Lavagetto: mediawiki::hhvm: remove the warmup job [puppet] - https://gerrit.wikimedia.org/r/282859 (owner: Giuseppe Lavagetto)
[08:49:26] (Abandoned) Gehel: Switching search traffic to codfw as eqiad seems unstable. [mediawiki-config] - https://gerrit.wikimedia.org/r/303347 (owner: Gehel)
[08:50:39] Operations, Discovery, Traffic, Discovery-Search-Sprint: Setup load balancing for elasticsearch service on relforge servers - https://phabricator.wikimedia.org/T142098#2532123 (dcausse) @Gehel it's perfectly OK to run without H/A, this cluster is not meant to serve production traffic but only opt...
[08:59:00] (CR) Alexandros Kosiaris: [C: 2] Enable base::firewall for palladium [puppet] - https://gerrit.wikimedia.org/r/302394 (owner: Muehlenhoff)
[08:59:06] (PS2) Alexandros Kosiaris: Enable base::firewall for palladium [puppet] - https://gerrit.wikimedia.org/r/302394 (owner: Muehlenhoff)
[08:59:10] (CR) Alexandros Kosiaris: [V: 2] Enable base::firewall for palladium [puppet] - https://gerrit.wikimedia.org/r/302394 (owner: Muehlenhoff)
[09:03:54] (PS2) Muehlenhoff: druid: Limit to domain networks [puppet] - https://gerrit.wikimedia.org/r/303144
[09:06:32] (PS3) Muehlenhoff: druid: Limit to analytics networks [puppet] - https://gerrit.wikimedia.org/r/303144
[09:09:49] (Abandoned) Giuseppe Lavagetto: Remove mw1001-1008 [dns] - https://gerrit.wikimedia.org/r/296384 (owner: Giuseppe Lavagetto)
[09:11:19] Operations, Puppet, Puppet-infrastructure-modernization: Goal: Modernize puppet configuration management infrastructure - https://phabricator.wikimedia.org/T139471#2532149 (Joe)
[09:11:22] Operations, Puppet, Patch-For-Review, Puppet-infrastructure-modernization: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#2532148 (Joe) Open>Resolved
[09:13:38] !log restarting elasticsearch on logstash100[56] to pick up java security updates
[09:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:17:56] jouncebot: next
[09:17:57] In 5 hour(s) and 42 minute(s): Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160808T1500)
[09:27:35] PROBLEM - puppet last run on ms-be2016 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:29:13] Operations: Ferm rules for palladium - https://phabricator.wikimedia.org/T113344#2532173 (MoritzMuehlenhoff) Open>Resolved palladium now uses base::firewall
[09:30:16] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 35 threshold =0.1% breach: status: yellow, number_of_nodes: 5, unassigned_shards: 35, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number
[09:30:16] PROBLEM - ElasticSearch health check for shards on logstash1006 is CRITICAL: CRITICAL - elasticsearch inactive shards 35 threshold =0.1% breach: status: yellow, number_of_nodes: 5, unassigned_shards: 35, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number
[09:30:16] PROBLEM - ElasticSearch health check for shards on logstash1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 35 threshold =0.1% breach: status: yellow, number_of_nodes: 5, unassigned_shards: 35, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number
[09:32:27] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 35 threshold =0.1% breach: status: yellow, number_of_nodes: 5, unassigned_shards: 35, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number
[09:32:46] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 35 threshold =0.1% breach: status: yellow, number_of_nodes: 5, unassigned_shards: 35, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number
[09:32:52] ^ should recover soon
[09:32:57] PROBLEM - ElasticSearch health check for shards on logstash1005 is CRITICAL: CRITICAL - elasticsearch http://10.64.16.185:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.16.185, port=9200): Max retries exceeded with url: /_cluster/health (Caused by class socket.error: [Errno 111] Connection refused)
[09:34:02] Operations: Install puppetDB at WMF - https://phabricator.wikimedia.org/T139476#2532181 (Joe) Since we're using puppet `3.x`, we need to use puppetDB `2.3`. A debian package for it is available for trusty and can be easily adapted to jessie - it's a pretty raw package as it's basically just packing toghethe...
[09:34:34] <_joe_> gehel: any idea what's up with logstash?
[09:34:48] opening a bug currently
[09:34:48] !log re-imaging aqs1005 to migrate Cassandra partitions to RAID10 (T142075)
[09:34:49] T142075: Replace RAID0 arrays with RAID10 on aqs100[456] - https://phabricator.wikimedia.org/T142075
[09:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:35:47] yes, rolling restart by moritzm, experiencing https://github.com/elastic/elasticsearch/issues/19829
[09:36:37] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 6, number_of_pending_tasks: 40, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, task_max_waiting_in_queue_millis: 178126, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.4761904762
[09:36:56] RECOVERY - ElasticSearch health check for shards on logstash1005 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 98.0952380952, acti
[09:36:57] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 98.0952380952, acti
[09:37:06] RECOVERY - ElasticSearch health check for shards on logstash1004 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 98.0952380952, acti
[09:37:06] RECOVERY - ElasticSearch health check for shards on logstash1006 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 98.0952380952, acti
[09:38:13] Operations, Wikimedia-Logstash: Elasticsearch restarts are failing in the logstash cluster - https://phabricator.wikimedia.org/T142357#2532191 (MoritzMuehlenhoff)
[09:38:18] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 98.0952380952, acti
[09:39:56] Operations, Wikimedia-Logstash: Elasticsearch restarts are failing in the logstash cluster - https://phabricator.wikimedia.org/T142357#2532205 (MoritzMuehlenhoff)
[09:40:26] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:40:28] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:40:49] mmmm checking, this is the non live cluster so all good
[09:41:38] !log legoktm@tin Synchronized php-1.28.0-wmf.13/includes/revisiondelete/: Fix inconsistent RevDelFileItem visibilities - T142228 (duration: 00m 50s)
[09:41:39] T142228: Fatal error: Access level to RevDelArchivedFileItem::$file() must be public (as in class RevDelFileItem) or weaker in /srv/mediawiki/php-1.28.0-wmf.13/includes/revisiondelete/RevDelArchivedFileItem.php on line 25 - https://phabricator.wikimedia.org/T142228
[09:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:44:16] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy
[09:44:53] (PS1) Legoktm: Fix missed $wmg -> $wg for $wgRestbaseServer [mediawiki-config] - https://gerrit.wikimedia.org/r/303521
[09:46:00] (CR) Legoktm: [C: 2] Fix missed $wmg -> $wg for $wgRestbaseServer [mediawiki-config] - https://gerrit.wikimedia.org/r/303521 (owner: Legoktm)
[09:46:30] (Merged) jenkins-bot: Fix missed $wmg -> $wg for $wgRestbaseServer [mediawiki-config] - https://gerrit.wikimedia.org/r/303521 (owner: Legoktm)
[09:47:55] Operations, Discovery, Discovery-Search-Backlog, Elasticsearch, Easy: Improve Elasticsearch icinga alerting - https://phabricator.wikimedia.org/T133844#2532211 (Gehel) Keeping track of curl `elastic1030.eqiad.wmnet:9200/_cluster/stats | jq .indices.completion.size_in_bytes` could help diagnos...
[09:48:10] !log legoktm@tin Synchronized wmf-config/CommonSettings.php: Fix missed $wmg -> $wg for $wgRestbaseServer (duration: 00m 49s)
[09:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:48:27] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy
[09:49:18] (CR) Legoktm: "You missed one $wmg -> $wg: https://gerrit.wikimedia.org/r/303521" [mediawiki-config] - https://gerrit.wikimedia.org/r/298095 (https://phabricator.wikimedia.org/T139800) (owner: Reedy)
[09:51:43] !log uploaded to apt.wikimedia.org jessie-wikimedia: php5_5.3.10-1ubuntu3.24+wmf1
[09:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:51:51] moritzm: ^
[09:51:57] er.. precise-wikimedia
[09:51:58] RECOVERY - puppet last run on ms-be2016 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[09:51:59] the log is wrong
[09:52:07] !log uploaded to apt.wikimedia.org precise-wikimedia: php5_5.3.10-1ubuntu3.24+wmf1
[09:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:54:27] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:54:41] ok, thanks, will upgrade the four remaining precise systems with PHP later on
[09:57:39] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp2002.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid'])
[09:57:41] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp2003.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid'])
[09:57:43] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp2004.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid'])
[09:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:57:45] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp2005.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid'])
[09:57:47] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp2006.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid'])
[09:57:48] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp2007.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid'])
[09:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:57:50] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp2008.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid'])
[09:57:52] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp2009.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid'])
[09:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:57:54] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp2010.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid'])
[09:57:55] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp2011.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid'])
[09:57:57] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp2012.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid'])
[09:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:57:58] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp2013.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid'])
[09:58:01] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp2014.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid'])
[09:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:58:02] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp2015.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid'])
[09:58:04] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp2016.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid'])
[09:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:58:05] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp2017.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid'])
[09:58:08] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp2018.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid'])
[09:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:58:10] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp2019.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid'])
[09:58:13] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp2020.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid'])
[09:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:59:39] (CR) Alexandros Kosiaris: [C: 2] ores: puppet changes for the refactor [puppet] - https://gerrit.wikimedia.org/r/303356 (https://phabricator.wikimedia.org/T141575) (owner: Ladsgroup)
[09:59:44] (PS2) Alexandros Kosiaris: ores: puppet changes for the refactor [puppet] - https://gerrit.wikimedia.org/r/303356 (https://phabricator.wikimedia.org/T141575) (owner: Ladsgroup)
[09:59:46] (CR) Alexandros Kosiaris: [V: 2] ores: puppet changes for the refactor [puppet] - https://gerrit.wikimedia.org/r/303356 (https://phabricator.wikimedia.org/T141575) (owner: Ladsgroup)
[10:02:37] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:04:27] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy
[10:04:36] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy
[10:05:28] (PS1) Alexandros Kosiaris: parsoid: Set all wtp20XX boxes as jessie [puppet] - https://gerrit.wikimedia.org/r/303528 (https://phabricator.wikimedia.org/T135176)
[10:05:28] !log T135176 depool wtp2001-wtp2020
[10:05:29] T135176: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176
[10:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:07:09] Puppet, ORES, Revision-Scoring-As-A-Service: Change CP to do several models at once. - https://phabricator.wikimedia.org/T142360#2532251 (Ladsgroup)
[10:08:53] (CR) Alexandros Kosiaris: [C: 2] parsoid: Set all wtp20XX boxes as jessie [puppet] - https://gerrit.wikimedia.org/r/303528 (https://phabricator.wikimedia.org/T135176) (owner: Alexandros Kosiaris)
[10:09:54] Puppet, ORES, Revision-Scoring-As-A-Service: Increase web and worker processes in production - https://phabricator.wikimedia.org/T142361#2532269 (Ladsgroup)
[10:10:38] !log legoktm@tin Synchronized php-1.28.0-wmf.13/extensions/Cite/Cite_body.php: Cite::referencesFormatEntry: Avoid Undefined index: key - T132583 (duration: 00m 49s)
[10:10:40] T132583: Undefined index: {count,key} in Cite::referencesFormatEntry method - https://phabricator.wikimedia.org/T132583
[10:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:10:58] Operations, Phabricator-Bot-Requests: Creation of bot for Operations - https://phabricator.wikimedia.org/T142362#2532287 (Volans)
[10:12:08] Operations: create puppetDB puppet role + debian package - https://phabricator.wikimedia.org/T142363#2532302 (Joe)
[10:13:53] Operations, Services, Services-next, Security: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2532327 (Lea_WMDE)
[10:16:58] Operations, vm-requests: eqiad+codfw: 1 VM %request for puppetDB - https://phabricator.wikimedia.org/T142365#2532329 (Joe)
[10:17:14] Operations, vm-requests: eqiad+codfw: 1 VM request for puppetDB - https://phabricator.wikimedia.org/T142365#2532341 (Joe)
[10:17:40] Operations: Install puppetDB at WMF - https://phabricator.wikimedia.org/T139476#2532343 (Joe)
[10:17:42] Operations, vm-requests: eqiad+codfw: 1 VM request for puppetDB - https://phabricator.wikimedia.org/T142365#2532329 (Joe)
[10:20:56] RECOVERY - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.061 second response time
[10:23:50] Operations, vm-requests: eqiad+codfw: 1 VM request for puppetDB - https://phabricator.wikimedia.org/T142365#2532350 (akosiaris)
[10:30:48] Operations, vm-requests: eqiad+codfw: 1 VM request for puppetDB - https://phabricator.wikimedia.org/T142365#2532329 (akosiaris) Updated the task to require 40GB per VM. The rest looks fine, will proceed with the creation. This looks like misc VMs so names suggested: `nitrogen` and `nihal`. `nitrogen` has...
[10:31:39] Operations, Services: Migrate SCA cluster to SCB (Jessie and Node 4.2) - https://phabricator.wikimedia.org/T96017#2532359 (mobrovac)
[10:46:06] (PS1) Giuseppe Lavagetto: Add nitrogen/nihal, puppetdb VMs [dns] - https://gerrit.wikimedia.org/r/303530 (https://phabricator.wikimedia.org/T142365)
[10:46:08] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[10:48:07] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[10:49:43] (CR) Giuseppe Lavagetto: [C: 2] Add nitrogen/nihal, puppetdb VMs [dns] - https://gerrit.wikimedia.org/r/303530 (https://phabricator.wikimedia.org/T142365) (owner: Giuseppe Lavagetto)
[10:59:07] PROBLEM - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 185 bytes in 1.666 second response time
[11:00:16] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0]
[11:08:08] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[11:08:12] <_joe_> !log creating nitrogen.eqiad.wmnet as a VM for puppetdb, T142365
[11:08:13] T142365: eqiad+codfw: 1 VM request for puppetDB - https://phabricator.wikimedia.org/T142365
[11:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:17:21] <_joe_> !log creating nihal.codfw.wmnet as a VM for puppetdb, T142365
[11:17:22] T142365: eqiad+codfw: 1 VM request for puppetDB - https://phabricator.wikimedia.org/T142365
[11:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:18:31] Operations, Phabricator-Bot-Requests: Creation of bot for Operations - https://phabricator.wikimedia.org/T142362#2532419 (Aklapper) p:Triage>Normal a:Aklapper
[11:30:17] _joe_: nihal?
[11:31:12] (PS1) Muehlenhoff: Generate stats for monthly package upgrade activity [puppet] - https://gerrit.wikimedia.org/r/303531 (https://phabricator.wikimedia.org/T116742)
[11:31:42] <_joe_> mobrovac: codfw naming
[11:32:08] <_joe_> mobrovac: https://en.wikipedia.org/wiki/Beta_Leporis
[11:32:15] (CR) jenkins-bot: [V: -1] Generate stats for monthly package upgrade activity [puppet] - https://gerrit.wikimedia.org/r/303531 (https://phabricator.wikimedia.org/T116742) (owner: Muehlenhoff)
[11:32:26] ah ok!
[11:32:33] that explains it
[11:33:24] (PS2) Muehlenhoff: Generate stats for monthly package upgrade activity [puppet] - https://gerrit.wikimedia.org/r/303531 (https://phabricator.wikimedia.org/T116742)
[11:34:53] Operations, DBA, Performance-Team, Performance: number of database updates multiplied x3 since 29 October - https://phabricator.wikimedia.org/T117398#2532445 (Danny_B)
[11:39:04] _joe_: would you have some time this week to take a look at the puppet compiler-for-labs patch? (https://gerrit.wikimedia.org/r/#/c/297902/)
[11:40:14] I'm happy to babysit the actual deployment (although I don't have +2 on that repo), but a second pair of eyes is very welcome.
[11:45:17] !log upgrading firejail on scb* in codfw
[11:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:45:58] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0]
[11:49:57] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[11:55:56] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0]
[12:02:55] Operations, Project-Admins, Puppet-infrastructure-modernization: Create a Goal project for technical operations Q1 goal "Modernize Puppet configuration management infrastructure" - https://phabricator.wikimedia.org/T139728#2532868 (Danny_B)
[12:03:04] Operations, Discovery, Discovery-Search-Backlog, Elasticsearch, and 2 others: Enable access to relforge clusters from virtual machines running on labs - https://phabricator.wikimedia.org/T142211#2532869 (Gehel) Could someone in #netops open traffic from labs-instance to the 2 relforge servers on...
[12:04:07] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[12:07:27] Operations, WMF-Legal, WMF-NDA-Requests: Add cscott to WMF_NDA. - https://phabricator.wikimedia.org/T87479#2532880 (Danny_B)
[12:07:45] Operations, WMF-Legal, WMF-NDA-Requests: ZhouZ needs access to WMF-NDA group - https://phabricator.wikimedia.org/T98722#2532885 (Danny_B)
[12:19:29] !log upgrading firejail on scb1001 (along with service restarts except changeprop)
[12:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:29:01] Operations, Discovery, Discovery-Search-Backlog, Elasticsearch, and 2 others: Enable access to relforge clusters from virtual machines running on labs - https://phabricator.wikimedia.org/T142211#2532907 (Gehel) I realize that some context is probably missing here. The detailed discussion about th...
[12:29:05] <_joe_> valhallasw`cloud: will do!
[12:42:07] (PS1) Muehlenhoff: statsd proxy: Limit to production networks [puppet] - https://gerrit.wikimedia.org/r/303532
[12:49:04] Operations, DBA, Performance-Team, Performance: number of database updates multiplied x3 since 29 October - https://phabricator.wikimedia.org/T117398#2532919 (hoo) Is this still relevant? Might have been fixed with {T125838}.
[12:51:57] Operations, DBA, Performance-Team, Performance: number of database updates multiplied x3 since 29 October - https://phabricator.wikimedia.org/T117398#2532922 (jcrespo) Open>Resolved a:jcrespo This is not ongoing.
[12:52:14] Operations, Traffic: Varnish 4 stalls with two consecutive Range requests using HTTP persistent connections - https://phabricator.wikimedia.org/T142233#2532925 (ema) Issue [[https://github.com/varnishcache/varnish-cache/issues/2035 | submitted upstream]].
[12:52:16] !log upgrading firejail on scb1002 (along with service restarts except changeprop) [12:52:18] (03PS1) 10Ladsgroup: Add 'extendedconfirmed' user group for fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303535 (https://phabricator.wikimedia.org/T140839) [12:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:58:06] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0] [13:00:16] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [13:08:00] 06Operations, 10vm-requests: eqiad+codfw: 1 VM request for puppetDB - https://phabricator.wikimedia.org/T142365#2532972 (10Joe) a:03Joe [13:11:17] 06Operations, 10ops-eqiad: Investigate strontium disk issues on 2016-08-05 - https://phabricator.wikimedia.org/T142187#2532974 (10Cmjohnson) The 2 disks are most likely failed. The server is from the original build and should be decommissioned and a new misc server allocated to replace the server. Linking this... 
[13:17:49] (03PS1) 10Giuseppe Lavagetto: install_server: add dhcp entries for the puppetdb VMs [puppet] - 10https://gerrit.wikimedia.org/r/303538 (https://phabricator.wikimedia.org/T142365) [13:17:51] (03PS1) 10Giuseppe Lavagetto: install_server: sort dhcp entries alphabetically [puppet] - 10https://gerrit.wikimedia.org/r/303539 [13:22:55] (03CR) 10Giuseppe Lavagetto: [C: 032] install_server: add dhcp entries for the puppetdb VMs [puppet] - 10https://gerrit.wikimedia.org/r/303538 (https://phabricator.wikimedia.org/T142365) (owner: 10Giuseppe Lavagetto) [13:23:01] (03PS2) 10Giuseppe Lavagetto: install_server: add dhcp entries for the puppetdb VMs [puppet] - 10https://gerrit.wikimedia.org/r/303538 (https://phabricator.wikimedia.org/T142365) [13:24:15] 06Operations: eqiad: Install SSD's into ganeti hosts - https://phabricator.wikimedia.org/T138414#2532989 (10Cmjohnson) @akosiaris ganeti1001 disks have been replaced. [13:24:29] 06Operations: eqiad: Install SSD's into ganeti hosts - https://phabricator.wikimedia.org/T138414#2532990 (10Cmjohnson) [13:25:32] 06Operations, 10hardware-requests: EQIAD: (2) hardware access request for PUPPET - https://phabricator.wikimedia.org/T142218#2532992 (10akosiaris) As pointed out by @Cmjohnson in T142187, strontium's disks are dead. This task is for the refreshing of that system alongside palladium and is part of our quarterl... 
[13:29:39] 06Operations, 10vm-requests, 05Puppet-infrastructure-modernization: eqiad+codfw: 1 VM request for puppetDB - https://phabricator.wikimedia.org/T142365#2533027 (10Joe) p:05Triage>03High [13:30:01] (03PS1) 10Giuseppe Lavagetto: install_server: add partman entries for nihal/nitrogen [puppet] - 10https://gerrit.wikimedia.org/r/303542 (https://phabricator.wikimedia.org/T142365) [13:30:41] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] install_server: add partman entries for nihal/nitrogen [puppet] - 10https://gerrit.wikimedia.org/r/303542 (https://phabricator.wikimedia.org/T142365) (owner: 10Giuseppe Lavagetto) [13:32:05] 06Operations, 06Labs: Enable root passwords on Labs VMs - https://phabricator.wikimedia.org/T142216#2533032 (10MoritzMuehlenhoff) [13:36:15] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, and 2 others: Enable access to relforge clusters from virtual machines running on labs - https://phabricator.wikimedia.org/T142211#2527394 (10BBlack) We don't in general open firewall holes between prod and labs like that. This may ne... [13:42:40] 06Operations, 06Services, 06Services-next, 05Security: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2533056 (10Lea_WMDE) [13:44:49] 06Operations, 10Wikimedia-Logstash: Elasticsearch restarts are failing in the logstash cluster - https://phabricator.wikimedia.org/T142357#2533058 (10Gehel) Uploaded diagnostic dump after reviewing it with @dcausse to ensure no confidential info is present: {F4345184} [13:45:50] 06Operations, 06Labs: Enable root passwords on Labs VMs - https://phabricator.wikimedia.org/T142216#2533061 (10MoritzMuehlenhoff) We could implement this as such: Initially we need to create a PGP key for the encrypted passwords. The public key would be distributed to the VMs via puppet. The secret key would b... 
[13:53:14] (03CR) 10Ottomata: [C: 031] druid: Limit to analytics networks [puppet] - 10https://gerrit.wikimedia.org/r/303144 (owner: 10Muehlenhoff) [13:53:51] (03CR) 10Alexandros Kosiaris: "16:09:17 Build timed out (after 30 minutes). Marking the build as failed." [debs/contenttranslation/giella-sme] - 10https://gerrit.wikimedia.org/r/294430 (https://phabricator.wikimedia.org/T120087) (owner: 10KartikMistry) [13:54:24] 06Operations, 06Services, 06Services-next, 05Security: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2533068 (10mobrovac) For the deployment part, I guess we'll need to follow [our own guidelines](https://wikitech.wikimedia.org/wiki/Ser... [13:55:14] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-eus] - 10https://gerrit.wikimedia.org/r/294673 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [13:55:16] (03PS2) 10Giuseppe Lavagetto: install_server: sort dhcp entries alphabetically [puppet] - 10https://gerrit.wikimedia.org/r/303539 [13:56:19] (03PS2) 10KartikMistry: apertium-pt-ca: Rebuild for Jessie, cleanup. 
[debs/contenttranslation/apertium-pt-ca] - 10https://gerrit.wikimedia.org/r/296164 (https://phabricator.wikimedia.org/T107306) [13:57:11] (03PS2) 10KartikMistry: apertium-fr-es: New upstream and rebuild for Jessie [debs/contenttranslation/apertium-fr-es] - 10https://gerrit.wikimedia.org/r/295220 (https://phabricator.wikimedia.org/T107306) [13:57:38] (03CR) 10Giuseppe Lavagetto: [C: 032] install_server: sort dhcp entries alphabetically [puppet] - 10https://gerrit.wikimedia.org/r/303539 (owner: 10Giuseppe Lavagetto) [13:57:41] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-fra-cat] - 10https://gerrit.wikimedia.org/r/294425 (https://phabricator.wikimedia.org/T137768) (owner: 10KartikMistry) [13:58:04] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-hbs] - 10https://gerrit.wikimedia.org/r/294675 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [13:59:40] (03CR) 10Alexandros Kosiaris: [C: 04-1] "15:42:33 gbp:error: upstream/0.1.0_r57598 is not a valid treeish" [debs/contenttranslation/apertium-hbs-eng] - 10https://gerrit.wikimedia.org/r/296049 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [14:02:58] (03CR) 10Alexandros Kosiaris: [C: 04-1] "15:42:29 gbp:error: upstream/0.5.0_r59294 is not a valid treeish" [debs/contenttranslation/apertium-hbs-slv] - 10https://gerrit.wikimedia.org/r/296203 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [14:04:34] PROBLEM - parsoid on wtp2018 is CRITICAL: Connection refused [14:04:54] PROBLEM - puppet last run on wtp2018 is CRITICAL: CRITICAL: puppet fail [14:06:34] RECOVERY - parsoid on wtp2018 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.083 second response time [14:06:54] RECOVERY - puppet last run on wtp2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:06:54] (03CR) 10Alexandros Kosiaris: [C: 04-1] "15:42:25 dpkg-source: info: local changes detected, the modified files are:" 
[debs/contenttranslation/apertium-hin] - 10https://gerrit.wikimedia.org/r/296228 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [14:07:50] (03CR) 10Alexandros Kosiaris: [C: 04-1] "15:41:58 dpkg-source: error: cannot represent change to tests/morphotactics/PRON-REF.txt.gz: binary file contents changed" [debs/contenttranslation/apertium-kaz] - 10https://gerrit.wikimedia.org/r/296366 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [14:08:46] (03CR) 10Alexandros Kosiaris: [C: 04-1] "15:41:18 dpkg-source: info: local changes detected, the modified files are:" [debs/contenttranslation/apertium-nob] - 10https://gerrit.wikimedia.org/r/269914 (https://phabricator.wikimedia.org/T124317) (owner: 10KartikMistry) [14:10:16] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-fra] - 10https://gerrit.wikimedia.org/r/294252 (https://phabricator.wikimedia.org/T137768) (owner: 10KartikMistry) [14:11:01] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-fr-es: New upstream and rebuild for Jessie [debs/contenttranslation/apertium-fr-es] - 10https://gerrit.wikimedia.org/r/295220 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [14:12:25] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-pt-ca: Rebuild for Jessie, cleanup. 
[debs/contenttranslation/apertium-pt-ca] - 10https://gerrit.wikimedia.org/r/296164 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [14:13:56] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-isl] - 10https://gerrit.wikimedia.org/r/296050 (https://phabricator.wikimedia.org/T107306) (owner: 10KartikMistry) [14:17:17] (03PS1) 10Ema: admin: add shell account for Niharika Kohli [puppet] - 10https://gerrit.wikimedia.org/r/303543 (https://phabricator.wikimedia.org/T141593) [14:27:08] (03PS1) 10Ottomata: Remove extraneous eventlogging configuration [puppet] - 10https://gerrit.wikimedia.org/r/303544 [14:33:21] Wow… reponse time between API servers differ by a factor of 5 [14:35:28] (03PS2) 10Ema: admin: add shell account for Niharika Kohli [puppet] - 10https://gerrit.wikimedia.org/r/303543 (https://phabricator.wikimedia.org/T141593) [14:37:04] (03PS2) 10Faidon Liambotis: remove temporary fundraising DKIM record from wikipedia.org zone template [dns] - 10https://gerrit.wikimedia.org/r/303159 (https://phabricator.wikimedia.org/T135410) (owner: 10Jgreen) [14:37:39] (03CR) 10Faidon Liambotis: [C: 032] remove temporary fundraising DKIM record from wikipedia.org zone template [dns] - 10https://gerrit.wikimedia.org/r/303159 (https://phabricator.wikimedia.org/T135410) (owner: 10Jgreen) [14:38:51] (03PS1) 10Muehlenhoff: Support scaling of huge SVGs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303548 (https://phabricator.wikimedia.org/T111815) [14:44:43] (03CR) 10Faidon Liambotis: [C: 032] "Really nice work." [puppet] - 10https://gerrit.wikimedia.org/r/303147 (https://phabricator.wikimedia.org/T142085) (owner: 10Volans) [14:46:21] 06Operations, 06Labs: Enable root passwords on Labs VMs - https://phabricator.wikimedia.org/T142216#2533204 (10Andrew) @MoritzMuehlenhoff -- that sounds just like what I was envisioning. Would you be willing to spoon-feed me the commands I need (e.g. generating a key, encrypting/decrypting, etc)? 
[14:47:47] (03PS4) 10Volans: Monitoring: Add NRPE commands to get RAID status [puppet] - 10https://gerrit.wikimedia.org/r/303147 (https://phabricator.wikimedia.org/T142085) [14:48:53] volans: I left some comments there, not sure if you saw them :) [14:50:12] paravoid: yes, too kind :) I was just rebasing now [14:52:17] 06Operations, 06Labs: Enable root passwords on Labs VMs - https://phabricator.wikimedia.org/T142216#2527493 (10faidon) Generating the root password locally and passing the (encrypted) root password over the puppet logs doesn't sound that great to me. Why wouldn't we generate the cleartext password on the pup... [14:52:20] 06Operations, 06Commons, 10Wikimedia-SVG-rendering, 07User-notice: SVG files larger than 10 MB cannot be thumbnailed - https://phabricator.wikimedia.org/T111815#2533252 (10Jdforrester-WMF) [14:56:26] 06Operations, 06Labs: Enable root passwords on Labs VMs - https://phabricator.wikimedia.org/T142216#2533340 (10Andrew) @faidon, my design was based on things I already know how to do :) You're suggesting puppet gymnastics that I don't have any experience with, can you be more specific? (Also, for what it's w... [14:57:22] (03PS1) 10Ottomata: 1.3.0 release [debs/python-kafka] (debian) - 10https://gerrit.wikimedia.org/r/303550 [15:00:04] anomie, ostriches, thcipriani, hashar, and twentyafterfour: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160808T1500). [15:00:05] James_F and Amir1: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [15:00:11] * James_F waves [15:00:25] hey [15:01:36] I can SWAT today. 
[15:02:20] (03PS4) 10Thcipriani: MoodBar: Disable on all wikis except nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301893 (https://phabricator.wikimedia.org/T131340) (owner: 10Jforrester) [15:02:31] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301893 (https://phabricator.wikimedia.org/T131340) (owner: 10Jforrester) [15:02:58] (03Merged) 10jenkins-bot: MoodBar: Disable on all wikis except nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301893 (https://phabricator.wikimedia.org/T131340) (owner: 10Jforrester) [15:04:47] James_F: ^ is live on mw1099, check please [15:05:31] thcipriani: Yup, it's gone from frwikisource but not nlwiki, as expected. [15:05:32] PROBLEM - parsoid on wtp2007 is CRITICAL: Connection refused [15:05:43] PROBLEM - puppet last run on wtp2007 is CRITICAL: CRITICAL: puppet fail [15:05:44] ack, ok, going live everywhere. [15:05:53] PROBLEM - parsoid on wtp2014 is CRITICAL: Connection refused [15:06:13] PROBLEM - parsoid on wtp2005 is CRITICAL: Connection refused [15:06:13] PROBLEM - puppet last run on wtp2014 is CRITICAL: CRITICAL: puppet fail [15:06:33] PROBLEM - puppet last run on wtp2005 is CRITICAL: CRITICAL: puppet fail [15:06:43] PROBLEM - parsoid on wtp2010 is CRITICAL: Connection refused [15:07:02] PROBLEM - parsoid on wtp2020 is CRITICAL: Connection refused [15:07:02] PROBLEM - puppet last run on wtp2010 is CRITICAL: CRITICAL: puppet fail [15:07:14] PROBLEM - parsoid on wtp2016 is CRITICAL: Connection refused [15:07:22] PROBLEM - puppet last run on wtp2020 is CRITICAL: CRITICAL: puppet fail [15:07:33] PROBLEM - parsoid on wtp2009 is CRITICAL: Connection refused [15:07:33] PROBLEM - puppet last run on wtp2016 is CRITICAL: CRITICAL: puppet fail [15:07:36] codfw parsoid was just moved to node 4.x, right? Is this expected? 
[15:07:53] PROBLEM - puppet last run on wtp2009 is CRITICAL: CRITICAL: puppet fail [15:07:56] (03CR) 10Ottomata: [C: 032] 1.3.0 release [debs/python-kafka] (debian) - 10https://gerrit.wikimedia.org/r/303550 (owner: 10Ottomata) [15:08:00] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:301893|MoodBar: Disable on all wikis except nlwiki (T131340)]] (duration: 01m 08s) [15:08:01] T131340: De-deploy MoodBar from WMF wikis - https://phabricator.wikimedia.org/T131340 [15:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:08:06] ^ James_F live everywhere [15:08:20] thcipriani: Yup, LGTM. [15:08:31] awesome, thanks for checking. [15:09:21] (03PS3) 10Thcipriani: Simplify the VE RB URL config some more, now that we no longer use wgServerName [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294713 (owner: 10Alex Monk) [15:09:39] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294713 (owner: 10Alex Monk) [15:09:44] RECOVERY - parsoid on wtp2014 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.098 second response time [15:10:04] (03Merged) 10jenkins-bot: Simplify the VE RB URL config some more, now that we no longer use wgServerName [mediawiki-config] - 10https://gerrit.wikimedia.org/r/294713 (owner: 10Alex Monk) [15:10:13] RECOVERY - puppet last run on wtp2014 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:11:06] James_F: https://gerrit.wikimedia.org/r/#/c/294713 is live on mw1099, check please [15:11:14] * James_F is doing so. [15:11:23] RECOVERY - parsoid on wtp2007 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.116 second response time [15:11:26] Yup, LGTM. [15:11:34] RECOVERY - puppet last run on wtp2007 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [15:11:41] okie doke, going live everywhere [15:11:48] Ta. 
[15:12:08] (03CR) 10Ottomata: [C: 032] Remove extraneous eventlogging configuration [puppet] - 10https://gerrit.wikimedia.org/r/303544 (owner: 10Ottomata) [15:12:39] (03PS1) 10Muehlenhoff: Inline firejail profile no longer shipped in firejail 0.9.40 [puppet] - 10https://gerrit.wikimedia.org/r/303553 (https://phabricator.wikimedia.org/T121756) [15:12:55] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-fr-es_0.9.2~r61322-1+wmf1 [15:12:55] !log T107306 uploaded to apt.wikimedia.org jessie-wikimedia: apertium-pt-ca_0.8.2+svn~57507-1+wmf1 [15:12:56] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:12:57] T107306: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306 [15:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:13:22] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:294713|Simplify the VE RB URL config some more, now that we no longer use wgServerName]] (duration: 00m 48s) [15:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:13:30] ^ James_F live everywhere now [15:14:04] RECOVERY - parsoid on wtp2005 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.132 second response time [15:14:08] Thanks! 
[15:14:23] RECOVERY - puppet last run on wtp2005 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [15:14:53] RECOVERY - parsoid on wtp2020 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.090 second response time [15:15:00] (03PS2) 10Thcipriani: Add 'extendedconfirmed' user group for fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303535 (https://phabricator.wikimedia.org/T140839) (owner: 10Ladsgroup) [15:15:13] RECOVERY - puppet last run on wtp2020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:15:31] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303535 (https://phabricator.wikimedia.org/T140839) (owner: 10Ladsgroup) [15:15:55] (03Merged) 10jenkins-bot: Add 'extendedconfirmed' user group for fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303535 (https://phabricator.wikimedia.org/T140839) (owner: 10Ladsgroup) [15:16:42] RECOVERY - parsoid on wtp2010 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.153 second response time [15:16:54] RECOVERY - puppet last run on wtp2010 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [15:17:13] RECOVERY - parsoid on wtp2016 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.091 second response time [15:17:30] 06Operations, 06Labs: Enable root passwords on Labs VMs - https://phabricator.wikimedia.org/T142216#2533422 (10faidon) So, a simple way to do this would be: ``` user { 'root': password => generate('/usr/local/bin/password_for_labs', $fqdn) } ``` Where `/usr/local/bin/password_for_labs` could be a script i... 
[15:17:33] RECOVERY - puppet last run on wtp2016 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [15:17:33] Amir1: extendedconfirm patch is live on mw1099, please check [15:17:46] sure, on it [15:17:53] PROBLEM - Puppet catalogue fetch on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/labs-puppetmaster/eqiad - 185 bytes in 0.581 second response time [15:19:35] the test might take some time [15:19:44] (because it's lots of places) [15:20:04] ack, np [15:20:41] it's okay to go live now [15:21:07] ottomata: I think you just broke puppet on labs instances [15:21:15] uh oh [15:21:16] Amir1: okie doke, doing [15:21:17] https://www.irccloud.com/pastebin/yLrY2MH7/ [15:21:24] RECOVERY - parsoid on wtp2009 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.176 second response time [15:21:24] oh i just merged something i had recently cherry picked [15:21:29] need to un rebase [15:21:29] ottomata: we're in a meeting, but can you investigate that paste? [15:21:32] on all labs?! [15:21:33] OH NO [15:21:34] ascii! [15:21:35] ok [15:21:37] Might not be you, just a stab in the dark [15:21:39] thanks :) [15:21:43] RECOVERY - puppet last run on wtp2009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:21:58] andrewbogott: Who cares about Labs, eh? ;-) [15:22:37] looking [15:23:40] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:303535|Add extendedconfirmed user group for fawiki (T140839)]] (duration: 00m 51s) [15:23:42] T140839: Add 'extendedconfirmed' right, group, and restriction on fa.wikipedia - https://phabricator.wikimedia.org/T140839 [15:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:23:47] ^ Amir1 live everywhere now [15:23:56] thanks [15:25:52] found it [15:26:00] option space so easily puts those in! 
[15:26:10] wonder if we can get jenkins to check for non ascii chars [15:27:45] I'm surprised puppetlint doesn't complain, to be honest [15:28:03] probably locale settings that are slightly different [15:28:42] andrewbogott: fixed and merged [15:28:45] 06Operations, 10hardware-requests: Replace/refresh carbon - https://phabricator.wikimedia.org/T137117#2533476 (10RobH) 05Open>03Resolved This system has arrived on site and is being setup via T139171. [15:28:49] ottomata: thanks! [15:29:16] 06Operations, 10hardware-requests: Replace/refresh carbon - https://phabricator.wikimedia.org/T137117#2533480 (10RobH) Or should this stay open for replacing the other aspects of carbon? (The install server side.) [15:29:53] RECOVERY - Puppet catalogue fetch on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 2.620 second response time [15:30:31] 06Operations, 06Discovery, 06Maps: Configure LVS in front of maps100? servers - https://phabricator.wikimedia.org/T142393#2533482 (10Gehel) [15:30:54] (03PS1) 10Gehel: LVS configuration for maps cluster in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/303559 (https://phabricator.wikimedia.org/T142393) [15:32:25] 06Operations, 06Operations-Software-Development: Package Python phabricator module for both Ubuntu Precise and Debian Jessie - https://phabricator.wikimedia.org/T142097#2533499 (10Volans) I've found that on copper (package builder host) the `libdistro-info-perl` package is missing. Sending a patch to fix it. [15:32:43] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.24 [15:33:25] ottomata: looks better, my test vm is working now. 
Thanks again [15:33:34] (03PS1) 10Volans: Package builder: ensure libdistro-info-perl is installed [puppet] - 10https://gerrit.wikimedia.org/r/303560 (https://phabricator.wikimedia.org/T142097) [15:34:18] who needs perl [15:34:23] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [15:35:00] mark: apparently only perl knows the Debian/Ubuntu release names ;) [15:36:34] PROBLEM - parsoid on wtp2004 is CRITICAL: Connection refused [15:36:53] PROBLEM - puppet last run on wtp2004 is CRITICAL: CRITICAL: puppet fail [15:40:49] 06Operations, 06Labs: Enable root passwords on Labs VMs - https://phabricator.wikimedia.org/T142216#2533527 (10faidon) BTW, [[ https://github.com/duritong/trocla | trocla ]] (and [[ https://github.com/duritong/puppet-trocla | puppet-trocla ]]) might also be of interest here. I haven't given it a close look, bu... [15:42:14] (03PS1) 10Gehel: Do not force merge (optimize) indices on logstash [puppet] - 10https://gerrit.wikimedia.org/r/303561 [15:42:41] akosiaris: Thanks for all your work on T135176; can it now be closed [15:42:41] T135176: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176 [15:43:51] (03CR) 10BryanDavis: Do not force merge (optimize) indices on logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/303561 (owner: 10Gehel) [15:45:06] (03PS2) 10Gehel: Do not force merge (optimize) indices on logstash [puppet] - 10https://gerrit.wikimedia.org/r/303561 [15:45:32] (03CR) 10Gehel: Do not force merge (optimize) indices on logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/303561 (owner: 10Gehel) [15:45:56] (03CR) 10BryanDavis: [C: 031] Do not force merge (optimize) indices on logstash [puppet] - 10https://gerrit.wikimedia.org/r/303561 (owner: 10Gehel) [15:46:58] (03CR) 10Gehel: [C: 032] Do not force merge (optimize) indices on logstash [puppet] - 10https://gerrit.wikimedia.org/r/303561 (owner: 10Gehel) [15:50:59] 
James_F: no, not yet. I am still doing the CODFW part [15:51:34] akosiaris: Ah, sorry. Thought the flapping earlier was it. :-) [15:51:39] 06Operations, 10Dumps-Generation: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#2533560 (10mark) [15:51:57] ACKNOWLEDGEMENT - puppet last run on logstash1001 is CRITICAL: CRITICAL: Puppet has 1 failures Gehel Issue with https://gerrit.wikimedia.org/r/#/c/303561/ fix coming up [15:51:57] ACKNOWLEDGEMENT - puppet last run on logstash1002 is CRITICAL: CRITICAL: Puppet has 1 failures Gehel Issue with https://gerrit.wikimedia.org/r/#/c/303561/ fix coming up [15:51:57] ACKNOWLEDGEMENT - puppet last run on logstash1003 is CRITICAL: CRITICAL: Puppet has 1 failures Gehel Issue with https://gerrit.wikimedia.org/r/#/c/303561/ fix coming up [15:53:27] 06Operations, 10Cassandra: xenon.eqiad.wmnet: very high cpu utilization - https://phabricator.wikimedia.org/T141675#2533562 (10Eevans) @Dzahn, @fgiunchedi I'm wondering: Should we go ahead and blacklist the `acpi_pad` module (as @fgiunchedi suggested in {T123924}), or close this issue (and T123924) and pretend... 
[15:55:04] (03PS1) 10Gehel: Do not force merge (optimize) indices on logstash [puppet] - 10https://gerrit.wikimedia.org/r/303563 [15:55:27] (03PS1) 10Merlijn van Deen: ops/puppet CI: check for non-ascii .pp files [puppet] - 10https://gerrit.wikimedia.org/r/303564 [15:55:50] * valhallasw`cloud is unsure whether jenkins will actually try to run that [15:56:21] (03PS2) 10Volans: Package builder: ensure libdistro-info-perl is installed [puppet] - 10https://gerrit.wikimedia.org/r/303560 (https://phabricator.wikimedia.org/T142097) [15:58:18] (03PS2) 10Merlijn van Deen: ops/puppet CI: check for non-ascii .pp files [puppet] - 10https://gerrit.wikimedia.org/r/303564 [15:58:23] RECOVERY - parsoid on wtp2004 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.149 second response time [15:58:33] 06Operations, 10Cassandra, 06Services, 10hardware-requests: 9x or 15x additional Cassandra/RESTBase nodes - https://phabricator.wikimedia.org/T139961#2533581 (10Eevans) During last weeks Ops-Services sync meeting, there was some mention of (soon-to-be )decommissioned Varnish machines in esams that we might... [15:58:43] RECOVERY - puppet last run on wtp2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:58:44] (03PS3) 10Volans: Package builder: ensure libdistro-info-perl is installed [puppet] - 10https://gerrit.wikimedia.org/r/303560 (https://phabricator.wikimedia.org/T142097) [15:58:51] <_joe_> valhallasw`cloud: uhm, I don't think we really need that [15:59:04] <_joe_> IIRC, it's just a stupid setting to change [15:59:19] (03CR) 10jenkins-bot: [V: 04-1] ops/puppet CI: check for non-ascii .pp files [puppet] - 10https://gerrit.wikimedia.org/r/303564 (owner: 10Merlijn van Deen) [16:00:52] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Why check for non-ascii files? 
I'd check for non-utf8 files instead; the issue with puppet and non-ascii files is just matter of properly " [puppet] - 10https://gerrit.wikimedia.org/r/303564 (owner: 10Merlijn van Deen) [16:01:08] (03CR) 10Gehel: [C: 032] Do not force merge (optimize) indices on logstash [puppet] - 10https://gerrit.wikimedia.org/r/303563 (owner: 10Gehel) [16:01:33] (03CR) 10Alexandros Kosiaris: [C: 031] Package builder: ensure libdistro-info-perl is installed [puppet] - 10https://gerrit.wikimedia.org/r/303560 (https://phabricator.wikimedia.org/T142097) (owner: 10Volans) [16:03:45] (03CR) 10Merlijn van Deen: "I'd also be happy with someone fixing the labs puppetmasters (which includes several non-centralised ones). However, the current situation" [puppet] - 10https://gerrit.wikimedia.org/r/303564 (owner: 10Merlijn van Deen) [16:04:10] 06Operations, 06Services, 06Services-next, 05Security: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2533599 (10GWicke) [16:05:00] PROBLEM - Check size of conntrack table on wtp2003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:05:19] PROBLEM - DPKG on wtp2003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:05:30] PROBLEM - Disk space on wtp2003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:05:40] PROBLEM - MD RAID on wtp2003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:06:30] PROBLEM - configured eth on wtp2003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:06:50] PROBLEM - dhclient process on wtp2003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
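[editor's note] The two checks debated above for change 303564 (reject any non-ASCII byte in .pp files, versus _joe_'s suggestion to reject only files that are not valid UTF-8) could be sketched roughly as follows. This is not the content of the actual Gerrit patch: the function names and the `scan_manifests` helper are hypothetical, and it assumes the offending character was a non-breaking space (U+00A0), which is what Option+Space usually produces on a Mac keyboard.

```python
from pathlib import Path


def find_non_ascii(data: bytes):
    """Return 1-based (line, column) positions of every non-ASCII byte."""
    hits = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        for col, byte in enumerate(line, start=1):
            if byte > 0x7F:  # anything outside 7-bit ASCII
                hits.append((lineno, col))
    return hits


def is_valid_utf8(data: bytes) -> bool:
    """The looser check: accept non-ASCII as long as it decodes as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def scan_manifests(root: str):
    """Hypothetical helper: map each *.pp file under root to its non-ASCII positions."""
    report = {}
    for path in sorted(Path(root).rglob("*.pp")):
        hits = find_non_ascii(path.read_bytes())
        if hits:
            report[str(path)] = hits
    return report
```

A manifest containing a UTF-8 encoded non-breaking space would fail the first check but pass the second, which is the trade-off the review comments are arguing about.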
[16:06:54] <_joe_> valhallasw`cloud: I am willing to fix that, I am pretty sure we fixed it for labs puppetmasters too [16:07:00] <_joe_> not sure about self-hosted ones [16:07:09] PROBLEM - parsoid on wtp2003 is CRITICAL: Connection refused [16:07:20] PROBLEM - puppet last run on wtp2003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:07:40] PROBLEM - salt-minion processes on wtp2003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:08:17] _joe_: odd -- tools hosts on the central labs puppetmaster broke due to that non-ascii character [16:08:24] (03PS1) 10Yuvipanda: labstore: Configure LDAP failover + timeout for create-dbuser [puppet] - 10https://gerrit.wikimedia.org/r/303565 (https://phabricator.wikimedia.org/T142394) [16:09:24] (03CR) 10Giuseppe Lavagetto: "FTR, issue was tracked here: https://tickets.puppetlabs.com/browse/PUP-1386" [puppet] - 10https://gerrit.wikimedia.org/r/303564 (owner: 10Merlijn van Deen) [16:09:34] (03PS1) 10Gehel: Do not force merge (optimize) indices on logstash [puppet] - 10https://gerrit.wikimedia.org/r/303567 [16:10:12] (03PS1) 10Ottomata: Eventlogging analytics kafka group ids have changed. Fix burrow lag monitoring [puppet] - 10https://gerrit.wikimedia.org/r/303568 [16:13:26] (03CR) 10Ottomata: [C: 032] Eventlogging analytics kafka group ids have changed. 
Fix burrow lag monitoring [puppet] - 10https://gerrit.wikimedia.org/r/303568 (owner: 10Ottomata) [16:15:01] (03CR) 10BryanDavis: [C: 031] "`sudo crontab -l` shows the crons have been removed on logstash100[1-3]" [puppet] - 10https://gerrit.wikimedia.org/r/303567 (owner: 10Gehel) [16:16:04] (03PS2) 10Gehel: Do not force merge (optimize) indices on logstash [puppet] - 10https://gerrit.wikimedia.org/r/303567 [16:17:54] (03PS2) 10Yuvipanda: labstore: Configure LDAP failover + timeout for create-dbuser [puppet] - 10https://gerrit.wikimedia.org/r/303565 (https://phabricator.wikimedia.org/T142394) [16:18:33] (03CR) 10Gehel: [C: 032] Do not force merge (optimize) indices on logstash [puppet] - 10https://gerrit.wikimedia.org/r/303567 (owner: 10Gehel) [16:22:08] 06Operations, 06Editing-Analysis: Connection time out to stat1003 - https://phabricator.wikimedia.org/T142126#2533713 (10HJiang-WMF) My ssh config file is here: ForwardAgent no Host bast4001.wikimedia.org # Direct connection for the bastion host ProxyCommand none ControlMaster auto Host *.wikim... 
[16:24:08] (03PS4) 10Volans: Package builder: ensure libdistro-info-perl is installed [puppet] - 10https://gerrit.wikimedia.org/r/303560 (https://phabricator.wikimedia.org/T142097) [16:25:15] bblack: should we populate the cache-status (hit /miss) with the regex you gave us: [in this order: /hit/ => hit, /int/ => int, /pass,[^,]+$/ => pass, /miss/ => miss, else it's unknown (a bug?)] [16:26:24] (03CR) 10Volans: [C: 032] Package builder: ensure libdistro-info-perl is installed [puppet] - 10https://gerrit.wikimedia.org/r/303560 (https://phabricator.wikimedia.org/T142097) (owner: 10Volans) [16:27:04] (03PS3) 10Yuvipanda: labstore: Configure LDAP failover + timeout for create-dbuser [puppet] - 10https://gerrit.wikimedia.org/r/303565 (https://phabricator.wikimedia.org/T142394) [16:30:16] (03PS4) 10Yuvipanda: labstore: Configure LDAP failover + timeout for create-dbuser [puppet] - 10https://gerrit.wikimedia.org/r/303565 (https://phabricator.wikimedia.org/T142394) [16:30:20] RECOVERY - configured eth on wtp2003 is OK: OK - interfaces up [16:30:40] RECOVERY - dhclient process on wtp2003 is OK: PROCS OK: 0 processes with command name dhclient [16:30:40] RECOVERY - Check size of conntrack table on wtp2003 is OK: OK: nf_conntrack is 0 % full [16:30:50] RECOVERY - parsoid on wtp2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.105 second response time [16:31:00] RECOVERY - DPKG on wtp2003 is OK: All packages OK [16:31:11] RECOVERY - puppet last run on wtp2003 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [16:31:19] RECOVERY - Disk space on wtp2003 is OK: DISK OK [16:31:29] RECOVERY - MD RAID on wtp2003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [16:31:30] RECOVERY - salt-minion processes on wtp2003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:32:35] nuria_: we could do that in VCL and varnishkafka config. 
I've been hesitant to mess with it because I don't want to break anything for hadoop by introducing a new set of values. [16:32:55] (03CR) 10jenkins-bot: [V: 04-1] labstore: Configure LDAP failover + timeout for create-dbuser [puppet] - 10https://gerrit.wikimedia.org/r/303565 (https://phabricator.wikimedia.org/T142394) (owner: 10Yuvipanda) [16:33:23] nuria_: but if you think it's safe to do so, it's not fundmentally difficult. [16:34:46] (03CR) 10Faidon Liambotis: [C: 04-1] labstore: Configure LDAP failover + timeout for create-dbuser (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/303565 (https://phabricator.wikimedia.org/T142394) (owner: 10Yuvipanda) [16:36:53] (03PS1) 10MarcoAurelio: IP cap lift for UN Women Editathon in NYC, Aug 12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303571 (https://phabricator.wikimedia.org/T142396) [16:46:58] (03PS5) 10Yuvipanda: labstore: Configure LDAP failover + timeout for create-dbuser [puppet] - 10https://gerrit.wikimedia.org/r/303565 (https://phabricator.wikimedia.org/T142394) [16:46:58] 06Operations, 06Discovery, 10Elasticsearch, 10netops, 03Discovery-Search-Sprint: Enable access to relforge clusters from virtual machines running on labs - https://phabricator.wikimedia.org/T142211#2533836 (10ksmith) [16:49:38] 06Operations, 10Ops-Access-Requests, 10Analytics: Add analytics team members to group aqs-admins to be able to deploy pageview APi - https://phabricator.wikimedia.org/T142101#2522728 (10Ottomata) This was talked about in ops meeting today. Ops would prefer that we create another group, `deploy-aqs` perhaps,... 
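[editor's note] The ordered matching nuria_ quotes at 16:25:15 (/hit/ => hit, /int/ => int, /pass,[^,]+$/ => pass, /miss/ => miss, else unknown) can be sketched as below. This is only an illustration of first-match-wins classification, not the VCL or varnishkafka change bblack later uploaded. The function name, the rule table, and the sample header strings in the usage note are hypothetical.

```python
import re

# Ordered rules as quoted in the channel; the first pattern that matches wins.
# Assumption: the input is a raw X-Cache style string, one entry per cache layer.
RULES = [
    ("hit", re.compile(r"hit")),
    ("int", re.compile(r"int")),
    # "pass" only counts when it sits in the second-to-last comma-separated entry.
    ("pass", re.compile(r"pass,[^,]+$")),
    ("miss", re.compile(r"miss")),
]


def classify_cache_status(x_cache: str) -> str:
    """Return hit, int, pass, miss, or unknown for a header string."""
    for label, pattern in RULES:
        if pattern.search(x_cache):
            return label
    # Nothing matched: the channel message suggests this may indicate a bug.
    return "unknown"
```

With made-up inputs, classify_cache_status("cp1 miss, cp2 hit/3") gives "hit" and classify_cache_status("cp1 pass, cp2 miss") gives "pass". Because the patterns are unanchored substrings, a hostname that happens to contain "hit" or "int" would be misclassified, so a real implementation would probably match on the status token only.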
[16:50:10] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [16:50:57] 06Operations, 10Ops-Access-Requests, 10Analytics: Add analytics team members to group aqs-admins to be able to deploy pageview APi - https://phabricator.wikimedia.org/T142101#2533860 (10Ottomata) Hm, there is already an `aqs-users` group. Should we reuse it for this? [16:51:00] 06Operations, 10Analytics, 10Monitoring: Switch jmxtrans from statsd to graphite line protocol - https://phabricator.wikimedia.org/T73322#772448 (10Milimetric) @elukey we put this on Q2, let me know if it should be earlier. [16:52:10] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [16:52:48] (03PS3) 10MarcoAurelio: Cleanup of IP throttles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303312 [16:53:25] !log T140008: Starting major compaction, restbase2001-a.codfw.wmnet [16:53:26] T140008: High RESTBase storage utilization - https://phabricator.wikimedia.org/T140008 [16:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:55:11] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp2003.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid']) [16:55:14] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp2004.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid']) [16:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:55:16] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp2005.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid']) [16:55:17] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp2006.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid']) [16:55:19] !log akosiaris@palladium conftool action : set/pooled=yes; 
selector: wtp2007.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid']) [16:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:55:22] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp2008.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid']) [16:55:24] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp2009.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid']) [16:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:55:26] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp2010.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid']) [16:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:55:28] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp2011.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid']) [16:55:30] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp2012.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid']) [16:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:55:32] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp2013.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid']) [16:55:34] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp2014.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid']) [16:55:35] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp2015.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid']) [16:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:55:37] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp2016.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid']) 
[16:55:39] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp2017.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid']) [16:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:55:41] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp2018.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid']) [16:55:42] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp2019.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid']) [16:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:55:44] !log akosiaris@palladium conftool action : set/pooled=yes; selector: wtp2020.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid']) [16:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:55:52] (03PS1) 10Nemo bis: Finish removing MoodBar, including nl.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303573 (https://phabricator.wikimedia.org/T131340) [16:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:56:07] (03PS4) 10MarcoAurelio: User rights configuration changes for Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302974 (https://phabricator.wikimedia.org/T142123) [16:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:56:20] (03PS4) 10MarcoAurelio: Rename 
'autoreview' to 'autopatrolled' on mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302987 [16:56:23] (03PS3) 10MarcoAurelio: Cleaning a bit frwiktionary's botadmin rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302990 [16:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:56:25] (03PS2) 10MarcoAurelio: Remove 'centralauth-usermerge' from stewards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303150 [16:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:56:41] Bot wars II. - Attack of the bots [16:57:00] (03Abandoned) 10Jforrester: Finish removing MoodBar, including nl.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303573 (https://phabricator.wikimedia.org/T131340) (owner: 10Nemo bis) [16:59:17] (03CR) 10MarcoAurelio: [C: 031] De-deploy the MoodBar extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280624 (https://phabricator.wikimedia.org/T131340) (owner: 10Catrope) [16:59:24] (03PS4) 10ArielGlenn: Make scheduler hupable. [dumps] - 10https://gerrit.wikimedia.org/r/302831 [16:59:40] (03CR) 10jenkins-bot: [V: 04-1] Make scheduler hupable. [dumps] - 10https://gerrit.wikimedia.org/r/302831 (owner: 10ArielGlenn) [17:00:04] gehel: Respected human, time to deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160808T1700). Please do the needful. [17:00:04] gehel and SMalyshev: A patch you scheduled for Weekly Wikidata query service deployment window is about to be deployed. Please be available during the process. 
[17:00:14] o/ [17:00:40] (03CR) 10Nemo bis: "Would be easier to abandon the other way, since the other patch doesn't merge" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303573 (https://phabricator.wikimedia.org/T131340) (owner: 10Nemo bis) [17:01:20] 06Operations, 06Services, 06Services-next, 05Security: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2533950 (10MoritzMuehlenhoff) Chrome/Chromium has a very fast-moving release cycle with updates every few weeks (and sometimes even wit... [17:02:25] bblack: populating the right info in cache-status should be fine as field is just a string, it could have any value. if it is easy to send the right value on your end. [17:03:49] !log akosiaris@palladium conftool action : set/pooled=no; selector: wtp2001.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid']) [17:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:04:50] 06Operations, 06Services, 06Services-next, 05Security: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2533969 (10GWicke) @MoritzMuehlenhoff, that's one of the reasons why Electron is distributing their own binary build of the exact Chrom... [17:06:50] nuria_: ok, will look at it today [17:08:16] bblack: https://phabricator.wikimedia.org/T142410 [17:08:37] 06Operations, 10Analytics, 10Traffic: Correct cache_status field on webrequest dataset - https://phabricator.wikimedia.org/T142410#2533998 (10Nuria) [17:09:03] !log updating WDQS to latest and service restart [17:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:10:38] SMalyshev: wdqs-test and wdqs prod updated, feel free to test... 
[17:10:52] SMalyshev: looks good from my point of view [17:11:48] (03PS6) 10Yuvipanda: labstore: Configure LDAP failover + timeout for create-dbuser [puppet] - 10https://gerrit.wikimedia.org/r/303565 (https://phabricator.wikimedia.org/T142394) [17:12:41] robh moritzm alerted me about https://phabricator.wikimedia.org/T137924#2525923 earlier. anything I need to do to get that decomm'd? [17:13:24] do you have any data on it needing saving or can i just reclaim and wipe? [17:14:02] if you dont need anythign off it, i can make the task and take care of it. [17:15:33] (03PS5) 10ArielGlenn: Make scheduler hupable. [dumps] - 10https://gerrit.wikimedia.org/r/302831 [17:21:37] robh yeah, you can wipe [17:21:45] cool, i'll take care of it [17:21:50] robh <3 thank you [17:21:51] (well, the tasks and such ;) [17:22:01] ha, robh! [17:22:09] i had a question [17:22:44] (03PS1) 10BBlack: VCL: remove legacy X-Cache handling [puppet] - 10https://gerrit.wikimedia.org/r/303577 [17:22:46] (03PS1) 10BBlack: VCL: emit X-Cache-Status response header [puppet] - 10https://gerrit.wikimedia.org/r/303578 (https://phabricator.wikimedia.org/T142410) [17:22:48] (03PS1) 10BBlack: vk webrequest: use X-Cache-Status for cache_status [puppet] - 10https://gerrit.wikimedia.org/r/303579 (https://phabricator.wikimedia.org/T142410) [17:23:11] robh: https://phabricator.wikimedia.org/project/profile/1660/ vs. https://phabricator.wikimedia.org/T118176#2221652 [17:23:45] 06Operations, 10hardware-requests: reclaim WMF4724 to spares - https://phabricator.wikimedia.org/T142412#2534075 (10RobH) [17:23:59] Danny_B: What is your question? [17:24:30] right now we're handling it in mailman, so im guessing you are askign why the project exists? (but not sure) [17:24:46] the project seems to be inactive. then i saw the linked discussion. so can it be archived? 
[17:25:12] I think so yes, but lemme grep around and make sure [17:25:17] thx [17:25:33] i didnt recall it existed, thx for pointing out =] [17:25:42] (no tasks hit it, so it never shows in my feed of alerts) [17:25:56] the board is empty [17:26:15] yuvipanda: just https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Reclaim_or_Decommission I guess [17:26:25] (but if tasks there use private space then i just don't see them) [17:26:28] yeah, its now archived [17:26:35] I noticed that host on servermon in where it was shown as in need of updates [17:26:44] there was a private space involved too, im handling it (there werent tasks in it though) [17:27:35] thank you. re the space: please update https://phabricator.wikimedia.org/T138677 if something relevant [17:27:41] simplifying our phab projects/spaces ftw. [17:28:47] 06Operations, 06Editing-Analysis: Connection time out to stat1003 - https://phabricator.wikimedia.org/T142126#2534098 (10HJiang-WMF) 05Open>03Resolved a:03HJiang-WMF Issue resolved. Thanks everyone! Turns out that I accidentally overwrote the alias for bast4001 and that is the root cause of the hung-up s... [17:30:29] 06Operations, 06Operations-Software-Development, 10Phabricator: Package Python phabricator module for both Ubuntu Precise and Debian Jessie - https://phabricator.wikimedia.org/T142097#2534107 (10greg) Others watching #phabricator will be interested :) Glad to see this! 
[17:31:14] 06Operations, 10Fundraising-Backlog, 10fundraising-tech-ops, 13Patch-For-Review: Allow Fundraising to A/B test wikipedia.org as send domain - https://phabricator.wikimedia.org/T135410#2534111 (10Jgreen) 05Open>03Resolved [17:32:00] 06Operations, 06Operations-Software-Development, 10Packaging, 10Phabricator: Package Python phabricator module for both Ubuntu Precise and Debian Jessie - https://phabricator.wikimedia.org/T142097#2534116 (10greg) [17:33:07] robh: lol, you overwrote my changes [17:33:10] nvm [17:35:23] ahh, sorry =[ [17:35:32] edit conflict on phab isnt as nice as mediawiki! [17:35:39] we do something better than someone else! \o/ [17:36:03] there is absolutely no notification of something happened in the meantime... phab is way so stupid in this [17:36:16] indeed, this is only the second time its happened to me but it sucks [17:36:34] i suppose they expect most folks to comment, not go editing the original task [17:36:36] robh: if doing something better shocks you, there's something wrong :P [17:36:54] SPF|Cloud: not shocked, merely rejoicing in our superiority ;D [17:37:21] to be clear, im not bashing phab overall though. i love it. [17:37:57] i love everything about our merging multiple ticketing systems into a single more cohesive service that can have more eyeballs on it. like Danny_B cleaning up our forgotten ops projects =] [17:37:59] how big is the Sun Protect Factor of the cloud? ;-) [17:42:01] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002/stat1004 for Jdlrobson - https://phabricator.wikimedia.org/T141811#2534152 (10Jdlrobson) >>! In T141811#2516851, @Ottomata wrote: > If you want webrequest access logs, then you need to be in the `analytics-privatedata-users` group. Yes I do wa... 
[17:42:05] robh: if you are into some operations cleanup, i have couple more small things [17:43:02] feel free to lsit and i can look [17:43:10] hrmm, someone told me how to fix the darn topic thing last week [17:43:17] and i wrote it down on a text document that i now cannot find. [17:45:09] robh: https://phabricator.wikimedia.org/T141703 -> archive H22 like H17-H21? how about H16? [17:45:41] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002/stat1004 for Jdlrobson - https://phabricator.wikimedia.org/T141811#2534160 (10dr0ptp4kt) Approved [17:45:58] I wish it showed who created the heralds and why, heh [17:46:05] so h22 is netops, i wouldnt touch that [17:46:31] unless they are fine with it. i happen to strongly dislike auto tagging projects based on text in the fields. [17:47:02] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review, 15User-mobrovac: Allow *-admin groups to see systemd logs for their units - https://phabricator.wikimedia.org/T137878#2534167 (10mobrovac) 05Resolved>03Open Hm, this patch doesn't actually seem to work. All of the services' `syslog.log` files are... [17:47:19] same with h16. im pretty sure i likely archived h17-21 since i admin the onsite projects a lot [17:48:36] So yeah, I wouldn't feel comfortable making the decision to kill those entirely without some kind of operations teams input. The netops folks for the H22 or h16. [17:48:45] or h16 have the ops team know. [17:49:11] luckily the netops right now happen to be the team lead and manager for ops core and ops overall =] [17:49:13] robh: would you please handle that input? 
[17:50:08] text match based herald rules are imho the biggest slowdowners [17:51:04] robh: in addition to your steps, i archived https://phabricator.wikimedia.org/H113 [17:51:17] ahh, yeah [17:51:33] i'll email the tema about these and see if anyone has objections or what [17:52:13] (03PS1) 10Jforrester: Enable VisualEditor by default for logged-in users on Arabic-script Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303586 (https://phabricator.wikimedia.org/T93387) [17:52:15] (03PS1) 10Jforrester: Enable VisualEditor by default for logged-out users on Arabic-script Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303587 (https://phabricator.wikimedia.org/T93387) [17:52:57] 06Operations, 06Editing-Analysis: Connection time out to stat1003 - https://phabricator.wikimedia.org/T142126#2534194 (10Dzahn) @HJiang-WMF Glad to hear that it works now. And thanks for using the ticket to debug this! [17:54:24] Danny_B: i emailed the ops team about those two heralds [17:54:26] we'll see [17:56:31] 06Operations, 10Cassandra: xenon.eqiad.wmnet: very high cpu utilization - https://phabricator.wikimedia.org/T141675#2534202 (10Dzahn) Probably something in between. I would say close it now and if it never happens again that is it. But if it happens again reopen and continue with the blacklisting. [17:58:34] (03PS1) 10BBlack: ciphersuite: remove DHE+chapoly-draft [puppet] - 10https://gerrit.wikimedia.org/r/303589 (https://phabricator.wikimedia.org/T131908) [17:58:36] (03PS1) 10BBlack: ciphersuite: remove CBC AES256-SHA options from "mid" [puppet] - 10https://gerrit.wikimedia.org/r/303590 [18:00:05] robh: thx [18:00:15] i dislike those herald rules so much. [18:00:26] if i want a project on a task, i'd add it in the project field! [18:00:43] robh: there is bunch #operations related rules [18:00:47] the herald rules that apply umbrella projects to other projects make more sense. 
[18:00:59] yeah, there are rules that add #operations to everything under it [18:01:07] i combed a bunch of them [18:01:11] combined even [18:01:16] well, "team" is obviously being considered as an "umbrella" too ;-) [18:01:20] they were all independent rules to do the same thing, heh [18:01:46] (03CR) 10BBlack: [C: 032 V: 032] ciphersuite: remove DHE+chapoly-draft [puppet] - 10https://gerrit.wikimedia.org/r/303589 (https://phabricator.wikimedia.org/T131908) (owner: 10BBlack) [18:01:50] yeah, i'm trying to consolidate them now, as they slow down phab quite significantly [18:02:50] (03CR) 10BBlack: [C: 032 V: 032] ciphersuite: remove CBC AES256-SHA options from "mid" [puppet] - 10https://gerrit.wikimedia.org/r/303590 (owner: 10BBlack) [18:03:11] I have no actual reason to think that having the single H15 rule for all operations sub-projects is faster than having a seperate rule for every sub-project [18:03:15] but its easier for humans to parse. [18:04:26] (03PS2) 10BBlack: VCL: remove legacy X-Cache handling [puppet] - 10https://gerrit.wikimedia.org/r/303577 [18:04:35] (03CR) 10BBlack: [C: 032 V: 032] VCL: remove legacy X-Cache handling [puppet] - 10https://gerrit.wikimedia.org/r/303577 (owner: 10BBlack) [18:06:37] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [18:07:09] ACKNOWLEDGEMENT - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 185 bytes in 1.675 second response time Yuvi Panda One instance hung, people are looking. 
[18:08:35] idk how about multiple vs single project matching, but as i mentioned earlier, i am pretty sure that fulltext search is the biggest slowdowner [18:08:37] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [18:08:52] expecially when uising really wild regexps [18:08:56] As long as it's not me. [18:09:11] heh, sorry ;-) [18:09:16] No problem ;) [18:09:28] is it actually proper english term? [18:10:25] It's fine to use [18:11:22] (03PS5) 10Volans: Monitoring: Add NRPE commands to get RAID status [puppet] - 10https://gerrit.wikimedia.org/r/303147 (https://phabricator.wikimedia.org/T142085) [18:25:17] PROBLEM - check_puppetrun on payments1003 is CRITICAL: CRITICAL: Puppet has 2 failures [18:27:12] (03CR) 10Dpatrick: [C: 032] Support scaling of huge SVGs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303548 (https://phabricator.wikimedia.org/T111815) (owner: 10Muehlenhoff) [18:29:42] 06Operations, 06Services, 07Parsoid-Tests, 15User-mobrovac: Use a different logging & metrics tag (name property) for Parsoid testing on ruthenium - https://phabricator.wikimedia.org/T141464#2534383 (10mobrovac) [18:30:13] !log restarting kafka broker on kafka1013 to test eventlogging leader rebalances [18:30:17] PROBLEM - check_puppetrun on payments1003 is CRITICAL: CRITICAL: Puppet has 2 failures [18:30:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet has 1 failures [18:30:17] PROBLEM - check_puppetrun on payments1002 is CRITICAL: CRITICAL: Puppet has 2 failures [18:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:31:46] ^^^ i see that, it's because apt is stupid. [18:33:11] Is it so critical that it is double critical? [18:33:42] it should probably be "utterly dire . . . not" or something. 
[18:35:17] PROBLEM - check_puppetrun on payments1003 is CRITICAL: CRITICAL: Puppet has 2 failures [18:35:17] RECOVERY - check_puppetrun on payments1001 is OK: OK: Puppet is currently enabled, last run 175 seconds ago with 0 failures [18:35:17] RECOVERY - check_puppetrun on payments1002 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [18:40:04] (03PS1) 10Mobrovac: service::node: Allow users to specify the logging name [puppet] - 10https://gerrit.wikimedia.org/r/303599 (https://phabricator.wikimedia.org/T141464) [18:40:17] RECOVERY - check_puppetrun on payments1003 is OK: OK: Puppet is currently enabled, last run 244 seconds ago with 0 failures [18:45:48] (03CR) 10Mobrovac: "PCC loking good - https://puppet-compiler.wmflabs.org/3626/" [puppet] - 10https://gerrit.wikimedia.org/r/303599 (https://phabricator.wikimedia.org/T141464) (owner: 10Mobrovac) [18:46:08] (03PS2) 10Yuvipanda: tools: Don't include etcd in k8s master [puppet] - 10https://gerrit.wikimedia.org/r/303405 [18:46:13] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Don't include etcd in k8s master [puppet] - 10https://gerrit.wikimedia.org/r/303405 (owner: 10Yuvipanda) [18:46:23] (03PS7) 10Yuvipanda: labstore: Configure LDAP failover + timeout for create-dbuser [puppet] - 10https://gerrit.wikimedia.org/r/303565 (https://phabricator.wikimedia.org/T142394) [18:46:28] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Configure LDAP failover + timeout for create-dbuser [puppet] - 10https://gerrit.wikimedia.org/r/303565 (https://phabricator.wikimedia.org/T142394) (owner: 10Yuvipanda) [18:48:43] !log krinkle@tin Synchronized php-1.28.0-wmf.13/extensions/WikimediaEvents/modules/ext.wikimediaEvents.rlfeature.js: T141344 Track JSON support (duration: 00m 47s) [18:48:44] T141344: Remove JSON polyfill - https://phabricator.wikimedia.org/T141344 [18:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:50:50] (03PS2) 10Yuvipanda: labstore: Break dependency 
cycle [puppet] - 10https://gerrit.wikimedia.org/r/303604 [18:55:52] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Break dependency cycle [puppet] - 10https://gerrit.wikimedia.org/r/303604 (owner: 10Yuvipanda) [18:56:01] (03CR) 10Yuvipanda: labstore: Break dependency cycle [puppet] - 10https://gerrit.wikimedia.org/r/303604 (owner: 10Yuvipanda) [18:56:27] (03CR) 10GWicke: "Is there a reason to configure the log name separately from the service `name` property? The latter should be the default for both logging" [puppet] - 10https://gerrit.wikimedia.org/r/303599 (https://phabricator.wikimedia.org/T141464) (owner: 10Mobrovac) [18:58:45] (03PS3) 10Yuvipanda: labstore: Break dependency cycle [puppet] - 10https://gerrit.wikimedia.org/r/303604 [18:59:10] (03CR) 10Mobrovac: "Not sure I understand the question. The name is drawn from $title, but, as stated in the linked bug, the logging name and statsd prefix ne" [puppet] - 10https://gerrit.wikimedia.org/r/303599 (https://phabricator.wikimedia.org/T141464) (owner: 10Mobrovac) [19:00:23] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Break dependency cycle [puppet] - 10https://gerrit.wikimedia.org/r/303604 (owner: 10Yuvipanda) [19:00:29] !log restarting kafka broker on 1013 to test more eventlogging rebalances [19:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:05:15] (03PS13) 10Aaron Schulz: Switched to pt-heartbeat lag detection on s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266) [19:08:09] (03CR) 10Aaron Schulz: [C: 032] Switched to pt-heartbeat lag detection on s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266) (owner: 10Aaron Schulz) [19:08:24] (03PS1) 10Yuvipanda: tools: Use LDAP servers in HA manner for maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/303607 (https://phabricator.wikimedia.org/T142394) [19:08:34] (03Merged) 10jenkins-bot: Switched 
to pt-heartbeat lag detection on s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266) (owner: 10Aaron Schulz) [19:09:13] (03PS2) 10Yuvipanda: tools: Use LDAP servers in HA manner for maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/303607 (https://phabricator.wikimedia.org/T142394) [19:11:28] !log restarting kafka broker on 1013 to test more eventlogging rebalances [19:12:42] !log aaron@tin Synchronized wmf-config/db-codfw.php: Switched to pt-heartbeat lag detection on s6 (duration: 00m 51s) [19:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:13:47] !log aaron@tin Synchronized wmf-config/db-eqiad.php: Switched to pt-heartbeat lag detection on s6 (duration: 00m 53s) [19:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:15:22] !log aaron@tin Synchronized tests: (no message) (duration: 00m 50s) [19:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:15:27] (03PS3) 10Yuvipanda: tools: Use LDAP servers in HA manner for maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/303607 (https://phabricator.wikimedia.org/T142394) [19:17:05] (03PS3) 10Aaron Schulz: Enable MASTER_GTID_WAIT() on s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302635 (https://phabricator.wikimedia.org/T135027) [19:17:56] (03PS4) 10Yuvipanda: tools: Use LDAP servers in HA manner for maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/303607 (https://phabricator.wikimedia.org/T142394) [19:18:14] (03CR) 10Aaron Schulz: [C: 032] Enable MASTER_GTID_WAIT() on s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302635 (https://phabricator.wikimedia.org/T135027) (owner: 10Aaron Schulz) [19:18:21] !log restarting kafka broker on kafka1022 to test more eventlogging rebalances [19:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master 
[19:18:39] (03Merged) 10jenkins-bot: Enable MASTER_GTID_WAIT() on s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302635 (https://phabricator.wikimedia.org/T135027) (owner: 10Aaron Schulz) [19:19:26] (03Abandoned) 10Yuvipanda: tools: Add timeout to ldap connection for maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/300741 (https://phabricator.wikimedia.org/T141203) (owner: 10Yuvipanda) [19:19:53] !log aaron@tin Synchronized wmf-config/db-codfw.php: Enable MASTER_GTID_WAIT() on s6 (duration: 00m 48s) [19:19:56] (03CR) 10Yuvipanda: [C: 032] tools: Use LDAP servers in HA manner for maintain-kubeusers [puppet] - 10https://gerrit.wikimedia.org/r/303607 (https://phabricator.wikimedia.org/T142394) (owner: 10Yuvipanda) [19:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:20:51] !log aaron@tin Synchronized wmf-config/db-eqiad.php: Enable MASTER_GTID_WAIT() on s6 (duration: 00m 48s) [19:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:21:48] 06Operations, 06Labs: Enable root passwords on Labs VMs - https://phabricator.wikimedia.org/T142216#2534618 (10Andrew) Oh, I did not know about generate() -- that's clearly better than what I was imagining. Encryption-wise... I can't convince myself that it's useful. Root access to the puppetmaster already c... 
[19:23:49] PROBLEM - statsv process on hafnium is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args statsv [19:28:27] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] [19:30:09] 06Operations: Emails dropping ffrom greenhouse to Alan - https://phabricator.wikimedia.org/T142427#2534665 (10Krenair) [19:32:08] RECOVERY - statsv process on hafnium is OK: PROCS OK: 13 processes with command name python, args statsv [19:32:14] (03PS1) 10Ottomata: Increase retries for eventlogging analytics processor kafka producers [puppet] - 10https://gerrit.wikimedia.org/r/303610 (https://phabricator.wikimedia.org/T141285) [19:35:11] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 05MW-1.28-release-notes, and 4 others: Create Wikipedia Tulu - https://phabricator.wikimedia.org/T140898#2534686 (10Dzahn) [19:35:15] 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 2 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2534687 (10Dzahn) [19:36:38] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [19:41:46] (03PS2) 10Ottomata: Increase retries for eventlogging analytics processor kafka producers [puppet] - 10https://gerrit.wikimedia.org/r/303610 (https://phabricator.wikimedia.org/T141285) [19:42:02] (03CR) 10Ottomata: [C: 032 V: 032] Increase retries for eventlogging analytics processor kafka producers [puppet] - 10https://gerrit.wikimedia.org/r/303610 (https://phabricator.wikimedia.org/T141285) (owner: 10Ottomata) [19:42:38] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] [19:45:14] 06Operations, 10Analytics-Cluster: Queries in Hue always return an empty result set - https://phabricator.wikimedia.org/T128039#2534730 (10Ottomata) 
a:05Ottomata>03None [19:46:53] 06Operations: Analytics nodes are sending non usable metrics to ganglia occasionally - https://phabricator.wikimedia.org/T94671#2534740 (10Ottomata) 05Open>03Invalid We don't send jmx stats to ganglia anymore :) [19:48:30] 06Operations: fuse-dfs problems on stat1002 - https://phabricator.wikimedia.org/T121492#2534747 (10Ottomata) 05Open>03Resolved [19:49:31] 06Operations, 10Analytics-Cluster: Clean up permissions for privatedata files on stat1002 - they should be group readable by statistics-privatedata-users - https://phabricator.wikimedia.org/T89887#2534751 (10Ottomata) a:05Ottomata>03None [19:49:57] 06Operations, 10Analytics-EventLogging: deploy eventlog2001 services - https://phabricator.wikimedia.org/T93220#2534753 (10Ottomata) a:05Ottomata>03None [19:50:15] !log aaron@tin Synchronized php-1.28.0-wmf.13/includes/db/DatabaseMysqlBase.php: 267c62a530530e (duration: 00m 48s) [19:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:50:31] 06Operations, 10Analytics-EventLogging: deploy eventlog2001 services - https://phabricator.wikimedia.org/T93220#1132416 (10Ottomata) 05Open>03declined Setting up eventlogging in codfw doesn't seem very useful, since we won't be setting up the analytics cluster there. Declining this for now. 
[19:50:52] 06Operations, 10Analytics, 10Analytics-Cluster, 10Traffic: Enable Kafka native TLS in 0.9 and secure the kafka traffic with it - https://phabricator.wikimedia.org/T121561#2534758 (10Ottomata) a:05Ottomata>03None [19:52:30] (03PS1) 10Andrew Bogott: Create root passwords for labs instances and store passwords on the puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/303617 (https://phabricator.wikimedia.org/T142216) [19:53:41] (03CR) 10jenkins-bot: [V: 04-1] Create root passwords for labs instances and store passwords on the puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/303617 (https://phabricator.wikimedia.org/T142216) (owner: 10Andrew Bogott) [19:54:37] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [19:55:10] (03PS2) 10Andrew Bogott: Create root passwords for labs instances and store passwords on the puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/303617 (https://phabricator.wikimedia.org/T142216) [19:57:51] (03PS3) 10Andrew Bogott: Create root passwords for labs instances and store passwords on the puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/303617 (https://phabricator.wikimedia.org/T142216) [19:58:23] 06Operations, 10MediaWiki-Database: periodic spike of MW exceptions "DB connection was already closed or the connection dropped." - https://phabricator.wikimedia.org/T142079#2534791 (10aaron) a:03aaron [19:59:26] (03PS1) 10Aaron Schulz: Use pt-heartbeat and GTIDs on remaning sections [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303619 [20:00:05] gwicke, cscott, arlolra, subbu, bearND, mdholloway, halfak, and Amir1: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160808T2000). Please do the needful. [20:00:20] Nothing to deploy for ORES. 
[20:00:48] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [20:02:27] !log starting parsoid deploy [20:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:04:24] (03PS1) 10Dereckson: Set import sources on en.wikibooks.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303621 (https://phabricator.wikimedia.org/T142333) [20:04:36] hmm ... akosiaris mobrovac parsoid code is not syncing in production ... something turned off / need update since the node v4 / jessie upgrade? [20:04:47] 0/44 minions completed fetch [20:06:48] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [20:07:37] 752 Aug 8 20:03:17 wtp1001 salt-minion[30074]: [ERROR ] The return failed for job 20160808200311543168 global name '__pillar__' is not defined [20:07:40] 753 Aug 8 20:03:17 wtp1001 salt-minion[30074]: [ERROR ] Traceback (most recent call last): [20:08:13] 759 Aug 8 20:03:17 wtp1001 salt-minion[30074]: deployment_config = __pillar__.get('deployment_config') [20:08:17] 760 Aug 8 20:03:17 wtp1001 salt-minion[30074]: NameError: global name '__pillar__' is not defined [20:08:35] i dont know why but that looks related [20:08:46] failure when trying to get deployment config [20:09:51] i don't know what that is about ... 
[20:12:15] (03CR) 10Dereckson: [C: 04-1] IP cap lift for UN Women Editathon in NYC, Aug 12 (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303571 (https://phabricator.wikimedia.org/T142396) (owner: 10MarcoAurelio) [20:12:40] (03CR) 10Aaron Schulz: [C: 032] Use pt-heartbeat and GTIDs on remaning sections [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303619 (owner: 10Aaron Schulz) [20:13:08] (03Merged) 10jenkins-bot: Use pt-heartbeat and GTIDs on remaning sections [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303619 (owner: 10Aaron Schulz) [20:13:38] hmm, __pillar__ is a special salt var that is only defined when executed in the salt context. Never seen that before, though. Did anything change with salt recently? [20:14:04] https://docs.saltstack.com/en/latest/topics/development/dunder_dictionaries.html#pillar [20:14:25] thcipriani, mutante so, parsoid cluster got upgraded to node v4 and jessie .. akosiaris has been on it. Not sure if he turned off / changed something config in that process. [20:14:35] separately, parsoid is still using trebuchet for deploy not scap3. [20:14:45] yup [20:14:49] !log aaron@tin Synchronized wmf-config/db-codfw.php: Use pt-heartbeat and GTIDs on remaning sections (duration: 00m 53s) [20:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:15:59] this is the first deploy attempt since the upgrade. although .. i think the codfw cluster might not have been fully updated yet. [20:16:06] !log aaron@tin Synchronized wmf-config/db-eqiad.php: Use pt-heartbeat and GTIDs on remaning sections (duration: 01m 00s) [20:16:21] maybe i should just abort this deploy today and check back tomorrow? there is nothing critical that needs to go out. 
[20:16:25] (03PS2) 10MarcoAurelio: IP cap lift for UN Women Editathon in NYC, Aug 12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303571 (https://phabricator.wikimedia.org/T142396) [20:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:17:11] (03PS3) 10MarcoAurelio: IP cap lift for UN Women Editathon in NYC, Aug 12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303571 (https://phabricator.wikimedia.org/T142396) [20:18:04] might be worth a ticket—could be salt-minion version on new boxes? Never seen that error before. [20:19:57] ok . i'll just wait for alex to be done and will check in tomorrow. [20:20:47] (03PS7) 10Paladox: Gerrit: Support labs https [puppet] - 10https://gerrit.wikimedia.org/r/303435 (https://phabricator.wikimedia.org/T141803) [20:21:02] (03PS5) 10MarcoAurelio: User rights configuration changes for Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302974 (https://phabricator.wikimedia.org/T142123) [20:21:10] (03PS6) 10MarcoAurelio: User rights configuration changes for Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302974 (https://phabricator.wikimedia.org/T142123) [20:21:16] !log aborting parsoid deploy for today (needs akosiaris to take a look at the deployment setup) [20:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:24:41] AndyRussG: looks like that CN deploy never happened [20:25:19] maybe it was for the afternoon [20:25:31] (03CR) 10Subramanya Sastry: [C: 031] "Yup, we need to separate the service name from the deployment context." 
[puppet] - 10https://gerrit.wikimedia.org/r/303599 (https://phabricator.wikimedia.org/T141464) (owner: 10Mobrovac) [20:26:40] (03PS1) 10Smalyshev: Make Updater proper service [puppet] - 10https://gerrit.wikimedia.org/r/303626 (https://phabricator.wikimedia.org/T139434) [20:27:51] (03CR) 10jenkins-bot: [V: 04-1] Make Updater proper service [puppet] - 10https://gerrit.wikimedia.org/r/303626 (https://phabricator.wikimedia.org/T139434) (owner: 10Smalyshev) [20:27:58] (03Abandoned) 10Yuvipanda: role: Move quarry to use autolayout [puppet] - 10https://gerrit.wikimedia.org/r/255082 (owner: 10Yuvipanda) [20:29:42] (03CR) 10Dzahn: "sad.." [puppet] - 10https://gerrit.wikimedia.org/r/255082 (owner: 10Yuvipanda) [20:29:46] AaronSchulz: corrrect.... sooon :) [20:30:05] we had a fr incident last night... [20:30:55] mutante I'm going to kill the module in a month or so (going to move it to tools) [20:30:59] (nothing too drastic, but required attention...) [20:31:19] alright yuvipanda [20:33:51] 06Operations, 06Release-Engineering-Team, 15User-greg, 07Wikimedia-Incident: Institute a weekly review of all UBN! tasks - https://phabricator.wikimedia.org/T141130#2534854 (10greg) 05Open>03Resolved >>! In T141130#2531402, @Aklapper wrote: >>>! In T141130#2515799, @greg wrote: >> Should we re-instate... 
[20:36:15] (03PS1) 10Dzahn: display.php: set pageLength, 50 rows per page [debs/wikistats] - 10https://gerrit.wikimedia.org/r/303686 [20:36:27] (03PS2) 10Smalyshev: Make Updater proper service [puppet] - 10https://gerrit.wikimedia.org/r/303626 (https://phabricator.wikimedia.org/T116754) [20:37:07] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] [20:37:42] (03CR) 10Dzahn: [C: 032 V: 032] display.php: set pageLength, 50 rows per page [debs/wikistats] - 10https://gerrit.wikimedia.org/r/303686 (owner: 10Dzahn) [20:39:09] There are issues with ElasticSearch [20:39:50] Fatal error: Timeout reached waiting for an available pooled curl connection! in CirrusSearch extension (/srv/mediawiki/php-1.28.0-wmf.13/extensions/CirrusSearch/includes/Elastica/PooledHttp.php on line 66) [20:40:10] 06Operations, 06WMF-Legal, 06WMF-NDA-Requests: ZhouZ needs access to WMF-NDA group - https://phabricator.wikimedia.org/T98722#2534884 (10ZhouZ) So everyone who has a @wikimedia.org account should be on a NDA and can be added to the WMF-NDA group. So going forward, that's all we should need to look for to ad... [20:41:42] Dereckson: based on https://logstash.wikimedia.org/goto/2f8ca338fa165f91ad950d0ebebac94a it looks like two spikes, not unheard of to hit the poolcounter limit at times [20:43:07] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [20:44:17] greg-g: ok [20:47:03] (03PS3) 10Smalyshev: Make Updater proper service [puppet] - 10https://gerrit.wikimedia.org/r/303626 (https://phabricator.wikimedia.org/T116754) [20:49:48] Dereckson: though it did just happen again... [20:50:50] SMalyshev: (I don't see Erik online), anything we should be worried about with: https://logstash.wikimedia.org/goto/0b9eb446d58e032946927ce3a37e405c ? [20:51:20] Erik's on vacation [20:52:27] greg-g: hmm... dunno. does it happen a lot? 
[20:53:03] you can change the search window in logstash too ;) [20:53:05] * greg-g does that [20:53:23] https://logstash.wikimedia.org/goto/9818a47c70edcd9fb2526cfe4210387d doesn't look good [20:53:28] it happened a lot on Aug 6th, and now again [20:53:33] maybe the pool is too small or ES is too slow... if dcausse is still not sleeping, maybe he can check what's up with the ES servers [20:54:21] wow 14 fails, not good [20:54:37] I'd add it in phab. [20:56:48] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0] [20:58:27] gehel, ^ [20:58:58] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [21:00:04] dapatrick and bawolff: Dear anthropoid, the time has come. Please deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160808T2100). [21:00:37] 06Operations, 10Cassandra: xenon.eqiad.wmnet: very high cpu utilization - https://phabricator.wikimedia.org/T141675#2534955 (10Eevans) 05Open>03Resolved a:03Eevans >>! In T141675#2534202, @Dzahn wrote: > Probably something in between. I would say close it now and if it never happens again that is it. But... [21:00:38] MaxSem: looking back... [21:01:09] 06Operations, 10Cassandra: acpi_pad runaway processes on praseodymium - https://phabricator.wikimedia.org/T123924#2534970 (10Eevans) [21:01:48] 06Operations, 10Cassandra: acpi_pad runaway processes on praseodymium - https://phabricator.wikimedia.org/T123924#1941018 (10Eevans) a:03fgiunchedi [21:02:16] 06Operations, 10Traffic, 13Patch-For-Review, 05WMF-deploy-2016-08-09_(1.28.0-wmf.14): Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2534974 (10Krinkle) >>! In T107430#2520799, @Krinkle wrote: > The Commons app for Android (previously by Wikimedia, now community-maintained) a... 
[21:02:28] greg-g: I see a pool of 20 connections in the hhvm config, so I wonder whether it makes sense to bump it? not sure [21:02:49] SMalyshev: not a question for me :) [21:03:04] right. gehel, are you still around? [21:03:22] SMalyshev, bumping pool size in response to a suddenly appearing problem is fixing the symptoms [21:03:23] 06Operations, 10Cassandra: acpi_pad runaway processes on praseodymium - https://phabricator.wikimedia.org/T123924#2534977 (10Eevans) 05Open>03Resolved Let's go ahead and close this for now, and if it ever resurfaces, we can try blacklisting `acpi_pad`. [21:03:24] SMalyshev: connecting... [21:04:22] MaxSem: I know, but since Erik is not here, it may be better to suppress the symptoms until he is. Unless somebody else knows enough to research it of course [21:04:50] maybe dcausse... because I have no idea why Elastic suddenly wants many more connections than before [21:05:37] and I mean thousands of them... shouldn't pool counter gateway this? [21:05:44] * SMalyshev is out of my depth on this [21:06:00] SMalyshev: thanks for looking/responding to ping :) [21:06:36] SMalyshev: ok, I'm here for real (sorry, laptop took time to boot) [21:07:16] gehel: so logstash shows spikes in timeouts for curl https pools [21:07:30] e.g. https://logstash.wikimedia.org/goto/0b9eb446d58e032946927ce3a37e405c [21:07:45] or https://logstash.wikimedia.org/goto/9818a47c70edcd9fb2526cfe4210387d [21:07:49] SMalyshev: we had issues Saturday with elasticsearch slowing down [21:08:23] gehel: so you say maybe it's elastic being slow and it can be ignored? [21:09:05] SMalyshev: nope, I'm saying that we probably have a more systemic issue than I first thought Saturday... [21:09:42] anyway, I guess you probably know more about it than me so you could keep an eye on it and see if we need to make bigger pools or look into ES or maybe counters? 
[21:10:02] but if it is elasticsearch slowing down, increasing pool size on HHVM side will probably make things worse, not better [21:12:15] probably [21:13:34] Friends--I need help checking Phabricator web logs to follow up on a fundraising data breach. Please PM [21:13:53] gehel: could we check if these spikes are mostly from jobrunners0? [21:14:23] dcausse: good idea! [21:16:55] it seems to be mainly API servers [21:18:44] gehel: ok thanks, [21:19:36] and this time elasticsearch itself does not seem to have slowed down [21:20:08] only 2 small surges in search thread pool queue [21:20:18] but no rejection [21:21:57] so it seems to be a different issue than Saturday. Not sure if it is good news or not [21:22:11] elastic1038 apparently... strange [21:24:22] how could we see if we have a different traffic pattern on API? [21:25:05] ftr: I responded to awight in pm [21:25:48] It's awesome! And I'm only paying for the volunteer-level SLA [21:25:52] curl pool exhaustion on HHVM side without overloading elastic makes me think we have something related to traffic. Longer running queries via API? [21:28:26] gehel: probable, but elasticsearch thread pool seemed to have handled the spike so maybe hhvm pool size is too low? [21:29:49] dcausse: yep, could be... it would be good to have metrics on those curl pools to see what the nominal usage is... [21:30:34] yes [21:31:32] dcausse: should we do anything tonight? [21:32:16] I'm not very keen on changing curl pool size like that, when it was working fine for a fairly long time... [21:32:21] gehel: no it's late, and I need to investigate more to really understand what happened [21:32:50] dcausse: ok, agreed, let's dig more into this tomorrow... 
[21:33:04] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.24 [21:34:52] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [21:49:56] jouncebot: next [21:49:56] In 1 hour(s) and 10 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160808T2300) [21:51:40] 06Operations, 10Flow, 10MediaWiki-Redirects, 03Collab-Team-Q1-July-Sep-2016, and 2 others: Flow notification links on mobile point to desktop - https://phabricator.wikimedia.org/T107108#1486874 (10Catrope) Could {T72318} be related to this? [21:53:22] 06Operations, 10MediaWiki-Database: periodic spike of MW exceptions "DB connection was already closed or the connection dropped." - https://phabricator.wikimedia.org/T142079#2535078 (10aaron) Only seems to happen from ChangeNotification jobs... [21:55:14] !log Updated Wikidata's property suggester with data from today's json dump and removed the external identifiers as a workaround for T132839 [21:55:15] T132839: Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839 [21:55:18] sjoerddebruin: ^ [21:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:55:31] \o/ [21:57:53] !log Deployed patch for T130384 to wmf.13 [21:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:05:03] (03CR) 10GWicke: "> The name is drawn from $title, but, as stated in the linked bug, the logging name and statsd prefix need to be changed. The service's na" [puppet] - 10https://gerrit.wikimedia.org/r/303599 (https://phabricator.wikimedia.org/T141464) (owner: 10Mobrovac) [22:13:42] 06Operations, 06Release-Engineering-Team, 15User-greg, 07Wikimedia-Incident: Institute quarterly(?) 
review of incident reports and follow-up - https://phabricator.wikimedia.org/T141287#2535154 (10greg) a:05greg>03None [22:13:48] (03CR) 10Dzahn: [C: 04-1] "let's please try one of these solutions instead: a) get floating IP in "git" project b) get access for you to "staging" project" [puppet] - 10https://gerrit.wikimedia.org/r/303435 (https://phabricator.wikimedia.org/T141803) (owner: 10Paladox) [22:15:06] (03PS1) 10BBlack: openssl (1.0.2h-1~wmf4) jessie-wikimedia; urgency=medium [debs/openssl] - 10https://gerrit.wikimedia.org/r/303700 [22:30:20] 06Operations, 10Mail: Emails dropping from Greenhouse to Alan - https://phabricator.wikimedia.org/T142427#2535211 (10Danny_B) [22:32:13] (03PS1) 10Dzahn: gerrit: delete previously used gerrit SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/303705 (https://phabricator.wikimedia.org/T142131) [22:36:54] PROBLEM - puppet last run on mw2159 is CRITICAL: CRITICAL: puppet fail [22:38:09] (03PS1) 10ArielGlenn: generate and email a few fun dump-related stats each month [puppet] - 10https://gerrit.wikimedia.org/r/303707 (https://phabricator.wikimedia.org/T142435) [22:42:05] (03CR) 10Dzahn: [C: 032] gerrit: delete previously used gerrit SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/303705 (https://phabricator.wikimedia.org/T142131) (owner: 10Dzahn) [22:42:11] (03PS2) 10Dzahn: gerrit: delete previously used gerrit SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/303705 (https://phabricator.wikimedia.org/T142131) [22:48:23] 06Operations, 06Release-Engineering-Team, 10Traffic, 07HTTPS: Retire gerrit.wikimedia.org SSL cert - https://phabricator.wikimedia.org/T142131#2523742 (10Dzahn) deleted cert from public repo. deleted key from private repo. this should be all then [22:49:23] (03PS6) 10ArielGlenn: Make scheduler hupable. [dumps] - 10https://gerrit.wikimedia.org/r/302831 [22:49:39] (03CR) 10jenkins-bot: [V: 04-1] Make scheduler hupable. 
[dumps] - 10https://gerrit.wikimedia.org/r/302831 (owner: 10ArielGlenn) [22:50:31] well, no one can complain about me not shoveling my crap into gerrit at least [22:50:34] broken and all :-P [22:50:38] and now.... bedtime [22:51:12] kalimera apergos [22:51:28] καληνύχτα! [22:52:15] 06Operations, 06Release-Engineering-Team, 10Traffic, 07HTTPS: Retire gerrit.wikimedia.org SSL cert - https://phabricator.wikimedia.org/T142131#2535257 (10Dzahn) 05Open>03Resolved a:03Dzahn /etc/ssl/localcerts and /etc/ssl/private on lead were already cleaned [22:53:51] (03CR) 10Dzahn: "key has also been deleted just now" [puppet] - 10https://gerrit.wikimedia.org/r/303705 (https://phabricator.wikimedia.org/T142131) (owner: 10Dzahn) [23:00:04] RoanKattouw, ostriches, MaxSem, and Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160808T2300). Please do the needful. [23:00:04] Jdlrobson, RoanKattouw, MatmaRex, Dereckson, mafk, and dapatrick: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:21] * MarcoA salutes [23:00:28] hi. [23:00:35] * mafk salutes again - present [23:00:46] time to break teh wikiz [23:01:16] \o [23:01:27] uh, that's a lot of patches today. [23:02:07] I scheduled ahem... 6 /hides [23:02:23] * RoanKattouw is here [23:03:02] who's SWATing? :) [23:03:04] 06Operations, 10hardware-requests: eqiad: (4) spare pool servers for kubernetes - https://phabricator.wikimedia.org/T141624#2535282 (10RobH) p:05Normal>03Low It was noted during today's operations meeting that these systems are part of what is needed for one of the team's stretch goals, but it is by far th... [23:03:14] I'm here. [23:03:22] (03CR) 10BBlack: [C: 032] "Confirmed simulated client behaviors manually and via ssllabs.com, works as expected." 
[debs/openssl] - 10https://gerrit.wikimedia.org/r/303700 (owner: 10BBlack) [23:03:58] (03PS1) 10Yuvipanda: tools: Send logs to multiple servers [puppet] - 10https://gerrit.wikimedia.org/r/303711 [23:04:33] RECOVERY - puppet last run on mw2159 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:04:59] (03CR) 10jenkins-bot: [V: 04-1] tools: Send logs to multiple servers [puppet] - 10https://gerrit.wikimedia.org/r/303711 (owner: 10Yuvipanda) [23:05:41] !log uploaded to carbon jessie-wikimedia: openssl-1.0.2h-1~wmf4 ( https://gerrit.wikimedia.org/r/#/c/303700/ ) [23:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:07:34] I'll do it [23:08:51] Thanks RoanKattouw [23:09:19] dapatrick: Hmm, do you mind if I don't SWAT your OAuth patch but make it ride the train? I'd like to wait for https://gerrit.wikimedia.org/r/#/c/303392 [23:09:51] Also it's a patch with new i18n messages and that's always a bit of a trap [23:10:49] RoanKattouw How soon will 303392 hit? [23:11:14] I'll merge it today so it'll go in this week's train, same as your main patch [23:11:28] (03CR) 10Dzahn: [C: 04-1] "like qchris said, needs consensus on the ticket" [puppet] - 10https://gerrit.wikimedia.org/r/302482 (https://phabricator.wikimedia.org/T91001) (owner: 10Paladox) [23:11:31] So Tuesday for mw.org and Wednesday for metawiki [23:11:31] !log openssl-1.0.2h-1~wmf4 -> caches [23:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:11:41] (Not sure which of those two is the central OAuth wiki) [23:11:51] meta [23:12:14] oauth admins are on meta [23:12:20] (03CR) 10Dzahn: [C: 04-1] "let's get a public IP (like we have in the staging project) and just use LE and have https and be done" [puppet] - 10https://gerrit.wikimedia.org/r/303146 (https://phabricator.wikimedia.org/T141803) (owner: 10Paladox) [23:12:29] OK, so then Wednesday [23:12:43] And RoanKattouw, I have two patches listed. 
Which are you speaking of specifically? [23:12:59] "Send OAuth notifications about app management events" [23:13:11] The one that cherry-picks the entire feature into wmf.13 [23:13:32] (03CR) 10Catrope: [C: 032] Promote language switcher to top of page in Russian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302999 (https://phabricator.wikimedia.org/T138961) (owner: 10Jdlrobson) [23:13:51] Sorry for not specifying, I hadn't noticed the second patch at first [23:14:11] (03CR) 10Dzahn: "how did you get around this now on the test instance?" [puppet] - 10https://gerrit.wikimedia.org/r/303355 (owner: 10Paladox) [23:14:21] Okay, will what you're proposing ensure that they are deployed in concert? [23:14:28] *Together [23:14:46] Hmm [23:14:59] We could deploy the config patch now, and that will be a no-op until the code rolls out, right? [23:15:12] Yes. [23:15:16] Is that usual practice? [23:15:34] Not quite, usually we dark-deploy first, then enable [23:15:52] What would happen if we rolled out the code first and the config second? Would things behave sensibly in the interim? [23:16:09] (03PS2) 10Catrope: Promote language switcher to top of page in Russian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302999 (https://phabricator.wikimedia.org/T138961) (owner: 10Jdlrobson) [23:16:12] (03CR) 10Catrope: [V: 032] Promote language switcher to top of page in Russian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302999 (https://phabricator.wikimedia.org/T138961) (owner: 10Jdlrobson) [23:16:21] (03CR) 10Catrope: [C: 032] Promote language switcher to top of page in Russian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302999 (https://phabricator.wikimedia.org/T138961) (owner: 10Jdlrobson) [23:16:28] Yes. It would also be a no-op. 
[23:16:40] OK, then let's do it that way [23:16:48] (03Merged) 10jenkins-bot: Promote language switcher to top of page in Russian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/302999 (https://phabricator.wikimedia.org/T138961) (owner: 10Jdlrobson) [23:17:00] Okay, so. [23:17:04] Which one are you deploying today? [23:17:07] i.e. move the config change to Wednesday's SWAT, and drop the wmf.13 patch [23:17:10] (03CR) 10Paladox: "use https" [puppet] - 10https://gerrit.wikimedia.org/r/303355 (owner: 10Paladox) [23:17:15] owner notifications are not dependent on the config change [23:17:20] Oh? [23:17:27] What is dependent on the config change then? Just admin notifs? [23:17:29] that part is not behind any feature flag [23:17:30] yes [23:17:34] Alright [23:17:44] I think we should just deploy the config patch today then [23:17:53] And have both features kick in at the same time with Wednesday's train [23:18:00] (03CR) 10Dzahn: [C: 04-1] "pretty sure the db password is in labs/private repo. and it should just be set there or in hiera" [puppet] - 10https://gerrit.wikimedia.org/r/303171 (owner: 10Paladox) [23:18:27] (03CR) 10Paladox: "oh" [puppet] - 10https://gerrit.wikimedia.org/r/303171 (owner: 10Paladox) [23:18:28] Okay, hang on. RoanKattouw, can you re-explain why you are not deploying both today? 
[23:18:48] Because in order to deploy it today, the whole feature needs to be backported to the wmf13 branch [23:18:56] And deploying backports that include i18n is a pain [23:19:28] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Promote language switcher to top of page on ruwiki (T138961) (duration: 00m 59s) [23:19:29] T138961: Deploy mobile language button placement improvement to a larger-medium wiki on Tuesday week two of sprint - https://phabricator.wikimedia.org/T138961 [23:19:31] I'll do it if I have to or if there's a reason you want this deployed ASAP, but if waiting till Wednesday is not a big deal then I'd prefer that [23:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:19:55] jdlrobson: Please verify ---^^ [23:20:18] Dereckson: You around? You have a patch listed for SWAT [23:20:29] RoanKattouw: on mw1099.eqiad.wmnet ? [23:20:36] (03Abandoned) 10Paladox: Gerrit: allows us to choose our auth type since production ldap does not work in labs [puppet] - 10https://gerrit.wikimedia.org/r/303355 (owner: 10Paladox) [23:20:37] jdlrobson: Crap, sorry, no everywhere [23:20:47] I forgot about the mw1099 thing, this is my first SWAT since that change [23:20:48] * RoanKattouw reads docs [23:21:32] hi RoanKattouw [23:21:35] (03PS4) 10Dpatrick: Enable notifications for OAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303333 (https://phabricator.wikimedia.org/T61772) [23:21:42] (03Abandoned) 10Paladox: Gerrit: Allow users to customise there db password [puppet] - 10https://gerrit.wikimedia.org/r/303171 (owner: 10Paladox) [23:21:51] (03CR) 10Catrope: [C: 032] Enable notifications for OAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303333 (https://phabricator.wikimedia.org/T61772) (owner: 10Dpatrick) [23:22:15] (03Merged) 10jenkins-bot: Enable notifications for OAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303333 (https://phabricator.wikimedia.org/T61772) (owner: 
10Dpatrick) [23:22:54] RoanKattouw Okay. I understand that. Wednesday is fine. Thank you for explaining. [23:23:00] RoanKattouw: doesn't seem to have had the desired effect.. [23:23:11] debugging [23:23:18] (03CR) 10Dzahn: [C: 04-1] "this would still have the SSL settings in Apache config, causing errors. and i think it's a duplicate of another change that already does " [puppet] - 10https://gerrit.wikimedia.org/r/302852 (owner: 10Paladox) [23:23:28] Shoot RoanKattouw [23:23:33] there's a typo in the change :/ [23:23:36] hahaha [23:23:39] OK, submit a followup [23:23:42] Should be wgMinervaUsePageActionBarV2 not wgMFMinervaUsePageActionBarV2 [23:23:48] (03Abandoned) 10Paladox: Gerrit: Make SSL optional for proxy setup [puppet] - 10https://gerrit.wikimedia.org/r/302852 (owner: 10Paladox) [23:23:51] I'll proceed with dapatrick's change, then Dereckson's, then your typo fix [23:25:02] (03PS1) 10Jdlrobson: Fix typo for Russian language switcher [mediawiki-config] - 10https://gerrit.wikimedia.org/r/303713 (https://phabricator.wikimedia.org/T129505) [23:25:09] ^ RoanKattouw that should do it [23:28:04] hah! [23:28:18] Stared at dapatrick's non-working config change for minutes and found a bug in it too [23:28:37] * RoanKattouw fixes [23:29:47] (03CR) 10Dzahn: "20after4 do you wanna give us a "go" for this?" 
[puppet] - https://gerrit.wikimedia.org/r/301863 (owner: Chad)
[23:30:58] (PS1) Catrope: Follow-up 398cb5368: add missing global $wgOAuthGroupsToNotify [mediawiki-config] - https://gerrit.wikimedia.org/r/303715 (https://phabricator.wikimedia.org/T61772)
[23:31:08] (CR) Catrope: [C: 2] Follow-up 398cb5368: add missing global $wgOAuthGroupsToNotify [mediawiki-config] - https://gerrit.wikimedia.org/r/303715 (https://phabricator.wikimedia.org/T61772) (owner: Catrope)
[23:31:22] (PS2) Yuvipanda: tools: Send logs to multiple servers [puppet] - https://gerrit.wikimedia.org/r/303711
[23:31:35] (Merged) jenkins-bot: Follow-up 398cb5368: add missing global $wgOAuthGroupsToNotify [mediawiki-config] - https://gerrit.wikimedia.org/r/303715 (https://phabricator.wikimedia.org/T61772) (owner: Catrope)
[23:33:42] (CR) Catrope: [C: 2] Set import sources on en.wikibooks.org [mediawiki-config] - https://gerrit.wikimedia.org/r/303621 (https://phabricator.wikimedia.org/T142333) (owner: Dereckson)
[23:33:49] (PS2) Catrope: Set import sources on en.wikibooks.org [mediawiki-config] - https://gerrit.wikimedia.org/r/303621 (https://phabricator.wikimedia.org/T142333) (owner: Dereckson)
[23:33:52] (CR) Catrope: [C: 2] Set import sources on en.wikibooks.org [mediawiki-config] - https://gerrit.wikimedia.org/r/303621 (https://phabricator.wikimedia.org/T142333) (owner: Dereckson)
[23:34:17] !log catrope@tin Synchronized wmf-config/CommonSettings.php: Enable OAuth notifications in config (no-op until Wednesday) (T61172) (T62528) (duration: 00m 50s)
[23:34:19] (Merged) jenkins-bot: Set import sources on en.wikibooks.org [mediawiki-config] - https://gerrit.wikimedia.org/r/303621 (https://phabricator.wikimedia.org/T142333) (owner: Dereckson)
[23:34:19] T61172: Search suggestions cause performance hit on slower machines - https://phabricator.wikimedia.org/T61172
[23:34:20] T62528: Notify owners when an OAuth app changes state -
https://phabricator.wikimedia.org/T62528
[23:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:34:54] dapatrick: OK, your config patch is deployed
[23:35:11] (PS3) Yuvipanda: tools: Send logs to multiple servers [puppet] - https://gerrit.wikimedia.org/r/303711
[23:35:16] (CR) Yuvipanda: [C: 2 V: 2] tools: Send logs to multiple servers [puppet] - https://gerrit.wikimedia.org/r/303711 (owner: Yuvipanda)
[23:35:19] Dereckson: Your import sources patch is live on mw1099, please test
[23:35:47] (PS1) Yuvipanda: tools: Don't send per-tool http version stats [puppet] - https://gerrit.wikimedia.org/r/303716
[23:38:42] RoanKattouw: en.wikibooks doesn't throw me an error on Special:Import, can't test further, but looks good
[23:38:56] I'll ask in the task original requester to check sources are well there
[23:38:56] Dereckson: I can help testing further if required
[23:39:09] RoanKattouw Okay, thanks!
[23:39:16] mafk: https://en.wikibooks.org/wiki/Special:Import commons and oldwikisource must now be available
[23:40:32] Dereckson: comons and OldWikisource are present in the dropdown for Special:Import
[23:40:37] *commons
[23:41:23] so I guess it'll work
[23:41:24] RoanKattouw: so tested, works fine
[23:41:26] thanks mafk
[23:41:33] :D
[23:41:36] RoanKattouw: https://gerrit.wikimedia.org/r/303713 on your radar? I just need to rush to bathroom. brb
[23:41:39] You're welcome.
[23:41:53] jdlrobson: Yes, you're next
[23:41:57] Dereckson: Thanks, will sync next
[23:42:33] !log catrope@tin Synchronized php-1.28.0-wmf.13/extensions/Echo/modules/nojs/mw.echo.badge.monobook.less: Hack around IE bug breaking badge alignment in Monobook (T142053) (duration: 00m 50s)
[23:42:35] T142053: Notification badges misaligned in Monobook in IE11 and below - https://phabricator.wikimedia.org/T142053
[23:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:43:55] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Set import sources on enwikibooks (T142333) (duration: 00m 49s)
[23:43:56] T142333: Add additional (commons/wikisource) transwiki import sites on en.wikibooks.org - https://phabricator.wikimedia.org/T142333
[23:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:44:28] (CR) Catrope: [C: 2] Fix typo for Russian language switcher [mediawiki-config] - https://gerrit.wikimedia.org/r/303713 (https://phabricator.wikimedia.org/T129505) (owner: Jdlrobson)
[23:45:09] (PS2) Catrope: Fix typo for Russian language switcher [mediawiki-config] - https://gerrit.wikimedia.org/r/303713 (https://phabricator.wikimedia.org/T129505) (owner: Jdlrobson)
[23:45:13] (CR) Catrope: [C: 2] Fix typo for Russian language switcher [mediawiki-config] - https://gerrit.wikimedia.org/r/303713 (https://phabricator.wikimedia.org/T129505) (owner: Jdlrobson)
[23:45:44] (Merged) jenkins-bot: Fix typo for Russian language switcher [mediawiki-config] - https://gerrit.wikimedia.org/r/303713 (https://phabricator.wikimedia.org/T129505) (owner: Jdlrobson)
[23:46:25] jdlrobson: OK, your Minerva/MF typo fix is on mw1099, please test
[23:47:25] (CR) Dzahn: [C: 2] Bundle jquery 1.11.3 [software/dbtree] - https://gerrit.wikimedia.org/r/239568 (https://phabricator.wikimedia.org/T96499) (owner: Reedy)
[23:47:35] (CR) Dzahn: [V: 2] Bundle jquery 1.11.3
[software/dbtree] - https://gerrit.wikimedia.org/r/239568 (https://phabricator.wikimedia.org/T96499) (owner: Reedy)
[23:47:36] MatmaRex: Your SpecialNewFiles patch is also on mw1099, please test
[23:48:19] mafk: Sorry for the delay, I'll do all of yours in one go once jdlrobson's patch is done
[23:48:38] RoanKattouw: ok, I can't sleep nonetheless
[23:48:58] doing
[23:49:20] RoanKattouw: lgtm
[23:49:52] (PS9) BryanDavis: [WIP] Provision Striker via scap3 [puppet] - https://gerrit.wikimedia.org/r/301505 (https://phabricator.wikimedia.org/T141014)
[23:50:52] !log catrope@tin Synchronized php-1.28.0-wmf.13/includes/specials/SpecialNewimages.php: Restore the newimagestext message (T142191) (duration: 00m 47s)
[23:50:53] T142191: MediaWiki:Newimagestext missing at commons - https://phabricator.wikimedia.org/T142191
[23:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:51:20] RoanKattouw: confirmed working for logged in users.. but not anons yet so i'm hoping that's a cache issue. please bear with me.
[23:51:37] jdlrobson: Can I go to all servers in the meantime?
[23:51:48] oh!
do that
[23:51:53] i thought you'd done it to all again
[23:51:57] No, sorry
[23:51:57] that makes more sense
[23:52:02] my headers do not run in incognito mode
[23:52:31] RoanKattouw: yep it's working
[23:52:34] you can release it everywhere :)
[23:52:35] OK, cool
[23:52:38] Doing that
[23:52:52] (PS7) Catrope: User rights configuration changes for Meta-Wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/302974 (https://phabricator.wikimedia.org/T142123) (owner: MarcoAurelio)
[23:53:08] (CR) Krinkle: [C: 1] Rename 'autoreview' to 'autopatrolled' on mw.org [mediawiki-config] - https://gerrit.wikimedia.org/r/302987 (owner: MarcoAurelio)
[23:53:15] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Fix typo in Russian language switcher config (T129505) (duration: 00m 48s)
[23:53:16] T129505: Ship mobile web readily available language button placement affordance on Wednesday immediately following Tuesday single-language deployment - https://phabricator.wikimedia.org/T129505
[23:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:53:29] !log git pull on dbtree/db1152, there were previous changes that did not get deployed
[23:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:53:35] (CR) Catrope: [C: 2] User rights configuration changes for Meta-Wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/302974 (https://phabricator.wikimedia.org/T142123) (owner: MarcoAurelio)
[23:54:00] (Merged) jenkins-bot: User rights configuration changes for Meta-Wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/302974 (https://phabricator.wikimedia.org/T142123) (owner: MarcoAurelio)
[23:54:11] (PS4) Catrope: Cleaning a bit frwiktionary's botadmin rights [mediawiki-config] - https://gerrit.wikimedia.org/r/302990 (owner: MarcoAurelio)
[23:54:17] (CR) Catrope: [C: 2] Cleaning a bit frwiktionary's botadmin rights
[mediawiki-config] - https://gerrit.wikimedia.org/r/302990 (owner: MarcoAurelio)
[23:54:23] Operations, Ops-Access-Requests, Fundraising-Backlog: Access request: AWight access to iridium - https://phabricator.wikimedia.org/T142446#2535387 (awight)
[23:54:42] (Merged) jenkins-bot: Cleaning a bit frwiktionary's botadmin rights [mediawiki-config] - https://gerrit.wikimedia.org/r/302990 (owner: MarcoAurelio)
[23:55:20] (PS3) Catrope: Remove 'centralauth-usermerge' from stewards [mediawiki-config] - https://gerrit.wikimedia.org/r/303150 (owner: MarcoAurelio)
[23:55:22] (CR) Catrope: [C: 2] Remove 'centralauth-usermerge' from stewards [mediawiki-config] - https://gerrit.wikimedia.org/r/303150 (owner: MarcoAurelio)
[23:55:49] (Merged) jenkins-bot: Remove 'centralauth-usermerge' from stewards [mediawiki-config] - https://gerrit.wikimedia.org/r/303150 (owner: MarcoAurelio)
[23:55:53] (PS1) Dzahn: dbtree: ensure puppet is deploying changes [puppet] - https://gerrit.wikimedia.org/r/303719
[23:56:26] (PS4) Catrope: Cleanup of IP throttles [mediawiki-config] - https://gerrit.wikimedia.org/r/303312 (owner: MarcoAurelio)
[23:56:28] (CR) Dzahn: "deployed on mw1152. puppet has git clone but no "ensure latest". there were other older changes that had been merged but were not deployed" [software/dbtree] - https://gerrit.wikimedia.org/r/239568 (https://phabricator.wikimedia.org/T96499) (owner: Reedy)
[23:56:30] (CR) Catrope: [C: 2] Cleanup of IP throttles [mediawiki-config] - https://gerrit.wikimedia.org/r/303312 (owner: MarcoAurelio)
[23:56:55] (Merged) jenkins-bot: Cleanup of IP throttles [mediawiki-config] - https://gerrit.wikimedia.org/r/303312 (owner: MarcoAurelio)
[23:57:25] RoanKattouw: don't merge the mediawiki change yet please
[23:57:53] it requires a script running and so I'd like to do it the last one
[23:58:24] Which one?
[23:58:32] https://gerrit.wikimedia.org/r/#/c/302987/ ?
[23:58:32] autoreview -> autopatrolled
[23:58:40] OK, I'll leave that one out at first
[23:59:18] once I can test all of those, we can move to that one
[23:59:20] (PS4) Catrope: IP cap lift for UN Women Editathon in NYC, Aug 12 [mediawiki-config] - https://gerrit.wikimedia.org/r/303571 (https://phabricator.wikimedia.org/T142396) (owner: MarcoAurelio)
[23:59:59] (CR) Catrope: [C: 2] IP cap lift for UN Women Editathon in NYC, Aug 12 [mediawiki-config] - https://gerrit.wikimedia.org/r/303571 (https://phabricator.wikimedia.org/T142396) (owner: MarcoAurelio)